Video-Bench: Human Preference Aligned Video Generation Benchmark

Video-Bench (also released as the HAbench package) is a benchmark that systematically leverages multimodal large language models (MLLMs) to assess generated videos across all dimensions relevant to video generation. By incorporating few-shot scoring and chain-of-query techniques, Video-Bench provides a structured, scalable approach to evaluating generated videos.

Multi-Modal Foundation-Model Video-Understanding Video-Generation Video-Recommendation

⭐Overview | 📒Leaderboard | 🤗HumanAlignment | 🛠️Installation | 🗃️Preparation | ⚡Instructions | 🚀Usage | 📭Citation | 📝Literature

Overview

Leaderboard

| Model | Imaging Quality | Aesthetic Quality | Temporal Consist. | Motion Effects | Avg Rank (Quality) | Video-text Consist. | Object-class Consist. | Color Consist. | Action Consist. | Scene Consist. | Avg Rank (Alignment) | Overall Avg Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cogvideox [57] | 3.87 | 3.84 | 4.14 | 3.55 | 3.00 | **4.62** | 2.81 | **2.92** | **2.81** | **2.93** | **1.60** | 2.22 |
| Gen3 [42] | **4.66** | **4.44** | **4.74** | **3.99** | **1.00** | 4.38 | 2.81 | 2.87 | 2.59 | **2.93** | 2.40 | **1.78** |
| Kling [24] | 4.26 | 3.82 | 4.38 | 3.11 | 2.75 | 4.07 | 2.70 | 2.81 | 2.50 | 2.82 | 4.60 | 3.78 |
| VideoCrafter2 [5] | 4.08 | 3.85 | 3.69 | 2.81 | 3.75 | 4.18 | **2.85** | 2.90 | 2.53 | 2.78 | 2.80 | 3.22 |
| LaVie [52] | 3.00 | 2.94 | 3.00 | 2.43 | 7.00 | 3.71 | 2.82 | 2.81 | 2.45 | 2.63 | 5.00 | 5.88 |
| PiKa-Beta [38] | 3.78 | 3.76 | 3.40 | 2.59 | 5.50 | 3.78 | 2.51 | 2.52 | 2.25 | 2.60 | 6.80 | 6.22 |
| Show-1 [60] | 3.30 | 3.28 | 3.90 | 2.90 | 5.00 | 4.21 | 2.82 | 2.79 | 2.53 | 2.72 | 3.80 | 4.33 |

Notes:

  • Higher scores indicate better performance; for the Avg Rank columns, lower values are better.
  • The best result in each column is highlighted in bold.

HumanAlignment

| Metrics | Benchmark | Imaging Quality | Aesthetic Quality | Temporal Consist. | Motion Effects | Video-text Consist. | Action Consist. | Object-class Consist. | Color Consist. | Scene Consist. |
|---|---|---|---|---|---|---|---|---|---|---|
| MUSIQ [21] | VBench [19] | 0.363 | - | - | - | - | - | - | - | - |
| LAION | VBench [19] | - | 0.446 | - | - | - | - | - | - | - |
| CLIP [40] | VBench [19] | - | - | 0.260 | - | - | - | - | - | - |
| RAFT [48] | VBench [19] | - | - | - | 0.329 | - | - | - | - | - |
| Amt [28] | VBench [19] | - | - | - | 0.329 | - | - | - | - | - |
| ViCLIP [53] | VBench [19] | - | - | - | - | 0.445 | - | - | - | - |
| UMT [27] | VBench [19] | - | - | - | - | - | 0.411 | - | - | - |
| GRiT [54] | VBench [19] | - | - | - | - | - | - | 0.469 | 0.545 | - |
| Tag2Text [16] | VBench [19] | - | - | - | - | - | - | - | - | 0.422 |
| ComBench [46] | ComBench [46] | - | - | - | - | 0.633 | 0.633 | 0.611 | 0.696 | 0.631 |
| Video-Bench | Video-Bench | **0.733** | **0.702** | **0.402** | **0.514** | **0.732** | **0.718** | **0.735** | **0.750** | **0.733** |

Notes:

  • Higher scores indicate better performance.
  • The best score in each dimension is highlighted in bold.

Installation

Installation Requirements

  • Python >= 3.8
  • OpenAI API access: update your OpenAI API keys in config.json (a minimal loading sketch follows this block):
    {
        "GPT4o_API_KEY": "your-api-key",
        "GPT4o_BASE_URL": "your-base-url",
        "GPT4o_mini_API_KEY": "your-mini-api-key",
        "GPT4o_mini_BASE_URL": "your-mini-base-url"
    }
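
For reference, here is a minimal sketch of how these keys might be consumed with the official openai Python client; the helper name load_clients is illustrative and not part of Video-Bench.

import json
from openai import OpenAI  # assumes the official `openai` package (v1+) is installed

def load_clients(config_path="config.json"):
    # Illustrative helper: build GPT-4o and GPT-4o-mini clients from config.json.
    with open(config_path) as f:
        cfg = json.load(f)
    gpt4o = OpenAI(api_key=cfg["GPT4o_API_KEY"], base_url=cfg["GPT4o_BASE_URL"])
    gpt4o_mini = OpenAI(api_key=cfg["GPT4o_mini_API_KEY"], base_url=cfg["GPT4o_mini_BASE_URL"])
    return gpt4o, gpt4o_mini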

Pip Installation

  • Install with pip

    pip install HAbench
  • Install with git clone

    git clone https://github.com/yourusername/Video-Bench.git
    cd Video-Bench
    pip install -r requirements.txt

Download From Huggingface

wget https://huggingface.co/Video-Bench/Video-Bench/resolve/main/pytorch_model.bin -O ./pytorch_model.bin

or

curl -L https://huggingface.co/Video-Bench/Video-Bench/resolve/main/pytorch_model.bin -o ./pytorch_model.bin
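
Alternatively, a sketch using huggingface_hub pulls the whole repository in one call; the repo id Video-Bench/Video-Bench comes from the URLs above, and the target directory is illustrative.

from huggingface_hub import snapshot_download

# Downloads every file in the Video-Bench repository to a local folder.
snapshot_download(repo_id="Video-Bench/Video-Bench", local_dir="./Video-Bench-weights")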

Preparation

Please organize your data according to the following data structure:

# Data Structure
/Video-Bench/data/
├── color/                           # 'color' dimension videos
│   ├── cogvideox5b/
│   │   ├── A red bird_0.mp4
│   │   ├── A red bird_1.mp4
│   │   └── ...
│   ├── lavie/
│   │   ├── A red bird_0.mp4
│   │   ├── A red bird_1.mp4
│   │   └── ...
│   ├── pika/
│   │   └── ...
│   └── ...
│
├── object_class/                    # 'object_class' dimension videos
│   ├── cogvideox5b/
│   │   ├── A train_0.mp4
│   │   ├── A train_1.mp4
│   │   └── ...
│   ├── lavie/
│   │   └── ...
│   └── ...
│
├── scene/                           # 'scene' dimension videos
│   ├── cogvideox5b/
│   │   ├── Botanical garden_0.mp4
│   │   ├── Botanical garden_1.mp4
│   │   └── ...
│   └── ...
│
├── action/                          # 'action' 'temporal_consistency' 'motion_effects' dimension videos
│   ├── cogvideox5b/
│   │   ├── A person is marching_0.mp4
│   │   ├── A person is marching_1.mp4
│   │   └── ...
│   └── ...
│
└── video-text consistency/             # 'video-text consistency' 'imaging_quality' 'aesthetic_quality' dimension videos
    ├── cogvideox5b/
    │   ├── Close up of grapes on a rotating table._0.mp4
    │   └── ...
    ├── lavie/
    │   └── ...
    ├── pika/
    │   └── ...
    └── ...
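
As a sanity check before running evaluations, the sketch below walks the layout above and reports missing folders and video counts; the dimension folder names are taken from the tree, while the model list and function name are illustrative.

from pathlib import Path

# Dimension folders expected under ./data/, taken from the tree above.
EXPECTED_DIMENSIONS = [
    "color", "object_class", "scene", "action", "video-text consistency",
]

def check_layout(data_root="./data", models=("cogvideox5b", "lavie", "pika")):
    # Illustrative check: report missing dimension/model folders and count videos.
    root = Path(data_root)
    for dim in EXPECTED_DIMENSIONS:
        for model in models:
            folder = root / dim / model
            if not folder.is_dir():
                print(f"missing: {folder}")
            else:
                print(f"{folder}: {len(list(folder.glob('*.mp4')))} videos")

check_layout()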

Instructions

Video-Bench provides comprehensive evaluation across multiple dimensions of video generation quality. Each dimension is assessed using a specific scoring scale to ensure accurate and meaningful evaluation.

Evaluation Dimensions

| Category | Dimension | Description | Scale | Module |
|---|---|---|---|---|
| Static Quality | Imaging Quality | Evaluates technical aspects including clarity and sharpness | 1-5 | staticquality.py |
| Static Quality | Aesthetic Quality | Assesses visual appeal and artistic composition | 1-5 | staticquality.py |
| Dynamic Quality | Temporal Consistency | Measures frame-to-frame coherence and smoothness | 1-5 | dynamicquality.py |
| Dynamic Quality | Motion Effects | Evaluates quality of movement and dynamics | 1-5 | dynamicquality.py |
| Video-Text Alignment | Video-Text Consistency | Overall alignment with the text prompt | 1-5 | VideoTextAlignment.py |
| Video-Text Alignment | Object-Class Consistency | Accuracy of object representation | 1-3 | VideoTextAlignment.py |
| Video-Text Alignment | Color Consistency | Matching of colors with the text prompt | 1-3 | VideoTextAlignment.py |
| Video-Text Alignment | Action Consistency | Accuracy of depicted actions | 1-3 | VideoTextAlignment.py |
| Video-Text Alignment | Scene Consistency | Correctness of the scene environment | 1-3 | VideoTextAlignment.py |
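
The same grouping expressed as a small lookup table, which can be handy when scripting evaluations; this is a sketch, with keys matching the --dimension values listed in the Usage section below and scale/module pairs following the table above.

# Maps each --dimension value to its scoring scale and evaluation module.
DIMENSION_INFO = {
    "imaging_quality":        {"scale": (1, 5), "module": "staticquality.py"},
    "aesthetic_quality":      {"scale": (1, 5), "module": "staticquality.py"},
    "temporal_consistency":   {"scale": (1, 5), "module": "dynamicquality.py"},
    "motion_effects":         {"scale": (1, 5), "module": "dynamicquality.py"},
    "video-text consistency": {"scale": (1, 5), "module": "VideoTextAlignment.py"},
    "object_class":           {"scale": (1, 3), "module": "VideoTextAlignment.py"},
    "color":                  {"scale": (1, 3), "module": "VideoTextAlignment.py"},
    "action":                 {"scale": (1, 3), "module": "VideoTextAlignment.py"},
    "scene":                  {"scale": (1, 3), "module": "VideoTextAlignment.py"},
}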

Usage

Video-Bench supports two modes: standard mode and custom input mode. Video-Bench only supports assessment of the following dimensions: 'aesthetic_quality', 'imaging_quality', 'temporal_consistency', 'motion_effects', 'color', 'object_class', 'scene', 'action', 'video-text consistency'.

Standard Mode

This evaluation mode assesses videos generated by various video generation models using the prompt suite defined in our HAbench_full.json. Video-Bench supports evaluating videos produced by the seven preselected generation models as well as any additional models specified by the user.

To evaluate videos, simply specify the models to be tested via the --models parameter. For example, to evaluate videos under modelname1 and modelname2, run the following commands with --models modelname1 modelname2:

python evaluate.py \
 --dimension $DIMENSION \
 --videos_path ./data/ \
 --config_path ./config.json \
 --models modelname1 modelname2

or

HAbench \
 --dimension $DIMENSION \
 --videos_path ./data/ \
 --config_path ./config.json \
 --models modelname1 modelname2
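
To sweep every dimension in one run, a small driver along these lines could wrap the command above; this is a sketch, with the dimension names taken from the list at the top of this section and placeholder model names.

import subprocess

DIMENSIONS = [
    "imaging_quality", "aesthetic_quality",
    "temporal_consistency", "motion_effects",
    "video-text consistency", "object_class", "color", "action", "scene",
]
MODELS = ["modelname1", "modelname2"]

for dim in DIMENSIONS:
    # Equivalent to the evaluate.py invocation shown above, once per dimension.
    subprocess.run(
        ["python", "evaluate.py",
         "--dimension", dim,
         "--videos_path", "./data/",
         "--config_path", "./config.json",
         "--models", *MODELS],
        check=True,
    )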

Custom Mode

This mode allows users to evaluate videos generated from prompts that are not included in the Video-Bench prompt suite.

You can provide prompts in two ways:

  1. Single prompt: Use --prompt "your customized prompt" to specify a single prompt.
  2. Multiple prompts: Create a JSON file containing your prompts and use --prompt_file $json_path to load them (see the writer sketch after the example). The JSON file can follow this format:
{
    "0": "prompt1",
    "1": "prompt2",
    ...
}
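
A minimal way to produce such a file is sketched below; the filename is illustrative, and the keys are written as strings because JSON does not allow integer keys.

import json

# Illustrative prompt file for --prompt_file; replace the values with your own prompts.
prompts = {"0": "prompt1", "1": "prompt2"}
with open("custom_prompts.json", "w") as f:
    json.dump(prompts, f, indent=4)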

For static quality dimensions, set mode=custom_static:

python evaluate.py \
 --dimension $DIMENSION \
 --videos_path ./data/ \
 --mode custom_static \
 --config_path ./config.json \
 --models modelname1 modelname2

or

HAbench \
 --dimension $DIMENSION \
 --videos_path ./data/ \
 --mode custom_static \
 --config_path ./config.json \
 --models modelname1 modelname2

For video-text alignment or dynamic quality dimensions, set mode=custom_nonstatic:

python evaluate.py \
 --dimension $DIMENSION \
 --videos_path ./data/ \
 --mode custom_nonstatic \
 --config_path ./config.json \
 --models modelname1 modelname2

or

HAbench \
 --dimension $DIMENSION \
 --videos_path ./data/ \
 --mode custom_nonstatic \
 --config_path ./config.json \
 --models modelname1 modelname2

Citation

If you use our dataset or code, or find Video-Bench useful, please cite our paper:

@article{ni2023content,
  title={Video-Bench: Human Preference Aligned Video Generation Benchmark},
  author={Han, Hui and Li, Siyuan and Chen, Jiaqi and Yuan, Yiwen and Wu, Yuling and Leong, Chak Tou and Du, Hanwen and Fu, Junchen and Li, Youhua and Zhang, Jie and Zhang, Chi and Li, Li-jia and Ni, Yongxin},
  journal={arXiv preprint arXiv:xxx},
  year={2024}
}

Literature

Video Generation Evaluation Methods

| Model | Paper | Resource | Conference/Journal/Preprint | Year | Features |
|---|---|---|---|---|---|
| Video-Bench | Link | GitHub | arXiv | 2024 | Video-Bench leverages Multimodal Large Language Models (MLLMs) to provide highly accurate evaluations that closely align with human preferences across multiple dimensions of video quality. It incorporates few-shot scoring and chain-of-query techniques, allowing for scalable and structured assessments. Video-Bench supports cross-modal consistency and offers more objective insights when diverging from human judgments, making it a more reliable and comprehensive tool for video generation evaluation. |
| FETV | Link | GitHub | NeurIPS | 2023 | FETV is multi-aspect, categorizing the prompts based on three orthogonal aspects: the major content, the attributes to control, and the prompt complexity. |
| FVD | Link | GitHub | ICLR Workshop | 2023 | A novel metric for generative video models that extends the Fréchet Inception Distance (FID) to account for not only visual quality but also temporal coherence and diversity, addressing the lack of qualitative metrics in current video generation evaluation. |
| GAIA | Link | GitHub | arXiv | 2024 | By adopting a causal reasoning perspective, it evaluates popular text-to-video (T2V) models on their ability to generate visually rational actions and benchmarks existing automatic evaluation methods, revealing a significant gap between current models and human perception patterns. |
| SAVGBench | Link | Links | arXiv | 2024 | This work introduces a benchmark for Spatially Aligned Audio-Video Generation (SAVG), focusing on spatial alignment between audio and visuals. Key innovations include a new dataset, a baseline diffusion model for stereo audio-visual learning, and a spatial alignment metric, revealing significant gaps in quality and alignment between the model and ground truth. |
| VBench++ | Link | GitHub | arXiv | 2024 | VBench++ is a comprehensive benchmark for video generation, featuring 16 evaluation dimensions, human alignment validation, and support for both text-to-video and image-to-video models, assessing both technical quality and model trustworthiness. |
| T2V-CompBench | Link | GitHub | arXiv | 2024 | T2V-CompBench evaluates diverse aspects such as attribute binding, spatial relationships, motion, and object interactions. It introduces tailored evaluation metrics based on MLLM, detection, and tracking, validated by human evaluation. |
| VideoScore | Link | Website | EMNLP | 2024 | It introduces a dataset with human-provided multi-aspect scores for 37.6K videos from 11 generative models. VideoScore is trained on this to provide automatic video quality assessment, achieving a 77.1 Spearman correlation with human ratings. |
| ChronoMagic-Bench | Link | Website | NeurIPS | 2024 | ChronoMagic-Bench evaluates T2V models on their ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence, using 1,649 prompts across four categories. Its advantages include the introduction of new metrics (MTScore and CHScore) and a large-scale dataset (ChronoMagic-Pro) for comprehensive, high-quality evaluation. |
| T2VSafetyBench | Link | GitHub | NeurIPS | 2024 | T2VSafetyBench introduces a benchmark for assessing the safety of text-to-video models, focusing on 12 critical aspects of video generation safety, including temporal risks. It addresses the unique safety concerns of video generation, providing a malicious prompt dataset and offering valuable insights into the trade-off between usability and safety. |
| T2VBench | Link | Website | CVPR | 2024 | T2VBench focuses on 16 critical temporal dimensions such as camera transitions and event sequences for evaluating text-to-video models, consisting of a hierarchical framework with over 1,600 prompts and 5,000 videos. |
| EvalCrafter | Link | Website | CVPR | 2024 | EvalCrafter provides a systematic framework for benchmarking and evaluating large-scale video generation models, ensuring high-quality assessments across various video generation attributes. |
| VQAScore | Link | GitHub | ECCV | 2024 | This work introduces VQAScore, a novel alignment metric that uses a visual-question-answering model to assess image-text coherence, addressing the limitations of CLIPScore with complex prompts. It also presents GenAI-Bench, a challenging benchmark of 1,600 compositional prompts and 15,000 human ratings, enabling more accurate evaluation of generative models like Stable Diffusion and DALL-E 3. |
| VBench | Link | GitHub | CVPR | 2024 | VBench introduces a comprehensive evaluation benchmark for video generation, addressing the misalignment between current metrics and human perception. Its key innovations include 16 detailed evaluation dimensions, human preference alignment for validation, and the ability to assess various content types and model gaps. |
| DEVIL | Link | GitHub | NeurIPS | 2024 | DEVIL introduces a new benchmark with dynamic scores at different temporal granularities, achieving over 90% Pearson correlation with human ratings for comprehensive model assessment. |
| AIGCBench | Link | Website | arXiv | 2024 | AIGCBench is a benchmark for evaluating image-to-video (I2V) generation. It incorporates an open-domain image-text dataset and introduces 11 metrics across four dimensions: alignment, motion effects, temporal consistency, and video quality. |
| MiraData | Link | GitHub | NeurIPS | 2024 | MiraData offers longer videos, stronger motion intensity, and more detailed captions. It is paired with MiraBench to enhance evaluation with metrics such as 3D consistency and motion strength. |
| PhyGenEval | Link | Website | arXiv | 2024 | PhyGenBench is designed to evaluate the understanding of physical commonsense in text-to-video (T2V) generation, consisting of 160 prompts covering 27 physical laws across four domains, paired with the PhyGenEval evaluation framework that enables assessment of models' adherence to physical commonsense. |
| VideoPhy | Link | GitHub | arXiv | 2024 | VideoPhy is a benchmark designed to assess the physical commonsense accuracy of generated videos, particularly for T2V models, by evaluating their adherence to real-world physical laws and behaviors. |
| T2VHE | Link | GitHub | arXiv | 2024 | The T2VHE protocol is an approach for evaluating text-to-video (T2V) models, addressing challenges in the reproducibility, reliability, and practicality of manual evaluations. It includes defined metrics, annotator training, and a dynamic evaluation module. |
