Video-Bench (HAbench) is a benchmark designed to systematically leverage multimodal large language models (MLLMs) across all dimensions relevant to video generation assessment. By incorporating few-shot scoring and chain-of-query techniques, Video-Bench provides a structured, scalable approach to evaluating generated videos.
⭐Overview | 📒Leaderboard | 🤗HumanAlignment | 🛠️Installation | 🗃️Preparation | ⚡Instructions | 🚀Usage | 📭Citation | 📝Literature
- Overview
- Leaderboard
- HumanAlignment
- Installation
- Preparation
- Instructions
- Usage
- Citation
- Literature
Model | Imaging Quality | Aesthetic Quality | Temporal Consist. | Motion Effects | Avg Rank | Video-text Consist. | Object-class Consist. | Color Consist. | Action Consist. | Scene Consist. | Avg Rank | Overall Avg Rank |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Cogvideox [57] | 3.87 | 3.84 | 4.14 | 3.55 | 3.00 | **4.62** | 2.81 | **2.92** | **2.81** | **2.93** | 1.60 | 2.22 |
Gen3 [42] | **4.66** | **4.44** | **4.74** | **3.99** | 1.00 | 4.38 | 2.81 | 2.87 | 2.59 | **2.93** | 2.40 | 1.78 |
Kling [24] | 4.26 | 3.82 | 4.38 | 3.11 | 2.75 | 4.07 | 2.70 | 2.81 | 2.50 | 2.82 | 4.60 | 3.78 |
VideoCrafter2 [5] | 4.08 | 3.85 | 3.69 | 2.81 | 3.75 | 4.18 | **2.85** | 2.90 | 2.53 | 2.78 | 2.80 | 3.22 |
LaVie [52] | 3.00 | 2.94 | 3.00 | 2.43 | 7.00 | 3.71 | 2.82 | 2.81 | 2.45 | 2.63 | 5.00 | 5.88 |
PiKa-Beta [38] | 3.78 | 3.76 | 3.40 | 2.59 | 5.50 | 3.78 | 2.51 | 2.52 | 2.25 | 2.60 | 6.80 | 6.22 |
Show-1 [60] | 3.30 | 3.28 | 3.90 | 2.90 | 5.00 | 4.21 | 2.82 | 2.79 | 2.53 | 2.72 | 3.80 | 4.33 |
Notes:
- Higher scores indicate better performance; for the Avg Rank and Overall Avg Rank columns, lower is better.
- The best score in each dimension is highlighted in bold.
Metrics | Benchmark | Imaging Quality | Aesthetic Quality | Temporal Consist. | Motion Effects | Video-text Consist. | Action Consist. | Object-class Consist. | Color Consist. | Scene Consist. |
---|---|---|---|---|---|---|---|---|---|---|
MUSIQ [21] | VBench [19] | 0.363 | - | - | - | - | - | - | - | - |
LAION | VBench [19] | - | 0.446 | - | - | - | - | - | - | - |
CLIP [40] | VBench [19] | - | - | 0.260 | - | - | - | - | - | - |
RAFT [48] | VBench [19] | - | - | - | 0.329 | - | - | - | - | - |
AMT [28] | VBench [19] | - | - | - | 0.329 | - | - | - | - | - |
ViCLIP [53] | VBench [19] | - | - | - | - | 0.445 | - | - | - | - |
UMT [27] | VBench [19] | - | - | - | - | - | 0.411 | - | - | - |
GRiT [54] | VBench [19] | - | - | - | - | - | - | 0.469 | 0.545 | - |
Tag2Text [16] | VBench [19] | - | - | - | - | - | - | - | - | 0.422 |
ComBench [46] | ComBench [46] | - | - | - | - | 0.633 | 0.633 | 0.611 | 0.696 | 0.631 |
Video-Bench | Video-Bench | **0.733** | **0.702** | **0.402** | **0.514** | **0.732** | **0.718** | **0.735** | **0.750** | **0.733** |
Notes:
- Higher scores indicate better performance.
- The best score in each dimension is highlighted in bold.
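The alignment numbers above compare automatic scores against human annotations per dimension. As a rough illustration of how such a figure could be reproduced, the sketch below computes a Spearman rank correlation with SciPy; the file names and score layout are hypothetical, and the toolkit may use a different statistic.

```python
# Illustrative only: correlate automatic scores with human ratings for one
# dimension. File names and JSON layout are assumptions, not the toolkit's
# actual output format.
import json
from scipy.stats import spearmanr

with open("auto_scores.json") as f:    # e.g. {"video_001": 4, ...}
    auto = json.load(f)
with open("human_scores.json") as f:   # same keys, human ratings
    human = json.load(f)

keys = sorted(set(auto) & set(human))
rho, p = spearmanr([auto[k] for k in keys], [human[k] for k in keys])
print(f"Spearman rho = {rho:.3f} (p = {p:.3g}, n = {len(keys)})")
```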
- Python >= 3.8
- OpenAI API access
Update your OpenAI API keys in config.json:

{
    "GPT4o_API_KEY": "your-api-key",
    "GPT4o_BASE_URL": "your-base-url",
    "GPT4o_mini_API_KEY": "your-mini-api-key",
    "GPT4o_mini_BASE_URL": "your-mini-base-url"
}
- Install with pip:

  pip install HAbench

- Install with git clone:

  git clone https://github.com/yourusername/Video-Bench.git
  cd Video-Bench
  pip install -r requirements.txt
Download the model weights:

wget https://huggingface.co/Video-Bench/Video-Bench -O ./pytorch_model.bin
or
curl -L https://huggingface.co/Video-Bench/Video-Bench -o ./pytorch_model.bin
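If the direct link above does not resolve to the weight file in your environment, the huggingface_hub client offers an alternative; the repo id and filename below are taken from the commands above and may need adjusting to the repository's actual layout.

```python
# Alternative download via huggingface_hub (repo id and filename assumed
# from the wget command above; adjust if the repository layout differs).
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="Video-Bench/Video-Bench",
                       filename="pytorch_model.bin")
print("Model weights downloaded to:", path)
```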
Please organize your data according to the following data structure:
# Data Structure
/Video-Bench/data/
├── color/ # 'color' dimension videos
│ ├── cogvideox5b/
│ │ ├── A red bird_0.mp4
│ │ ├── A red bird_1.mp4
│ │ └── ...
│ ├── lavie/
│ │ ├── A red bird_0.mp4
│ │ ├── A red bird_1.mp4
│ │ └── ...
│ ├── pika/
│ │ └── ...
│ └── ...
│
├── object_class/ # 'object_class' dimension videos
│ ├── cogvideox5b/
│ │ ├── A train_0.mp4
│ │ ├── A train_1.mp4
│ │ └── ...
│ ├── lavie/
│ │ └── ...
│ └── ...
│
├── scene/ # 'scene' dimension videos
│ ├── cogvideox5b/
│ │ ├── Botanical garden_0.mp4
│ │ ├── Botanical garden_1.mp4
│ │ └── ...
│ └── ...
│
├── action/ # 'action' 'temporal_consistency' 'motion_effects' dimension videos
│ ├── cogvideox5b/
│ │ ├── A person is marching_0.mp4
│ │ ├── A person is marching_1.mp4
│ │ └── ...
│ └── ...
│
└── video-text consistency/ # 'video-text consistency' 'imaging_quality' 'aesthetic_quality' dimension videos
├── cogvideox5b/
│ ├── Close up of grapes on a rotating table._0.mp4
│ └── ...
├── lavie/
│ └── ...
├── pika/
│ └── ...
└── ...
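Before launching an evaluation, a quick sanity check of this layout can save a failed run. The sketch below simply walks the expected tree and counts videos per model; it is an illustrative helper, not part of Video-Bench, and the dimension folder names mirror the structure above.

```python
# Quick sanity check of the expected data layout (illustrative helper, not
# shipped with Video-Bench). Dimension folder names follow the tree above.
from pathlib import Path

DATA_ROOT = Path("./data")
DIMENSION_DIRS = ["color", "object_class", "scene", "action",
                  "video-text consistency"]

for dim in DIMENSION_DIRS:
    dim_dir = DATA_ROOT / dim
    if not dim_dir.is_dir():
        print(f"[missing] {dim_dir}")
        continue
    for model_dir in sorted(p for p in dim_dir.iterdir() if p.is_dir()):
        n_videos = len(list(model_dir.glob("*.mp4")))
        print(f"{dim}/{model_dir.name}: {n_videos} videos")
```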
Video-Bench provides comprehensive evaluation across multiple dimensions of video generation quality. Each dimension is assessed using a specific scoring scale to ensure accurate and meaningful evaluation.
Dimension | Description | Scale | Module |
---|---|---|---|
Static Quality | |||
Image Quality | Evaluates technical aspects including clarity and sharpness | 1-5 | staticquality.py |
Aesthetic Quality | Assesses visual appeal and artistic composition | 1-5 | staticquality.py |
Dynamic Quality | |||
Temporal Consistency | Measures frame-to-frame coherence and smoothness | 1-5 | dynamicquality.py |
Motion Effects | Evaluates quality of movement and dynamics | 1-5 | dynamicquality.py |
Video-Text Alignment | |||
Video-Text Consistency | Overall alignment with text prompt | 1-5 | VideoTextAlignment.py |
Object-Class Consistency | Accuracy of object representation | 1-3 | VideoTextAlignment.py |
Color Consistency | Matching of colors with text prompt | 1-3 | VideoTextAlignment.py |
Action Consistency | Accuracy of depicted actions | 1-3 | VideoTextAlignment.py |
Scene Consistency | Correctness of scene environment | 1-3 | VideoTextAlignment.py |
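For scripting purposes, the table above can be restated as a simple lookup from dimension name to scoring scale and evaluator module; the dictionary below is illustrative and only mirrors the table; the actual module interfaces may differ.

```python
# The table above as a lookup (illustrative; actual module interfaces in
# Video-Bench may differ). Keys match the dimension names used by the CLI.
DIMENSIONS = {
    "imaging_quality":         {"scale": (1, 5), "module": "staticquality.py"},
    "aesthetic_quality":       {"scale": (1, 5), "module": "staticquality.py"},
    "temporal_consistency":    {"scale": (1, 5), "module": "dynamicquality.py"},
    "motion_effects":          {"scale": (1, 5), "module": "dynamicquality.py"},
    "video-text consistency":  {"scale": (1, 5), "module": "VideoTextAlignment.py"},
    "object_class":            {"scale": (1, 3), "module": "VideoTextAlignment.py"},
    "color":                   {"scale": (1, 3), "module": "VideoTextAlignment.py"},
    "action":                  {"scale": (1, 3), "module": "VideoTextAlignment.py"},
    "scene":                   {"scale": (1, 3), "module": "VideoTextAlignment.py"},
}
```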
Video-Bench supports two modes: standard mode and custom input mode.
Video-Bench only supports assessment of the following dimensions: 'aesthetic_quality', 'imaging_quality', 'temporal_consistency', 'motion_effects', 'color', 'object_class', 'scene', 'action', 'video-text consistency'.
This evaluation mode assesses videos generated by various video generation models using the prompt suite defined in our HAbench_full.json. Video-Bench supports evaluating videos produced by our preselected seven generation models as well as any additional models specified by the user.
To evaluate videos, simply specify the models to be tested with the --models parameter. For example, to evaluate videos under modelname1 and modelname2, use one of the following commands with --models modelname1 modelname2:
python evaluate.py \
--dimension $DIMENSION \
--videos_path ./data/ \
--config_path ./config.json \
--models modelname1 modelname2
or
HAbench \
--dimension $DIMENSION \
--videos_path ./data/ \
--config_path ./config.json \
--models modelname1 modelname2
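To sweep several dimensions in one run, the commands above can be driven from a small wrapper script. The sketch below shells out to evaluate.py with subprocess; it assumes the CLI flags shown above and is not part of the Video-Bench package.

```python
# Illustrative wrapper: run evaluate.py for several dimensions in sequence.
# Assumes the CLI flags shown above; not shipped with Video-Bench.
import subprocess

dimensions = ["imaging_quality", "aesthetic_quality",
              "temporal_consistency", "motion_effects"]

for dim in dimensions:
    subprocess.run(
        ["python", "evaluate.py",
         "--dimension", dim,
         "--videos_path", "./data/",
         "--config_path", "./config.json",
         "--models", "modelname1", "modelname2"],
        check=True,
    )
```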
This mode allows users to evaluate videos generated from prompts that are not included in the Video-Bench prompt suite.
You can provide prompts in two ways:
- Single prompt: use --prompt "your customized prompt" to specify a single prompt.
- Multiple prompts: create a JSON file containing your prompts and use --prompt_file $json_path to load them. The JSON file can follow this format:
{
  "0": "prompt1",
  "1": "prompt2",
  ...
}
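Note that JSON keys must be strings, so the indices are quoted. A prompt file in this format can be produced as in the sketch below; the prompts themselves are placeholders.

```python
# Generate a prompt file in the format expected by --prompt_file
# (placeholder prompts; JSON requires string keys).
import json

prompts = ["A red bird perched on a branch", "A train crossing a bridge"]

with open("custom_prompts.json", "w") as f:
    json.dump({str(i): p for i, p in enumerate(prompts)}, f, indent=2)
```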
For static quality dimensions, set --mode custom_static:
python evaluate.py \
--dimension $DIMENSION \
--videos_path ./data/ \
--mode custom_static \
--config_path ./config.json \
--models modelname1 modelname2
or
HAbench \
--dimension $DIMENSION \
--videos_path ./data/ \
--mode custom_static \
--config_path ./config.json \
--models modelname1 modelname2
For video-text alignment or dynamic quality dimensions, set --mode custom_nonstatic:
python evaluate.py \
--dimension $DIMENSION \
--videos_path ./data/ \
--mode custom_nonstatic \
--config_path ./config.json \
--models modelname1 modelname2
or
HAbench \
--dimension $DIMENSION \
--videos_path ./data/ \
--mode custom_nonstatic \
--config_path ./config.json \
--models modelname1 modelname2
If you use our dataset or code, or find Video-Bench useful, please cite our paper in your work as:
@article{ni2023content,
title={Video-Bench: Human Preference Aligned Video Generation Benchmark},
author={Han, Hui and Li, Siyuan and Chen, Jiaqi and Yuan, Yiwen and Wu, Yuling and Leong, Chak Tou and Du, Hanwen and Fu, Junchen and Li, Youhua and Zhang, Jie and Zhang, Chi and Li, Li-jia and Ni, Yongxin},
journal={arXiv preprint arXiv:xxx},
year={2024}
}
Model | Paper | Resource | Conference/Journal/Preprint | Year | Features |
---|---|---|---|---|---|
Video-Bench | Link | GitHub | Arxiv | 2024 | Video-Bench leverages Multimodal Large Language Models (MLLMs) to provide evaluations that closely align with human preferences across multiple dimensions of video quality. It incorporates few-shot scoring and chain-of-query techniques for scalable, structured assessments, supports cross-modal consistency, and offers more objective insights in cases where it diverges from human judgments, making it a reliable and comprehensive tool for video generation evaluation. |
FETV | Link | GitHub | NeurIPS | 2023 | FETV is multi-aspect, categorizing the prompts based on three orthogonal aspects: the major content, the attributes to control and the prompt complexity. |
FVD | Link | GitHub | ICLR Workshop | 2023 | A novel metric for generative video models that extends the Fréchet Inception Distance (FID) to account for not only visual quality but also temporal coherence and diversity, addressing the lack of qualitative metrics in current video generation evaluation. |
GAIA | Link | GitHub | Arxiv | 2024 | By adopting a causal reasoning perspective, it evaluates popular text-to-video (T2V) models on their ability to generate visually rational actions and benchmarks existing automatic evaluation methods, revealing a significant gap between current models and human perception patterns. |
SAVGBench | Link | Links | Arxiv | 2024 | This work introduces a benchmark for Spatially Aligned Audio-Video Generation (SAVG), focusing on spatial alignment between audio and visuals. Key innovations include a new dataset, a baseline diffusion model for stereo audio-visual learning, and a spatial alignment metric, revealing significant gaps in quality and alignment between the model and ground truth. |
VBench++ | Link | GitHub | Arxiv | 2024 | VBench++ is a comprehensive benchmark for video generation, featuring 16 evaluation dimensions, human alignment validation, and support for both text-to-video and image-to-video models, assessing both technical quality and model trustworthiness. |
T2V-CompBench | Link | GitHub | Arxiv | 2024 | T2V-CompBench evaluates diverse aspects such as attribute binding, spatial relationships, motion, and object interactions. It introduces tailored evaluation metrics based on MLLM, detection, and tracking, validated by human evaluation. |
VideoScore | Link | Website | EMNLP | 2024 | It introduces a dataset with human-provided multi-aspect scores for 37.6K videos from 11 generative models. VideoScore is trained on this to provide automatic video quality assessment, achieving a 77.1 Spearman correlation with human ratings. |
ChronoMagic-Bench | Link | Website | NeurIPS | 2024 | ChronoMagic-Bench evaluates T2V models on their ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence, using 1,649 prompts across four categories. Its advantages include the introduction of new metrics (MTScore and CHScore) and a large-scale dataset (ChronoMagic-Pro) for comprehensive, high-quality evaluation. |
T2VSafetyBench | Link | GitHub | NeurIPS | 2024 | T2VSafetyBench introduces a benchmark for assessing the safety of text-to-video models, focusing on 12 critical aspects of video generation safety, including temporal risks. It addresses the unique safety concerns of video generation, providing a malicious prompt dataset, and offering valuable insights into the trade-off between usability and safety. |
T2VBench | Link | Website | CVPR | 2024 | T2VBench focuses on 16 critical temporal dimensions such as camera transitions and event sequences for evaluating text-to-video models, consisting of a hierarchical framework with over 1,600 prompts and 5,000 videos. |
EvalCrafter | Link | Website | CVPR | 2024 | EvalCrafter provides a systematic framework for benchmarking and evaluating large-scale video generation models, ensuring high-quality assessments across various video generation attributes. |
VQAScore | Link | GitHub | ECCV | 2024 | This work introduces VQAScore, a novel alignment metric that uses a visual-question-answering model to assess image-text coherence, addressing the limitations of CLIPScore with complex prompts. It also presents GenAI-Bench, a challenging benchmark of 1,600 compositional prompts and 15,000 human ratings, enabling more accurate evaluation of generative models like Stable Diffusion and DALL-E 3. |
VBench | Link | GitHub | CVPR | 2024 | VBench introduces a comprehensive evaluation benchmark for video generation, addressing the misalignment between current metrics and human perception. Its key innovations include 16 detailed evaluation dimensions, human preference alignment for validation, and the ability to assess various content types and model gaps. |
DEVIL | Link | GitHub | NeurIPS | 2024 | DEVIL introduces a new benchmark with dynamic scores at different temporal granularities, achieving over 90% Pearson correlation with human ratings for comprehensive model assessment. |
AIGCBench | Link | Website | Arxiv | 2024 | AIGCBench is a benchmark for evaluating image-to-video (I2V) generation. It incorporates an open-domain image-text dataset and introduces 11 metrics across four dimensions—alignment, motion effects, temporal consistency, and video quality. |
MiraData | Link | GitHub | NeurIPS | 2024 | MiraData offers longer videos, stronger motion intensity, and more detailed captions. Paired with MiraBench to enhance evaluation with metrics like 3D consistency and motion strength. |
PhyGenEval | Link | Website | Arxiv | 2024 | PhyGenBench is designed to evaluate the understanding of physical commonsense in text-to-video (T2V) generation, consisting of 160 prompts covering 27 physical laws across four domains, paired with the PhyGenEval evaluation framework that enables assessments of models' adherence to physical commonsense. |
VideoPhy | Link | GitHub | Arxiv | 2024 | VideoPhy is a benchmark designed to assess the physical commonsense accuracy of generated videos, particularly for T2V models, by evaluating their adherence to real-world physical laws and behaviors. |
T2VHE | Link | GitHub | Arxiv | 2024 | The T2VHE protocol is an approach for evaluating text-to-video (T2V) models, addressing challenges in reproducibility, reliability, and practicality of manual evaluations. It includes defined metrics, annotator training, and a dynamic evaluation module. |