VideoScore2: Think before You Score in Generative Video Evaluation

About

Recent advances in text-to-video generation have produced increasingly realistic and diverse content, yet evaluating such videos remains a fundamental challenge due to their multi-faceted nature encompassing visual quality, semantic alignment, and physical consistency. Existing evaluators and reward models are limited to single opaque scores, lack interpretability, or provide only coarse analysis, making them insufficient for capturing the comprehensive nature of video quality assessment. We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. Our model is trained on a large-scale dataset, VideoFeedback2, containing 27,168 human-annotated videos with both scores and reasoning traces across three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance with 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc.), while providing interpretable assessments that bridge the gap between evaluation and controllable generation through effective reward modeling for Best-of-N sampling. Project Page: https://tiger-ai-lab.github.io/VideoScore2/

Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, Keming Wu, Benjamin Schneider, Quy Duc Do, Zhuofeng Li, Yiming Jia, Yuxuan Zhang, Guo Cheng, Haozhe Wang, Wangchunshu Zhou, Qunshu Lin, Yuanxing Zhang, Ge Zhang, Wenhao Huang, Wenhu Chen • 2025
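
The abstract mentions using the evaluator as a reward model for Best-of-N sampling. The following is a minimal sketch of that idea, not the authors' implementation: generate N candidate videos for a prompt, score each on the three dimensions, and keep the one with the highest aggregate. The names `best_of_n`, `score_fn`, `DIMENSIONS`, the equal weighting, and the toy scores are all assumptions made for illustration; the page does not say how per-dimension scores are combined.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical dimension names mirroring the three axes named in the abstract.
DIMENSIONS = ("visual_quality", "text_alignment", "physical_consistency")


def best_of_n(
    candidates: Sequence[str],
    score_fn: Callable[[str], Dict[str, float]],
    weights: Optional[Dict[str, float]] = None,
) -> Tuple[str, float]:
    """Return the candidate with the highest weighted-mean score.

    `score_fn` stands in for a call to a video evaluator; it must return
    one score per entry in DIMENSIONS. Equal weights are an assumption.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total_weight = sum(weights[d] for d in DIMENSIONS)

    best_candidate, best_score = candidates[0], float("-inf")
    for candidate in candidates:
        scores = score_fn(candidate)
        aggregate = sum(weights[d] * scores[d] for d in DIMENSIONS) / total_weight
        if aggregate > best_score:  # ties keep the earliest candidate
            best_candidate, best_score = candidate, aggregate
    return best_candidate, best_score


# Toy scores standing in for evaluator output (made-up numbers).
toy_scores = {
    "video_a": {"visual_quality": 3, "text_alignment": 4, "physical_consistency": 2},
    "video_b": {"visual_quality": 4, "text_alignment": 4, "physical_consistency": 3},
    "video_c": {"visual_quality": 5, "text_alignment": 2, "physical_consistency": 2},
}

best, score = best_of_n(list(toy_scores), toy_scores.__getitem__)
print(best, round(score, 3))
```

Passing a different `weights` dictionary lets one dimension dominate the selection, which is one way a multi-dimensional evaluator could steer generation toward, say, physical consistency.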

Related benchmarks

Task	Dataset	Metric	Result
Text-to-Video Generation	T2V-CompBench	Consistency Attribute Score	80.86	92
Video Generation	VBench 2.0	Human Fidelity	0.7809	26
Content Quality Assessment	UltraVQA	Acc@0.5	60.2	14
Motion Quality Assessment	UltraVQA	Acc@0.5	69.8	14
Aesthetic Quality Assessment	UltraVQA	Accuracy@0.5	63.7	14
Clarity Quality Assessment	UltraVQA	Acc@0.5	75.3	14
Motion Amplitude Assessment	UltraVQA	Accuracy@0.5	70.5	14
Long Video Quality Evaluation	HoloCine	Spearman Correlation	0.061	12
Long Video Quality Evaluation	Sora 2	Spearman Correlation	0.004	12
Long Video Quality Evaluation	StoryMem	Spearman Correlation	-0.069	12

Showing 10 of 24 rows
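
Several rows above report a Spearman correlation between evaluator scores and reference ratings. The page does not describe how those benchmarks compute it, so the following is only a sketch of the standard definition (Pearson correlation of the ranks, with tied values sharing their average rank). The function names and the sample numbers are illustrative assumptions, not data from any benchmark.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up ratings: reference scores vs. evaluator scores for five videos.
reference = [1, 2, 3, 4, 5]
predicted = [2, 1, 4, 3, 5]
rho = spearman(reference, predicted)
print(round(rho, 3))
```

A value near 1 means the evaluator orders videos the same way as the reference ratings, a value near 0 means the two orderings are essentially unrelated, and a negative value means they tend to disagree.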
