
Video-R1: Reinforcing Video Reasoning in MLLMs

About

Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We construct two datasets, both comprising image and video data: Video-R1-CoT-165k for the SFT cold start and Video-R1-260k for RL training. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass. Notably, Video-R1-7B attains 37.1% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released at https://github.com/tulerfeng/Video-R1.

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, Xiangyu Yue • 2025
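
The abstract does not spell out how T-GRPO rewards temporal reasoning, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It shows the standard GRPO step of standardizing rewards within a group of sampled responses, plus one plausible temporal bonus: correct answers earn extra reward when the model does better on frames in their original order than on shuffled frames. The function names and the `bonus` and `margin` values are hypothetical and chosen for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def temporal_rewards(ordered_correct, shuffled_correct, bonus=0.3, margin=0.8):
    """Hypothetical temporal bonus (an assumption, not taken from this page).

    ordered_correct / shuffled_correct are 0/1 correctness flags for responses
    sampled with frames in order and with frames shuffled, respectively.
    """
    p_ordered = sum(ordered_correct) / len(ordered_correct)
    p_shuffled = sum(shuffled_correct) / len(shuffled_correct)
    # Grant the bonus only if in-order frames help relative to shuffled frames.
    use_bonus = p_ordered > margin * p_shuffled
    return [
        float(c) + (bonus if (use_bonus and c) else 0.0)
        for c in ordered_correct
    ]


if __name__ == "__main__":
    rewards = temporal_rewards([1, 1, 0, 1], [1, 0, 0, 0])
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the advantages are standardized per group, the bonus only matters through how it changes the spread of rewards inside that group; a group in which every response receives the same reward yields zero advantage for all of them.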

Related benchmarks

Task | Dataset | Metric | Result | Rank
Object Hallucination Evaluation | POPE | Accuracy | 85.5 | 1455
Video Understanding | MVBench | Accuracy | 64.8 | 425
Mathematical Reasoning | MathVista | Accuracy | 71 | 257
Video Understanding | VideoMME | Score (Long) | 50.6 | 248
Long Video Understanding | LongVideoBench | Score | 54.6 | 248
Video Understanding | VideoMME | Overall Score | 59.3 | 222
Video Question Answering | VideoMME | Accuracy | 64.3 | 210
Long Video Understanding | LongVideoBench (val) | -- | -- | 210
Spatial Reasoning | VSI-Bench | Avg Score | 37.1 | 192
Video Understanding | EgoSchema | EgoSchema Score | 47.6 | 158
Showing 10 of 175 rows