VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
About
Reinforcement Learning (RL) benefits Large Language Models (LLMs) on complex reasoning. Inspired by this, we explore integrating spatio-temporal-specific rewards into Multimodal Large Language Models (MLLMs) to address the unique challenges of video understanding, such as long-range temporal associations. This paper investigates how rule-based rewards, particularly temporal ones, can improve video reasoning, and how well those gains generalize. Our study proposes Reinforcement Fine-Tuning (RFT) as a data-efficient method to enhance video reasoning on specific tasks without sacrificing original capabilities. Through joint RFT on multiple spatio-temporal perception tasks, we developed VideoChat-R1, a powerful video MLLM. VideoChat-R1 achieves state-of-the-art spatio-temporal perception, demonstrating significant improvements on tasks such as temporal grounding (+31.8) and object tracking (+31.2), while also improving on general QA benchmarks. The enhanced perception and preserved chat abilities yield a more reliable video dialogue system, leading to our "Temporal Clue-driven Reasoning" inference schema. This work provides a foundation for developing robust, real-world video comprehension agents.
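As a rough illustration of what a rule-based temporal reward can look like, the sketch below scores a predicted `[start, end]` time span against ground truth by temporal IoU and adds a simple format reward. The function names, the format/accuracy split, and the equal weighting are our illustrative assumptions, not the paper's exact reward definition.

```python
# Illustrative sketch of a rule-based temporal reward for RFT on
# temporal grounding. Names and weighting are hypothetical.

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU of two [start, end] time intervals, in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred: tuple[float, float],
                     gt: tuple[float, float],
                     well_formatted: bool) -> float:
    # Combine a format reward (did the model emit a parseable span?)
    # with an IoU accuracy reward; equal weights are illustrative.
    format_r = 1.0 if well_formatted else 0.0
    return format_r + temporal_iou(pred, gt)
```

Because the reward is computed by a rule rather than a learned model, it needs no extra annotation beyond the grounding labels already in the dataset, which is what makes RFT data-efficient here.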
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | MVBench | Accuracy | 67.9 | 247 |
| Long-form Video Understanding | LongVideoBench | Accuracy | 49.1 | 82 |
| Video Understanding | Video-MME (without subtitles) | Overall Score | 62.1 | 67 |
| Long Video Understanding | LVBench | Accuracy | 34.3 | 63 |
| Video Understanding | MLVU | M-AVG | 69.5 | 54 |
| Temporal Action Localization | ActivityNet v1.3 (test) | -- | -- | 47 |
| Temporal Grounding | ActivityNet Captions | Recall@1 (IoU=0.5) | 33.4 | 45 |
| Long Video Understanding | Video-MME Long | Accuracy | 46.2 | 37 |
| Temporal Grounding | Charades-STA | mIoU | 60.8 | 33 |
| Video Understanding | Video-MME Long | Accuracy (Long, w/o Sub) | 53.4 | 32 |