
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

About

Reinforcement Learning (RL) benefits Large Language Models (LLMs) for complex reasoning. Inspired by this, we explore integrating spatio-temporal specific rewards into Multimodal Large Language Models (MLLMs) to address the unique challenges of video understanding, such as long-range temporal associations. This paper investigates how rule-based rewards, particularly temporal ones, can improve video reasoning and their generalizability. Our study proposes Reinforcement Fine-Tuning (RFT) as a data-efficient method to enhance video reasoning on specific tasks without sacrificing original capabilities. Through joint RFT on multiple spatio-temporal perception tasks, we developed VideoChat-R1, a powerful Video MLLM. VideoChat-R1 achieves state-of-the-art spatio-temporal perception, demonstrating significant improvements in tasks like temporal grounding (+31.8) and object tracking (+31.2), while also improving general QA benchmarks. The enhanced perception and preserved chat abilities contribute to a more reliable video dialogue system, leading to our "Temporal Clue-driven Reasoning" inference schema. This work provides a foundation for developing robust, real-world video comprehension agents.

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, Limin Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Understanding | MVBench | Accuracy 67.9 | 247 |
| Long-form Video Understanding | LongVideoBench | Accuracy 49.1 | 82 |
| Video Understanding | Video-MME without subtitles | Overall Score 62.1 | 67 |
| Long Video Understanding | LVBench | Accuracy 34.3 | 63 |
| Video Understanding | MLVU | M-AVG 69.5 | 54 |
| Temporal Action Localization | ActivityNet v1.3 (test) | -- | 47 |
| Temporal Grounding | ActivityNet Captions | Recall@1 (IoU=0.5) 33.4 | 45 |
| Long Video Understanding | Video-MME Long | Accuracy 46.2 | 37 |
| Temporal Grounding | Charades-STA | mIoU 60.8 | 33 |
| Video Understanding | Video-MME Long | Accuracy (Long, wo Sub) 53.4 | 32 |
Showing 10 of 69 rows
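
The temporal grounding rows report Recall@1 (IoU=0.5) and mIoU. As a rough illustration of how such metrics are commonly computed, here is a minimal sketch. It assumes one predicted segment and one ground-truth segment per query, both given as (start, end) in seconds. The function names are hypothetical, and this is not the paper's or the benchmarks' official evaluation code.

```python
from typing import List, Tuple

# Assumption: a segment is a (start_sec, end_sec) pair with start <= end.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Segment], gts: List[Segment], threshold: float = 0.5) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds: List[Segment], gts: List[Segment]) -> float:
    """Average IoU over all queries, reported as a percentage."""
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from any benchmark.
    preds = [(0.0, 10.0), (0.0, 10.0)]
    gts = [(0.0, 10.0), (5.0, 15.0)]
    print(recall_at_1(preds, gts))  # first query matches exactly, second has IoU 1/3
    print(mean_iou(preds, gts))
```

Under this reading, a Recall@1 (IoU=0.5) of 33.4 would mean that about a third of queries had a top prediction overlapping the ground truth with IoU of at least 0.5. The exact protocol of each benchmark may differ.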
