VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
About
Reinforcement Learning (RL) benefits Large Language Models (LLMs) on complex reasoning. Inspired by this, we explore integrating spatio-temporal-specific rewards into Multimodal Large Language Models (MLLMs) to address the unique challenges of video understanding, such as long-range temporal associations. This paper investigates how rule-based rewards, particularly temporal ones, can improve video reasoning, and how well those gains generalize. Our study proposes Reinforcement Fine-Tuning (RFT) as a data-efficient method to enhance video reasoning on specific tasks without sacrificing original capabilities. Through joint RFT on multiple spatio-temporal perception tasks, we developed VideoChat-R1, a powerful video MLLM. VideoChat-R1 achieves state-of-the-art spatio-temporal perception, demonstrating significant improvements on tasks such as temporal grounding (+31.8) and object tracking (+31.2), while also improving on general QA benchmarks. The enhanced perception and preserved chat abilities yield a more reliable video dialogue system, leading to our "Temporal Clue-driven Reasoning" inference schema. This work provides a foundation for developing robust, real-world video comprehension agents.
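As a rough illustration of what a rule-based temporal reward can look like, the sketch below scores a predicted `[start, end]` time span against ground truth by temporal IoU and adds a simple format reward. The function names, the format/accuracy split, and the equal weighting are our illustrative assumptions, not the paper's exact reward definition.

```python
# Illustrative sketch of a rule-based temporal reward for RFT on
# temporal grounding. Names and weighting are hypothetical.

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU of two [start, end] time intervals, in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred: tuple[float, float],
                     gt: tuple[float, float],
                     well_formatted: bool) -> float:
    # Combine a format reward (did the model emit a parseable span?)
    # with an IoU accuracy reward; equal weights are illustrative.
    format_r = 1.0 if well_formatted else 0.0
    return format_r + temporal_iou(pred, gt)
```

Because the reward is computed by a rule rather than a learned model, it needs no extra annotation beyond the grounding labels already in the dataset, which is what makes RFT data-efficient here.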
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | MVBench | Accuracy | 67.9 | 247 |
| Long-form Video Understanding | LongVideoBench | Accuracy | 49.1 | 82 |
| Video Understanding | Video-MME (without subtitles) | Overall Score | 62.1 | 67 |
| Long Video Understanding | LVBench | Accuracy | 34.3 | 63 |
| Video Understanding | MLVU | M-AVG | 69.5 | 54 |
| Temporal Action Localization | ActivityNet v1.3 (test) | -- | -- | 47 |
| Temporal Grounding | ActivityNet Captions | Recall@1 (IoU=0.5) | 33.4 | 45 |
| Long Video Understanding | Video-MME Long | Accuracy | 46.2 | 37 |
| Temporal Grounding | Charades-STA | mIoU | 60.8 | 33 |
| Video Understanding | Video-MME Long | Accuracy (Long, w/o Sub) | 53.4 | 32 |