VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
About
Reinforcement Learning (RL) benefits Large Language Models (LLMs) on complex reasoning. Inspired by this, we explore integrating spatio-temporal rewards into Multimodal Large Language Models (MLLMs) to address the unique challenges of video understanding, such as long-range temporal associations. This paper investigates how rule-based rewards, particularly temporal ones, can improve video reasoning, and how well those gains generalize. Our study proposes Reinforcement Fine-Tuning (RFT) as a data-efficient method to enhance video reasoning on specific tasks without sacrificing original capabilities. Through joint RFT on multiple spatio-temporal perception tasks, we develop VideoChat-R1, a powerful Video MLLM. VideoChat-R1 achieves state-of-the-art spatio-temporal perception, with significant improvements on tasks such as temporal grounding (+31.8) and object tracking (+31.2), while also improving on general QA benchmarks. The enhanced perception and preserved chat abilities yield a more reliable video dialogue system and motivate our "Temporal Clue-driven Reasoning" inference schema. This work provides a foundation for developing robust, real-world video comprehension agents.
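To make the idea of a rule-based temporal reward concrete, here is a minimal sketch of how such a reward could be computed for temporal grounding: the reward is the IoU between the predicted and ground-truth time intervals, optionally plus a small format bonus. The function names, the 0.1 bonus weight, and the exact combination are illustrative assumptions, not the paper's specified implementation.

```python
def temporal_iou(pred, gt):
    """IoU between two time intervals, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def rule_based_reward(pred_span, gt_span, answer_well_formatted):
    # Illustrative rule-based reward: IoU accuracy term plus a small
    # format bonus (hypothetical 0.1 weight) for parseable answers.
    bonus = 0.1 if answer_well_formatted else 0.0
    return temporal_iou(pred_span, gt_span) + bonus
```

Because the reward is a fixed rule rather than a learned model, it cannot be reward-hacked as easily and needs no extra annotation beyond the ground-truth spans, which is what makes RFT data-efficient here.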
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | MVBench | Accuracy | 67.9 | 425 |
| Video Understanding | VideoMME | Score (Long) | 46.2 | 248 |
| Video Understanding | VideoMME | -- | -- | 222 |
| Video Question Answering | VideoMME | Accuracy | 64.1 | 210 |
| Long Video Understanding | LVBench | Accuracy | 34.3 | 133 |
| Video Question Answering | VideoMMMU | Accuracy | 52.34 | 124 |
| Long-form Video Understanding | LongVideoBench | Accuracy | 49.1 | 115 |
| Video Understanding | Video-MME (without subtitles) | Overall Score | 62.1 | 89 |
| Temporal Grounding | Charades-STA | R@0.5 | 71.7 | 88 |
| Video Understanding | MLVU | Accuracy | 54.3 | 80 |