VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
About
Reinforcement Learning (RL) benefits Large Language Models (LLMs) for complex reasoning. Inspired by this, we explore integrating spatio-temporal specific rewards into Multimodal Large Language Models (MLLMs) to address the unique challenges of video understanding, such as long-range temporal associations. This paper investigates how rule-based rewards, particularly temporal ones, can improve video reasoning and their generalizability. Our study proposes Reinforcement Fine-Tuning (RFT) as a data-efficient method to enhance video reasoning on specific tasks without sacrificing original capabilities. Through joint RFT on multiple spatio-temporal perception tasks, we developed VideoChat-R1, a powerful Video MLLM. VideoChat-R1 achieves state-of-the-art spatio-temporal perception, demonstrating significant improvements in tasks like temporal grounding (+31.8) and object tracking (+31.2), while also improving general QA benchmarks. The enhanced perception and preserved chat abilities contribute to a more reliable video dialogue system, leading to our "Temporal Clue-driven Reasoning" inference schema. This work provides a foundation for developing robust, real-world video comprehension agents.
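To make the idea of a rule-based temporal reward concrete, here is a minimal sketch. The abstract does not specify the reward used, so this assumes one common choice for temporal grounding: the intersection-over-union (IoU) between a predicted time interval and the ground-truth interval. The function name `temporal_iou` and the `(start, end)` tuple format are hypothetical, chosen for illustration only.

```python
def temporal_iou(pred, gt):
    """Illustrative rule-based reward: IoU of two (start, end) intervals in seconds.

    Assumption: this is not taken from the paper; it is a generic
    temporal-overlap score that could serve as a verifiable reward.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt

    # Malformed or empty intervals earn no reward.
    if pred_end <= pred_start or gt_end <= gt_start:
        return 0.0

    # Overlap length, clamped at zero when the intervals are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union


if __name__ == "__main__":
    # Predicted segment 0-10 s against a ground-truth segment 5-15 s.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

Because the score is computed purely from the predicted and annotated timestamps, it needs no learned reward model, which is what makes a reward of this kind "rule-based".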
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Understanding | MVBench | Accuracy 67.9 | 563 |
| Video Understanding | VideoMME | -- | 357 |
| Video Question Answering | VideoMME | Accuracy 64.1 | 251 |
| Video Understanding | VideoMME | -- | 222 |
| Long Video Understanding | LVBench | Accuracy 34.3 | 218 |
| Video Understanding | MVBench (test) | Accuracy 67.9 | 190 |
| Temporal Video Understanding | TempCompass | Accuracy 73.9 | 141 |
| Video Question Answering | VideoMMMU | Accuracy 52.34 | 140 |
| Long-form Video Understanding | LongVideoBench | Accuracy 49.1 | 135 |
| Video Understanding | LongVideoBench | -- | 123 |