Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
About
Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach that improves video reasoning capability with drastically better data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to use computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by 2.4% in accuracy while using only 3.6% of the training samples. Specifically, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark. Notably, our pure-RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
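The sparse-to-dense TTS idea described above (add frames until the sampled outputs agree) can be illustrated with a minimal sketch. This is not the authors' implementation: the `sample_answers` callable, the starting and maximum frame counts, the doubling schedule, the number of samples, and the majority-vote fallback are all assumptions made here for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence


def sparse_to_dense_answer(
    sample_answers: Callable[[int, int], Sequence[str]],
    start_frames: int = 32,   # assumed starting frame budget
    max_frames: int = 128,    # assumed upper bound on frames
    num_samples: int = 5,     # assumed number of sampled outputs per round
) -> str:
    """Return an answer, densifying frames only while outputs disagree.

    `sample_answers(num_frames, num_samples)` is a hypothetical hook that
    runs the video model on `num_frames` frames and returns `num_samples`
    final answers (non-empty).
    """
    num_frames = start_frames
    pooled: List[str] = []
    while True:
        answers = list(sample_answers(num_frames, num_samples))
        pooled.extend(answers)
        # Consistent outputs: stop early and save compute.
        if len(set(answers)) == 1:
            return answers[0]
        if num_frames >= max_frames:
            break
        # Inconsistent outputs: retry with a denser frame sampling.
        num_frames = min(num_frames * 2, max_frames)
    # Assumed fallback: majority vote over every answer seen so far.
    return Counter(pooled).most_common(1)[0][0]
```

Under these assumptions, easy questions terminate at the sparse frame budget, and only videos whose sampled answers disagree pay for additional frames.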
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | VideoMME | Overall Score | 63 | 222 |
| Video Question Answering | VideoMME | Accuracy | 63 | 210 |
| Long Video Understanding | LongVideoBench (val) | Accuracy | 56.6 | 210 |
| Video Question Answering | LongVideoBench | Accuracy | 56.6 | 180 |
| Long Video Understanding | LVBench | Accuracy | 43.2 | 133 |
| Video Question Answering | VideoMMMU | Accuracy | 52.7 | 124 |
| Video Understanding | VideoMME | -- | -- | 60 |
| Video Question Answering | LongVideoBench (val) | Accuracy | 56.6 | 55 |
| Long Video Understanding | Video-MME Long | Accuracy | 54.1 | 46 |
| Video Reasoning | Video-MMMU | Accuracy | 52.7 | 45 |