Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
About
Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach that improves video reasoning capability with drastically better data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to use computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by 2.4% in accuracy while using only 3.6% of the training samples. Specifically, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark. Notably, our pure-RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
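The sparse-to-dense TTS idea described above (start with few frames, add more only when the model's outputs disagree) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the consistency criterion, the frame schedule, or the sample count, so the unanimity check, the doubling schedule, the majority-vote fallback, and the defaults below are all assumptions. `answer_fn` is a hypothetical stand-in for a stochastic call to the video reasoning model.

```python
from collections import Counter
from typing import Callable, List, Sequence


def sample_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick roughly evenly spaced frame indices (centre of each segment)."""
    num_frames = min(num_frames, total_frames)
    step = total_frames / num_frames
    return [int(step * i + step / 2) for i in range(num_frames)]


def sparse_to_dense_answer(
    total_frames: int,
    answer_fn: Callable[[Sequence[int]], str],  # hypothetical model call
    start_frames: int = 8,    # assumed starting budget
    max_frames: int = 64,     # assumed frame cap
    num_samples: int = 4,     # assumed number of sampled outputs per round
) -> str:
    """Return an answer, densifying frames only while outputs disagree."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")

    num_frames = start_frames
    while True:
        indices = sample_frame_indices(total_frames, num_frames)
        # Sample several outputs from the model on the same frames.
        answers = [answer_fn(indices) for _ in range(num_samples)]
        top_answer, top_count = Counter(answers).most_common(1)[0]

        unanimous = top_count == num_samples
        budget_spent = num_frames >= max_frames or num_frames >= total_frames
        if unanimous or budget_spent:
            # Assumed fallback: majority vote once the frame budget is spent.
            return top_answer

        # Outputs disagree: assumed schedule doubles the frame count.
        num_frames = min(num_frames * 2, max_frames)
```

Under these assumptions, easy questions stop at the sparse setting after a single round of sampling, and only questions with inconsistent outputs pay for denser frame sampling.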
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Understanding | VideoMME | Overall Score | 63 | 192 |
| Long Video Understanding | LongVideoBench (val) | Accuracy | 56.6 | 139 |
| Video Question Answering | VideoMME | -- | -- | 99 |
| Long Video Understanding | LVBench | Accuracy | 43.2 | 63 |
| Video Question Answering | VideoMMMU | Accuracy | 52.7 | 52 |
| Long Video Understanding | Video-MME Overall | Accuracy | 63 | 39 |
| Long Video Understanding | Video-MME Long | Accuracy | 54.1 | 37 |
| Video Question Answering | LongVideoBench | Accuracy | 56.6 | 34 |
| Video Reasoning | Video-MMMU | Accuracy | 52.7 | 32 |
| Video Reasoning | Video-Holmes | Score | 40.7 | 20 |