Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

About

Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach that improves video reasoning capability with drastically better data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to use computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by 2.4% in accuracy while using only 3.6% of the training samples. Specifically, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark. Notably, our pure-RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
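
The sparse-to-dense TTS idea described above (start with few frames, add more only when the sampled outputs disagree) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `sparse_to_dense_answer`, the `answer_fn` callable standing in for sampling the video model, the frame schedule, the number of samples, the unanimity test used as the consistency criterion, and the majority-vote fallback are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def sparse_to_dense_answer(
    answer_fn: Callable[[int], str],
    frame_schedule: Sequence[int] = (32, 64, 128),  # assumed schedule, not from the paper
    num_samples: int = 5,                           # assumed number of sampled outputs
) -> Tuple[str, int]:
    """Return (answer, frames_used) from a sparse-to-dense inference loop.

    answer_fn(num_frames) is a hypothetical stand-in for sampling one answer
    from a video reasoning model that sees `num_frames` frames.
    """
    if not frame_schedule:
        raise ValueError("frame_schedule must contain at least one frame count")
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")

    majority = ""
    for num_frames in frame_schedule:
        # Sample several answers at the current frame density.
        answers = [answer_fn(num_frames) for _ in range(num_samples)]
        majority, votes = Counter(answers).most_common(1)[0]
        # Assumed consistency criterion: stop when all samples agree.
        if votes == num_samples:
            return majority, num_frames
        # Otherwise fall through and retry with a denser frame sample.

    # Assumed fallback: majority vote at the densest setting.
    return majority, frame_schedule[-1]
```

In this sketch, easy questions stop at the sparsest setting, and the extra compute of denser frame sampling is spent only on inputs where the sampled outputs are inconsistent.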

Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal • 2025

Related benchmarks

Task	Dataset	Result
Video Question Answering	VideoMME	Accuracy 63	251
Long Video Understanding	LongVideoBench (val)	Accuracy 56.6	225
Video Understanding	VideoMME	Overall Score 63	222
Long Video Understanding	LVBench	Accuracy 43.2	218
Video Question Answering	LongVideoBench	Accuracy 56.6	210
Video Question Answering	VideoMMMU	Accuracy 52.7	140
Long Video Understanding	Video-MME Long	Accuracy 54.1	92
Video Question Answering	LongVideoBench (val)	Accuracy 56.6	87
Audio-Visual Question Answering	AVQA	Accuracy 86.6	85
Video Understanding	MMVU	Accuracy 66.4	76

Showing 10 of 28 rows
