
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

About

Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach that improves video reasoning capability with drastically better data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to use computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by 2.4% in accuracy using only 3.6% of the training samples. Specifically, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
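
The sparse-to-dense TTS idea described above (start with few frames, add more only when the model's outputs disagree) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract only states that frames are added iteratively based on output consistency, so the specifics here are assumptions: the hypothetical `sample_answers` callback stands in for a video reasoning model, the frame schedule `(32, 64, 128)` and sample count are placeholder values, "consistency" is taken to mean that all sampled answers agree, and a majority vote is used as the fallback at the densest setting.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def sparse_to_dense_tts(
    sample_answers: Callable[[int, int], List[str]],
    frame_schedule: Sequence[int] = (32, 64, 128),  # assumed placeholder values
    num_samples: int = 5,                           # assumed placeholder value
) -> Tuple[str, int]:
    """Return (answer, frames_used) for one video question.

    sample_answers(num_frames, num_samples) is a hypothetical callback that
    runs the model num_samples times on num_frames frames and returns the
    final answers as strings.
    """
    if not frame_schedule:
        raise ValueError("frame_schedule must contain at least one frame count")

    answers: List[str] = []
    for num_frames in frame_schedule:
        answers = sample_answers(num_frames, num_samples)
        # Assumed consistency criterion: every sampled answer is identical.
        if len(set(answers)) == 1:
            return answers[0], num_frames

    # Assumed fallback: no frame count gave agreement, so take the majority
    # answer from the densest (last) setting.
    majority, _ = Counter(answers).most_common(1)[0]
    return majority, frame_schedule[-1]
```

In this sketch, easy questions stop at the sparsest frame count, so extra frames (and the extra compute they cost) are spent only on questions where the sampled answers conflict.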

Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal • 2025

Related benchmarks

Task                       Dataset                 Result             Rank
Video Understanding        VideoMME                Overall Score 63   192
Long Video Understanding   LongVideoBench (val)    Accuracy 56.6      139
Video Question Answering   VideoMME                --                 99
Long Video Understanding   LVBench                 Accuracy 43.2      63
Video Question Answering   VideoMMMU               Accuracy 52.7      52
Long Video Understanding   Video-MME Overall       Accuracy 63        39
Long Video Understanding   Video-MME Long          Accuracy 54.1      37
Video Question Answering   LongVideoBench          Accuracy 56.6      34
Video Reasoning            Video-MMMU              Accuracy 52.7      32
Video Reasoning            Video-Holmes            Score 40.7         20
Showing 10 of 14 rows
