
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

About

Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach that improves video reasoning capability with drastically better data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to use computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by 2.4% in accuracy while using only 3.6% of the training samples. Specifically, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark. Notably, our pure-RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
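
The sparse-to-dense TTS idea (add frames only while the model's outputs disagree) can be illustrated with a minimal sketch. The abstract does not give the procedure in detail, so everything concrete here is an assumption made for illustration: the `sample_answers` callback standing in for the video LLM, the frame schedule, the number of samples, unanimous agreement as the consistency test, and the majority-vote fallback at the densest setting.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def sparse_to_dense_answer(
    sample_answers: Callable[[int, int], List[str]],
    frame_schedule: Sequence[int] = (32, 64, 128),  # hypothetical schedule
    num_samples: int = 5,                           # hypothetical sample count
) -> Tuple[str, int]:
    """Return (answer, frames_used) for one video question.

    `sample_answers(num_frames, num_samples)` is a hypothetical callback that
    runs the model on `num_frames` sampled frames and returns `num_samples`
    independently sampled final answers.
    """
    if not frame_schedule:
        raise ValueError("frame_schedule must contain at least one frame count")

    answers: List[str] = []
    for num_frames in frame_schedule:
        answers = sample_answers(num_frames, num_samples)
        if not answers:
            raise ValueError("sample_answers returned no answers")
        # Assumed consistency test: stop early when all samples agree.
        if len(set(answers)) == 1:
            return answers[0], num_frames

    # Assumed fallback: majority vote over the densest pass.
    majority, _ = Counter(answers).most_common(1)[0]
    return majority, frame_schedule[-1]
```

Under these assumptions, easy questions exit at the sparsest frame count, and only questions with inconsistent outputs pay for denser passes, which is where the compute saving would come from.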

Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Video Question Answering | VideoMME | Accuracy | 63 | 251 |
| Long Video Understanding | LongVideoBench (val) | Accuracy | 56.6 | 225 |
| Video Understanding | VideoMME | Overall Score | 63 | 222 |
| Long Video Understanding | LVBench | Accuracy | 43.2 | 218 |
| Video Question Answering | LongVideoBench | Accuracy | 56.6 | 210 |
| Video Question Answering | VideoMMMU | Accuracy | 52.7 | 140 |
| Long Video Understanding | Video-MME Long | Accuracy | 54.1 | 92 |
| Video Question Answering | LongVideoBench (val) | Accuracy | 56.6 | 87 |
| Audio-Visual Question Answering | AVQA | Accuracy | 86.6 | 85 |
| Video Understanding | MMVU | Accuracy | 66.4 | 76 |
Showing 10 of 28 rows
