Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

About

Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic "grounding" and "captioning" capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously navigate grounding and captioning tasks for video reasoning, eliminating the need for constructing and calling external tools. Extensive experiments demonstrate that Video-Thinker achieves significant performance gains on both in-domain tasks and challenging out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Our Video-Thinker-7B substantially outperforms existing baselines such as Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.

Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Runhao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, Xuelian Cheng • 2025
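
The abstract names Group Relative Policy Optimization (GRPO) as the second training stage. As a rough illustration of the group-relative idea only, the sketch below normalizes each sampled response's reward against the other responses drawn for the same prompt. The function name, the reward values, the `eps` term, and the use of the population standard deviation are assumptions made for this example; the abstract does not describe the paper's actual reward design or implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for several responses sampled for ONE prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to the same video question
# (1.0 = correct, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group mean receive a positive advantage and those below receive a negative one, so no separate learned value model is needed to form the baseline.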

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	Accuracy 58.9	635
Long Video Understanding	LVBench	Accuracy 54.3	267
Temporal Video Understanding	TempCompass	Accuracy 67.5	160
Video Understanding	MLVU	Accuracy 65.2	147
General Video Understanding	Video-MME	Accuracy 60	139
Video Understanding	LongVideoBench	Accuracy 56	128
Video Reasoning	VSI-Bench	Accuracy 26.3	101
Video Understanding	MMVU	Accuracy 64.5	91
Video Understanding	Video-MME	Accuracy 61	90
Video Reasoning	Video-Holmes	Accuracy 43.1	89

Showing 10 of 31 rows
