VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

About

Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
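The inference-time behavior described above (answer first, then reason only when the initial answer is uncertain) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how the confidence score is computed or what cutoff is used, so the length-normalized token probability, the `threshold` value of 0.9, and the names `Draft`, `answer_confidence`, `reason_when_necessary`, and `think_and_review` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Draft:
    """Hypothetical container for the model's initial (pre-reasoning) answer."""
    answer: str
    token_logprobs: Sequence[float]  # log-probabilities of the answer tokens


def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Assumed confidence score: geometric mean of the answer-token probabilities."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def reason_when_necessary(
    draft: Draft,
    think_and_review: Callable[[str], str],
    threshold: float = 0.9,  # assumed cutoff, not taken from the paper
) -> Tuple[str, bool]:
    """Return (final_answer, used_reasoning).

    If the initial answer is confident enough, return it directly and skip
    reasoning; otherwise call the (more expensive) reasoning step, which
    produces the reviewed answer.
    """
    if answer_confidence(draft.token_logprobs) >= threshold:
        return draft.answer, False
    return think_and_review(draft.answer), True
```

In this sketch, `think_and_review` stands in for continuing generation through the reasoning trace and the reviewed answer. A lower `threshold` skips reasoning more often, which shortens responses at the risk of keeping more wrong initial answers.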

Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong • 2026

Related benchmarks

Task	Dataset	Result
Video Question Answering	VideoMME	Accuracy 71.7	254
Video Question Answering	LongVideoBench	Accuracy 67.4	224
Video Question Answering	MLVU	Accuracy 65.1	213
Multimodal Understanding	MMMU (val)	--	211
Video Question Answering	VideoMMMU	Accuracy 65	166
Multimodal Mathematical Reasoning	MathVista mini	Accuracy 0.737	124
Temporal Grounding	Charades-STA	mIoU 63.7	120
Temporal Grounding	ActivityNet	Recall@0.3 74.1	111
Video Question Answering	LVBench	Accuracy 41.5	108
Video Reasoning	Video-MMMU	Accuracy 55.6	83

Showing 10 of 28 rows
