VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

About

Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities in Large Language Models (LLMs), and has recently been extended to multimodal LLMs (MLLMs). Nevertheless, reasoning about videos, a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logical, temporal, and causal structures inherent in video data. To fill this gap, we propose VideoRFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VideoRFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge in achieving this in the video domain is the scarcity of large-scale, high-quality video CoT datasets. We address this by building a multi-expert-driven, cognition-inspired CoT curation pipeline. First, we devise a cognition-inspired prompting strategy that elicits preliminary CoTs from a reasoning LLM based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by an MLLM conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets: VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in the visual input. Extensive experiments show that VideoRFT achieves state-of-the-art performance on six video reasoning benchmarks.
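The abstract does not say how the semantic-consistency reward is computed. As a rough illustration only, the sketch below assumes that the reasoning text and the sampled video frames have already been embedded into a shared vector space by some vision-language encoder, and scores their agreement with a clipped cosine similarity. The function names, the mean pooling over frames, and the clipping to [0, 1] are all assumptions made for this example, not the authors' formulation.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors: Sequence[Vector]) -> list[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one frame embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def semantic_consistency_reward(
    text_embedding: Vector,
    frame_embeddings: Sequence[Vector],
) -> float:
    """Hypothetical reward: similarity between reasoning text and pooled frames.

    Assumption: both inputs come from the same (unspecified) encoder.
    Negative similarities are clipped to 0 so the reward stays in [0, 1].
    """
    video_embedding = mean_vector(frame_embeddings)
    return max(0.0, cosine_similarity(text_embedding, video_embedding))
```

In an RL loop, a term like this would typically be added to the task reward (for example, answer correctness) so that reasoning traces which drift away from the visual evidence score lower. How the terms are weighted is likewise not stated in the abstract.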

Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, Tianfei Zhou • 2025

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	Accuracy 62.1	635
Video Understanding	VideoMME	Score (Overall) 59.8	369
Long Video Understanding	LVBench	Accuracy 34.7	267
Long Video Understanding	MLVU	--	265
Video Question Answering	VideoMME	Accuracy 64.1	254
Video Understanding	MLVU	Score 66.6	233
Video Understanding	VideoMME	--	222
Video Question Answering	MLVU	Accuracy 45	213
Video Understanding	MVBench (test)	Accuracy 62.1	201
Video Question Answering	VideoMMMU	Accuracy 52.32	166

Showing 10 of 114 rows
