
VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

About

Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities in Large Language Models (LLMs), and has recently been extended to multimodal LLMs (MLLMs). Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logical, temporal, and causal structures inherent in video data. To fill this gap, we propose VideoRFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VideoRFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge in achieving this in the video domain is the scarcity of large-scale, high-quality video CoT datasets. We address this by building a multi-expert-driven, cognition-inspired CoT curation pipeline. First, we devise a cognition-inspired prompting strategy that guides a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by an MLLM conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets, i.e., VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that VideoRFT achieves state-of-the-art performance on six video reasoning benchmarks.
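The abstract does not spell out how the semantic-consistency reward is computed. As a rough illustration of the general idea only (rewarding reasoning text whose embedding agrees with the visual evidence), the sketch below scores a reasoning trace by its mean cosine similarity to per-frame embeddings. The function names, the use of cosine similarity, the averaging over frames, and the clipping to [0, 1] are all assumptions made for this example and are not taken from the paper; the embeddings are assumed to come from some shared text-image encoder that is not shown.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_consistency_reward(
    text_emb: Sequence[float],
    frame_embs: Sequence[Sequence[float]],
) -> float:
    """Hypothetical reward (an assumption, not the paper's formula):
    mean text-to-frame cosine similarity, clipped to the range [0, 1]."""
    if not frame_embs:
        return 0.0  # no visual evidence to compare against
    mean_sim = sum(cosine(text_emb, f) for f in frame_embs) / len(frame_embs)
    return max(0.0, min(1.0, mean_sim))
```

In an RL setup, a term like this would typically be added to the task reward (for example, answer correctness) so that reasoning which drifts away from the video content earns less; how the actual method weights or gates the term is not stated in the abstract.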

Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, Tianfei Zhou • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Video Understanding | MVBench | Accuracy | 62.1 | 425
Video Understanding | VideoMME | Score (Long) | 50.7 | 248
Video Understanding | VideoMME | -- | -- | 222
Video Question Answering | VideoMME | Accuracy | 64.1 | 210
Video Question Answering | MLVU | Accuracy | 45 | 143
Video Question Answering | VideoMMMU | Accuracy | 52.32 | 124
Video Question Answering | LVBench | Accuracy | 41.1 | 108
Video Understanding | Video-MME | Overall Score | 59.8 | 92
Video Question Answering | MVBench | Accuracy | 62.1 | 90
Video Understanding | Video-MME without subtitles | Overall Score | 59.8 | 89
Showing 10 of 60 rows
