VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

About

Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities in Large Language Models (LLMs), and has recently been extended to multimodal LLMs (MLLMs). Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logical, temporal, and causal structures inherent in video data. To fill this gap, we propose VideoRFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VideoRFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge in achieving this in the video domain is the scarcity of large-scale, high-quality video CoT datasets. We address this by building a multi-expert-driven, cognition-inspired CoT curation pipeline. First, we devise a cognition-inspired prompting strategy that elicits preliminary CoTs from a reasoning LLM based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by an MLLM conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets, i.e., VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that VideoRFT achieves state-of-the-art performance on six video reasoning benchmarks.
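
The abstract does not spell out how the semantic-consistency reward is computed. As a rough illustration of the general idea (rewarding reasoning text whose meaning agrees with the visual input), the sketch below scores a reasoning trace by the cosine similarity between a text embedding and a video embedding, clipped to the range [0, 1]. Everything here is an assumption made for illustration: the function names, the use of cosine similarity, the clipping, and the premise that both embeddings come from a shared vision-language embedding space. It is not the reward defined by the VideoRFT authors.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_consistency_reward(
    reasoning_emb: Sequence[float], video_emb: Sequence[float]
) -> float:
    """Hypothetical reward: text-video similarity clipped to [0, 1].

    Assumes both embeddings were produced by encoders that share one
    embedding space (an assumption of this sketch, not of the source).
    """
    similarity = cosine_similarity(reasoning_emb, video_emb)
    # Negative similarity earns no reward; the upper clip guards against
    # floating-point values that land slightly above 1.
    return min(1.0, max(0.0, similarity))


# Toy embeddings standing in for real encoder outputs.
aligned = semantic_consistency_reward([1.0, 0.0], [1.0, 0.0])
unrelated = semantic_consistency_reward([1.0, 0.0], [0.0, 1.0])
opposed = semantic_consistency_reward([1.0, 0.0], [-1.0, 0.0])
```

In an RL setup, a term like this would typically be combined with other reward signals such as answer correctness; how the terms are weighted is left open here because the abstract does not say.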

Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, Tianfei Zhou • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Understanding | MVBench | Accuracy: 62.1 | 563 |
| Video Understanding | VideoMME | Score (Overall): 59.8 | 357 |
| Video Question Answering | VideoMME | Accuracy: 64.1 | 251 |
| Video Understanding | VideoMME | -- | 222 |
| Long Video Understanding | LVBench | Accuracy: 34.7 | 218 |
| Video Question Answering | MLVU | Accuracy: 45 | 194 |
| Video Understanding | MVBench (test) | Accuracy: 62.1 | 190 |
| Temporal Video Understanding | TempCompass | Accuracy: 73.7 | 141 |
| Video Question Answering | VideoMMMU | Accuracy: 52.32 | 140 |
| Video Understanding | LongVideoBench | -- | 123 |
Showing 10 of 95 rows
