Reinforcing Video Reasoning with Focused Thinking

About

Recent advancements in reinforcement learning, particularly through Group Relative Policy Optimization (GRPO), have significantly improved multimodal large language models for complex reasoning tasks. However, two critical limitations persist: 1) existing methods often produce unfocused, verbose reasoning chains that obscure salient spatiotemporal cues, and 2) binary rewarding fails to account for partially correct answers, resulting in high reward variance and inefficient learning. In this paper, we propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density (estimated by intra-group information entropy), suppressing redundant tokens such as generic reasoning prefixes. Furthermore, we reformulate RL training by shifting from single-choice to multi-choice QA tasks, where soft rewards enable finer-grained gradient estimation by distinguishing partial correctness. Additionally, we propose question-answer inversion, a data augmentation strategy that generates diverse multi-choice samples from existing benchmarks. Experiments demonstrate state-of-the-art performance on several video reasoning and general understanding benchmarks. Notably, TW-GRPO achieves 50.4% accuracy on CLEVRER (an 18.8% improvement over Video-R1) and 65.8% on MMVU. Our code is available at https://github.com/longmalongma/TW-GRPO.
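The contrast between binary and soft rewards on a multi-choice question can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the reward formula, so the Jaccard overlap used in `soft_reward` is an assumption chosen only because it grades partial correctness. The function names and the example group are likewise hypothetical. The advantage normalization follows the usual GRPO recipe of standardizing rewards within a sampled group.

```python
from statistics import mean, pstdev
from typing import List, Set


def binary_reward(pred: Set[str], gold: Set[str]) -> float:
    """Exact-match reward: 1 only if the predicted option set equals the gold set."""
    return 1.0 if pred == gold else 0.0


def soft_reward(pred: Set[str], gold: Set[str]) -> float:
    """Assumed soft reward: Jaccard overlap between predicted and gold option sets.

    This is an illustrative stand-in, not the reward defined in the paper.
    """
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one sampled group (GRPO-style advantages)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers to one multi-choice question.
gold = {"A", "C"}
group = [{"A", "C"}, {"A"}, {"A", "B", "C"}, {"D"}]

binary = [binary_reward(p, gold) for p in group]
soft = [soft_reward(p, gold) for p in group]

binary_adv = group_relative_advantages(binary)
soft_adv = group_relative_advantages(soft)
```

Under the binary reward, the three imperfect answers all receive the same advantage, so the gradient cannot tell a nearly correct answer from a completely wrong one. Under the assumed soft reward, the four answers receive four distinct advantages ordered by how much of the gold set they recover.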

Jisheng Dang, Jingze Wu, Teng Wang, Xuanhui Lin, Nannan Zhu, Hongbo Chen, Wei-Shi Zheng, Meng Wang, Tat-Seng Chua • 2025

Related benchmarks

Task                                Dataset             Result               Rank
Video Understanding                 MVBench             Accuracy 63.3        425
Video Understanding                 VideoMME            --                   222
Video Understanding                 Video-MME           Overall Score 60.1   92
Multi-modal Video Understanding     MVBench             --                   70
Video Understanding                 LVBench             Average Score 33.8   67
Video Understanding                 VideoMME            --                   60
Video Perception                    Perception (test)   Accuracy 54.9        57
Grounded Video Question Answering   NExT-GQA            mIoU 17.7            44
Video Reasoning                     Video-Holmes        Score 38.4           34
Temporal Reasoning                  TempCompass         Accuracy 73.3        33
Showing 10 of 22 rows
