
Difference Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning

About

Vision-language models (VLMs) are increasingly aligned via Group Relative Policy Optimization (GRPO)-style training. However, relying solely on terminal outcome rewards yields sparse credit assignment in multi-step reasoning, weakening the link between visual evidence and intermediate steps and often causing unstable optimization and visual hallucinations. We propose Difference Feedback, which automatically constructs token- and step-level supervision masks by repairing erroneous reasoning trajectories and explicitly marking the positions that require correction. Without costly large-scale step-by-step human annotation, our method enables process-level visual alignment and integrates seamlessly into existing GRPO-like frameworks. Experiments on multimodal reasoning benchmarks, including MMStar and MathVista, show an average improvement of 3% under matched compute budgets. Our approach offers an effective, low-cost route to accurate vision-reasoning process alignment.
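The core idea of comparing an erroneous trajectory against its repaired version to locate the positions needing correction can be sketched with a standard sequence diff. This is a minimal illustrative sketch, not the paper's implementation: the function name, the token representation, and the use of `difflib` are all assumptions; in the paper's setting the resulting mask would weight per-token credit inside a GRPO-style objective.

```python
# Hypothetical sketch: mark the token positions in a repaired reasoning
# trajectory that differ from the erroneous original. Those positions are
# the ones that "required correction" and could receive extra weight in a
# process-level (GRPO-style) training signal. Names are illustrative.
import difflib

def difference_mask(erroneous_tokens, repaired_tokens):
    """Return a 0/1 mask over `repaired_tokens`: 1 where the repair
    inserted or replaced tokens relative to the erroneous trajectory."""
    mask = [0] * len(repaired_tokens)
    sm = difflib.SequenceMatcher(a=erroneous_tokens, b=repaired_tokens)
    for op, _i1, _i2, j1, j2 in sm.get_opcodes():
        if op in ("replace", "insert"):
            for j in range(j1, j2):
                mask[j] = 1
    return mask

bad  = ["the", "sum", "is", "7", "so", "answer", "7"]
good = ["the", "sum", "is", "9", "so", "answer", "9"]
print(difference_mask(bad, good))  # only the corrected "9" positions are 1
```

In a real pipeline the masked positions would be produced automatically by a repair model rather than hand-written, which is what lets the method avoid step-by-step human annotation.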

Feiding, Yongkang Zhang, Yuhao Liao, Zijian Zeng, Chunzheng Zhu, Yaozong Zheng, Yafei Liu, Yeling Peng, Youwei Wang, Sibo Wang, Huiming Yang, Linglin Liao, Shunzhi Yang • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Multimodal Understanding | MMMU | Accuracy | 74.8 | 437
Diagram Understanding | AI2D | Accuracy | 87.8 | 247
Mathematical Multimodal Reasoning | MathVerse | Accuracy | 63.5 | 221
Mathematical Multimodal Reasoning | MathVista | Accuracy | 79.9 | 218
Visual Mathematical Reasoning | MathVision | Accuracy | 58.2 | 186
Multimodal Reasoning | MMStar | Accuracy | 71.5 | 143
