Differential Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning
About
Vision-language models (VLMs) are increasingly aligned via Group Relative Policy Optimization (GRPO)-style training. However, relying solely on terminal outcome rewards yields sparse credit assignment in multi-step reasoning, weakening the link between visual evidence and intermediate steps and often causing unstable optimization and visual hallucinations. We propose Differential Feedback, which automatically constructs token- and step-level supervision masks by repairing erroneous reasoning trajectories, explicitly marking the key positions that require correction. Without costly large-scale step-by-step human annotation, our method enables process-level visual alignment and integrates seamlessly into existing GRPO-like frameworks. Experiments on multimodal reasoning benchmarks including MMStar and MathVista show an average 3% improvement under matched compute budgets. Our approach offers an effective, low-cost route to accurate alignment of the vision-reasoning process.
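To make the mechanism concrete, here is a minimal sketch of one way such a mask could be built, assuming the repaired trajectory is available as a token sequence alongside the erroneous one: a token-level diff (via Python's `difflib`) flags the positions the repair changed, and those positions upweight the per-token advantage in a GRPO-style update. The names `correction_mask` and `masked_advantages` and the `1 + alpha * mask` weighting scheme are illustrative assumptions, not the paper's actual implementation.

```python
import difflib
from typing import List

def correction_mask(erroneous: List[str], repaired: List[str]) -> List[float]:
    """Flag tokens of the erroneous trajectory that the repaired
    trajectory replaces or deletes (1.0 = needs correction, 0.0 = kept)."""
    mask = [0.0] * len(erroneous)
    matcher = difflib.SequenceMatcher(a=erroneous, b=repaired, autojunk=False)
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                mask[i] = 1.0
    return mask

def masked_advantages(advantages: List[float],
                      mask: List[float],
                      alpha: float = 1.0) -> List[float]:
    """Upweight the group-relative advantage at flagged positions so the
    policy gradient concentrates on the tokens that needed correction."""
    return [a * (1.0 + alpha * m) for a, m in zip(advantages, mask)]

# Toy example: the repair changes only the two erroneous "12" tokens.
bad = "The area is 12 because 3 * 5 = 12".split()
good = "The area is 15 because 3 * 5 = 15".split()
print(correction_mask(bad, good))
# [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

In an actual GRPO-like pipeline, such a mask would presumably be produced once per repaired rollout and broadcast against the per-token log-probabilities before the clipped policy-gradient loss is computed, leaving the rest of the training loop unchanged.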
Related benchmarks
| Task | Dataset | Accuracy (%) | Rank |
|---|---|---|---|
| Multimodal Understanding | MMMU | 74.8 | 437 |
| Diagram Understanding | AI2D | 87.8 | 247 |
| Mathematical Multimodal Reasoning | MathVerse | 63.5 | 221 |
| Mathematical Multimodal Reasoning | MathVista | 79.9 | 218 |
| Visual Mathematical Reasoning | MathVision | 58.2 | 186 |
| Multimodal Reasoning | MMStar | 71.5 | 143 |