Differential Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning
About
Vision-language models (VLMs) are increasingly aligned via Group Relative Policy Optimization (GRPO)-style training. However, relying solely on terminal outcome rewards yields sparse credit assignment in multi-step reasoning, weakening the link between visual evidence and intermediate steps and often causing unstable optimization and visual hallucinations. We propose Differential Feedback, which automatically constructs token- and step-level supervision masks by repairing erroneous reasoning trajectories, explicitly marking the key positions that require correction. Without costly large-scale step-by-step human annotation, our method enables process-level visual alignment and integrates seamlessly into existing GRPO-like frameworks. Experiments on multimodal reasoning benchmarks including MMStar and MathVista show an average 3% improvement under matched compute budgets. Our approach offers an effective, low-cost route to accurate alignment of the vision-reasoning process.
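To make the mechanism concrete, here is a minimal sketch of one way such a mask could be built, assuming the repaired trajectory is available as a token sequence alongside the erroneous one: a token-level diff (via Python's `difflib`) flags the positions the repair changed, and those positions upweight the per-token advantage in a GRPO-style update. The names `correction_mask` and `masked_advantages` and the `1 + alpha * mask` weighting scheme are illustrative assumptions, not the paper's actual implementation.

```python
import difflib
from typing import List

def correction_mask(erroneous: List[str], repaired: List[str]) -> List[float]:
    """Flag tokens of the erroneous trajectory that the repaired
    trajectory replaces or deletes (1.0 = needs correction, 0.0 = kept)."""
    mask = [0.0] * len(erroneous)
    matcher = difflib.SequenceMatcher(a=erroneous, b=repaired, autojunk=False)
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                mask[i] = 1.0
    return mask

def masked_advantages(advantages: List[float],
                      mask: List[float],
                      alpha: float = 1.0) -> List[float]:
    """Upweight the group-relative advantage at flagged positions so the
    policy gradient concentrates on the tokens that needed correction."""
    return [a * (1.0 + alpha * m) for a, m in zip(advantages, mask)]

# Toy example: the repair changes only the two erroneous "12" tokens.
bad = "The area is 12 because 3 * 5 = 12".split()
good = "The area is 15 because 3 * 5 = 15".split()
print(correction_mask(bad, good))
# [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

In an actual GRPO-like pipeline, such a mask would presumably be produced once per repaired rollout and broadcast against the per-token log-probabilities before the clipped policy-gradient loss is computed, leaving the rest of the training loop unchanged.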
Related benchmarks
| Task | Dataset | Accuracy (%) | Rank |
|---|---|---|---|
| Multimodal Understanding | MMMU | 74.8 | 437 |
| Diagram Understanding | AI2D | 87.8 | 247 |
| Mathematical Multimodal Reasoning | MathVerse | 63.5 | 221 |
| Mathematical Multimodal Reasoning | MathVista | 79.9 | 218 |
| Visual Mathematical Reasoning | MathVision | 58.2 | 186 |
| Multimodal Reasoning | MMStar | 71.5 | 143 |