SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
About
Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
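To make the second stage more concrete, the following is a minimal sketch of the group-relative advantage step that GRPO-style training uses: several responses are sampled for one prompt, each is scored, and each score is normalized against the group's mean and standard deviation. The abstract does not give the SRPO reward formula, so `toy_reward`, its `target_tokens` and `bonus` parameters, and the linear length penalty are hypothetical stand-ins chosen only to illustrate "reward concise reflection, discourage redundancy"; they are not the authors' reward.

```python
from statistics import mean, pstdev


def toy_reward(correct: bool, reflection_tokens: int,
               target_tokens: int = 100, bonus: float = 0.25) -> float:
    """Hypothetical reward, NOT the paper's formula.

    Gives 1.0 for a correct answer, plus a small bonus when a reflection
    is present. The bonus shrinks linearly once the reflection exceeds
    `target_tokens` and reaches zero at twice that length.
    """
    task = 1.0 if correct else 0.0
    if reflection_tokens <= 0:
        return task  # no reflection, no bonus
    overshoot = max(0, reflection_tokens - target_tokens) / target_tokens
    return task + bonus * max(0.0, 1.0 - overshoot)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style).

    Assumption: population standard deviation is used; `eps` guards
    against division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four sampled responses to the same prompt.
rewards = [
    toy_reward(correct=True, reflection_tokens=80),    # correct, concise reflection
    toy_reward(correct=True, reflection_tokens=400),   # correct, redundant reflection
    toy_reward(correct=False, reflection_tokens=0),    # wrong, no reflection
    toy_reward(correct=False, reflection_tokens=90),   # wrong, concise reflection
]
advantages = group_relative_advantages(rewards)
```

In this toy group, the correct response with a concise reflection receives the largest advantage and the correct but redundant one a smaller advantage, which is the qualitative behavior the abstract describes.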
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Mathematical Reasoning | MathVista | Accuracy | 79.8 | 189 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy | 69.7 | 167 |
| Visual Mathematical Reasoning | MathVerse | Accuracy | 64.2 | 73 |
| Visual Mathematical Reasoning | WeMath | Accuracy | 76.9 | 53 |
| Multimodal Reasoning | MMStar | Accuracy | 73.3 | 29 |
| Chart-based Reasoning | CharXivRQ | Accuracy | 52.7 | 16 |
| Vision-Language Hallucination Evaluation | HallBench | Accuracy | 61.2 | 15 |