Learning Self-Correction in Vision-Language Models via Rollout Augmentation
About
Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, because effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout-augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency through rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding conflicting learning signals and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 points while requiring only $0.72\times$ the training time per step.
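The core idea of recombining rollouts into dense self-correction supervision can be sketched as follows. This is a minimal illustration, not the paper's implementation: the rollout fields, the pairing of a failed attempt with a successful one, and the `mask_first_attempt` flag are all assumptions standing in for the paper's correction-specific rollout construction and response-masking strategy.

```python
import random

def synthesize_correction_rollouts(rollouts, num_examples):
    """Recombine existing rollouts into self-correction training examples.

    Each rollout is assumed to be a dict: {"response": str, "correct": bool}.
    A synthesized example pairs a failed attempt with a successful one, so
    the model sees dense "wrong attempt -> correction" supervision instead
    of waiting for such behavior to emerge spontaneously during RL.
    """
    wrong = [r for r in rollouts if not r["correct"]]
    right = [r for r in rollouts if r["correct"]]
    examples = []
    for _ in range(num_examples):
        if not wrong or not right:
            break  # need at least one of each to form a correction pair
        w, c = random.choice(wrong), random.choice(right)
        examples.append({
            # The first attempt is conditioned on but masked out of the
            # loss, so the gradient supervises only the correction segment
            # (a stand-in for the paper's response-masking strategy).
            "prompt_attempt": w["response"],
            "correction_target": c["response"],
            "mask_first_attempt": True,
        })
    return examples
```

Because the examples are assembled from rollouts that were already sampled, the augmentation adds supervision without extra generation cost, which is consistent with the reported per-step training-time reduction.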
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Mathematical Reasoning | MathVista | Accuracy | 82.1 | 189 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy | 72.4 | 167 |
| Visual Mathematical Reasoning | MathVerse | Accuracy | 68.5 | 73 |
| Visual Mathematical Reasoning | WeMath | Accuracy | 84.0 | 53 |
| Multimodal Reasoning | MMStar | Accuracy | 75.2 | 29 |
| Chart-based Reasoning | CharXiv (RQ) | Accuracy | 55.7 | 16 |
| Vision-Language Hallucination Evaluation | HallBench | Accuracy | 64.2 | 15 |