VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
About
Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow thinking, we introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances the state-of-the-art scores on MathVista and MathVerse to 80.4% and 63.5%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MathVision, MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with OpenAI-o1. Our empirical results demonstrate the effectiveness of both techniques.
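To make the two techniques more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the usual GRPO-style group normalization of rewards, which yields all-zero advantages whenever every rollout in a group receives the same reward. The `SelectiveReplayBuffer` class, its sampling weights (proportional to the absolute advantage), and the trigger text in `force_rethink` are hypothetical choices made for illustration; the abstract does not specify them.

```python
import random
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class SelectiveReplayBuffer:
    """Hypothetical buffer that keeps only samples with non-zero advantage."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.items = []  # list of (sample, advantage) pairs

    def add_group(self, samples, advantages):
        # Samples with zero advantage carry no gradient signal, so skip them.
        for sample, adv in zip(samples, advantages):
            if abs(adv) > 1e-8:
                self.items.append((sample, adv))
        # Keep only the most recent `capacity` entries.
        self.items = self.items[-self.capacity:]

    def sample(self, k, rng=random):
        # Assumption: draw with probability proportional to |advantage|.
        if not self.items:
            return []
        weights = [abs(adv) for _, adv in self.items]
        return rng.choices(self.items, weights=weights, k=k)


def force_rethink(rollout, trigger=" Wait, let me re-check this."):
    """Append a rethinking trigger to a rollout (trigger text is assumed)."""
    return rollout + trigger


# A group where every rollout is correct gives no learning signal.
uniform = group_advantages([1, 1, 1, 1])
# A mixed group gives non-zero advantages that are worth replaying.
mixed = group_advantages([1, 0, 0, 0])

buffer = SelectiveReplayBuffer()
buffer.add_group(["a", "b", "c", "d"], uniform)  # nothing is stored
buffer.add_group(["e", "f", "g", "h"], mixed)    # all four are stored
replayed = buffer.sample(3)
```

In this sketch, the uniform group contributes nothing to the buffer, while the mixed group supplies samples that can be mixed back into later batches when fresh rollouts produce few non-zero advantages.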
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 84.19 | 935 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 56.19 | 281 |
| Visual Mathematical Reasoning | MathVista | Accuracy | 73.7 | 189 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy | 56.67 | 167 |
| Multimodal Reasoning | MMMU (val) | Accuracy | 56.7 | 114 |
| Hallucination Evaluation | HallusionBench | -- | -- | 93 |
| Optical Character Recognition | OCRBench | OCRBench Score | 85.4 | 83 |
| Multimodal Reasoning | MMStar | Accuracy | 64.2 | 81 |
| Visual Mathematical Reasoning | MathVerse | Accuracy | 51.7 | 73 |
| Visual Mathematical Reasoning | MathVision | Accuracy | 29.7 | 63 |