VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
About
Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow thinking, we introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances the state of the art on MathVista and MathVerse, scoring 80.4% and 63.5%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MathVision, MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with OpenAI-o1. Our empirical results demonstrate the effectiveness of both techniques.
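The abstract names three mechanisms without detailing them: GRPO-style advantages, Selective Sample Replay, and Forced Rethinking. The following is a minimal illustrative sketch, not the authors' implementation. It assumes the standard group-normalized advantage used by GRPO, under which a group of rollouts with identical rewards yields all-zero advantages (one plausible reading of the "vanishing advantages problem"). The `ReplayBuffer` class, its advantage-weighted sampling, the `RETHINK_TRIGGER` string, and the `force_rethink` helper are all hypothetical names and design choices introduced here for illustration.

```python
import random
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the GRPO style: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class ReplayBuffer:
    """Hypothetical buffer that keeps only rollouts carrying a learning signal."""

    def __init__(self):
        self.items = []  # list of (rollout, advantage) pairs

    def add_group(self, rollouts, advantages, tol=1e-8):
        # Assumption: rollouts whose advantage is (near) zero are discarded.
        for rollout, adv in zip(rollouts, advantages):
            if abs(adv) > tol:
                self.items.append((rollout, adv))

    def sample(self, k, rng=random):
        # Assumption: sampling probability is proportional to |advantage|.
        if not self.items:
            return []
        weights = [abs(adv) for _, adv in self.items]
        return rng.choices(self.items, weights=weights, k=k)


# Hypothetical trigger text; the abstract only says a "rethinking trigger" is appended.
RETHINK_TRIGGER = "\nWait, let me double-check this."


def force_rethink(rollout_text, trigger=RETHINK_TRIGGER):
    """Append the trigger so that generation continues with a reflection step."""
    return rollout_text + trigger


# A group where every rollout earns the same reward gives no gradient signal.
uniform = group_advantages([1, 1, 1, 1])
# A mixed group gives non-zero advantages that can be stored and replayed.
mixed = group_advantages([1, 0, 0, 0])

buffer = ReplayBuffer()
buffer.add_group(["a", "b", "c", "d"], uniform)
buffer.add_group(["e", "f", "g", "h"], mixed)
replayed = buffer.sample(2, rng=random.Random(0))
```

In this sketch the uniform group contributes nothing to the buffer, while all four rollouts of the mixed group are retained, so a training batch padded from the buffer would contain only samples with non-zero advantages.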
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 84.19 | 1455 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 56.19 | 431 |
| Visual Question Answering | ChartQA | Accuracy | 84.7 | 371 |
| Multimodal Understanding | MMStar | Accuracy | 61.8 | 324 |
| Object Hallucination | POPE Adversarial | Accuracy | 82.8 | 288 |
| Object Hallucination | POPE (Random) | F1 Score | 83.2 | 285 |
| Visual Mathematical Reasoning | MathVista | Accuracy | 74.9 | 278 |
| Object Hallucination | POPE Popular | F1 Score | 82.6 | 273 |
| Mathematical Reasoning | MathVista | Accuracy | 74.9 | 257 |
| Visual Question Answering | AI2D | Accuracy | 80.8 | 249 |