VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

About

Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow thinking, we introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances the state-of-the-art scores on MathVista and MathVerse to 80.4% and 63.5%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MathVision, MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with OpenAI-o1. Our empirical results show the effectiveness of both techniques.
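The abstract does not say how SSR is implemented, so the following is only a minimal sketch of the vanishing advantages problem it refers to. It assumes GRPO's usual group normalization, in which each rollout's advantage is its reward minus the group mean, divided by the group standard deviation, and it assumes binary correctness rewards. The function names `group_advantages` and `select_for_replay` are hypothetical, and the non-zero-advantage filter is a simplified stand-in for the idea of selectively keeping informative samples, not the authors' method.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def select_for_replay(groups, threshold=1e-8):
    """Keep (group_index, sample_index, advantage) for non-zero advantages.

    Assumption: an illustrative filter only, not the paper's SSR procedure.
    """
    kept = []
    for g, rewards in enumerate(groups):
        for i, adv in enumerate(group_advantages(rewards)):
            if abs(adv) > threshold:
                kept.append((g, i, adv))
    return kept


# Hypothetical rewards for three prompts, four rollouts each.
groups = [
    [1.0, 1.0, 1.0, 1.0],  # every rollout correct: all advantages are zero
    [0.0, 0.0, 0.0, 0.0],  # every rollout wrong: all advantages are zero
    [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: non-zero advantages
]
replay = select_for_replay(groups)
```

When every rollout in a group receives the same reward, the group contributes no gradient signal, which is why only the mixed group survives the filter in this example.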

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy 84.19 | 935 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score 56.19 | 281 |
| Visual Mathematical Reasoning | MathVista | Accuracy 73.7 | 189 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy 56.67 | 167 |
| Multimodal Reasoning | MMMU (val) | Accuracy 56.7 | 114 |
| Hallucination Evaluation | HallusionBench | -- | 93 |
| Optical Character Recognition | OCRBench | OCRBench Score 85.4 | 83 |
| Multimodal Reasoning | MMStar | Accuracy 64.2 | 81 |
| Visual Mathematical Reasoning | MathVerse | Accuracy 51.7 | 73 |
| Visual Mathematical Reasoning | MathVision | Accuracy 29.7 | 63 |
Showing 10 of 44 rows
