VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

About

Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow-thinking, we introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances the state-of-the-art scores on MathVista and MathVerse to 80.4% and 63.5%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MathVision, MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with OpenAI-o1. Our empirical results demonstrate the effectiveness of both techniques.
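
The two techniques named in the abstract can be illustrated with a rough sketch. It assumes the standard GRPO formulation, where each rollout's advantage is its reward normalized within its group, so a group with identical rewards yields all-zero advantages (the vanishing advantages problem). The function names, the `|advantage| ** alpha` replay weighting, and the trigger text are hypothetical choices for illustration; this page does not specify the authors' implementation.

```python
import random
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # If every rollout in the group has the same reward, all advantages
    # are 0 and the group contributes no gradient signal.
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_for_replay(buffer, k, alpha=1.0, rng=random):
    """Sample k buffered (rollout, advantage) pairs with nonzero advantage.

    Assumption: sampling weight is |advantage| ** alpha (hypothetical).
    """
    informative = [(x, a) for x, a in buffer if a != 0.0]
    if not informative:
        return []
    weights = [abs(a) ** alpha for _, a in informative]
    return rng.choices(informative, weights=weights, k=k)


def force_rethink(rollout_text, trigger=" Wait, let me double-check this."):
    """Append a rethinking trigger to a finished rollout (hypothetical text)."""
    return rollout_text + trigger


if __name__ == "__main__":
    print(group_advantages([1, 1, 1, 1]))  # uniform rewards: no signal
    print(group_advantages([1, 0, 1, 0]))  # mixed rewards: nonzero signal
    buf = [("a", 0.0), ("b", 1.0), ("c", -1.0)]
    print(select_for_replay(buf, k=2))
    print(force_rethink("The answer is 12."))
```

In this sketch, replayed samples would be mixed into the training batch so that fewer updates are spent on groups with no learning signal, and the text after the appended trigger would be generated by the model as a continuation of the rollout.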

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy 84.19 | 2019 |
| Visual Question Answering | ChartQA | Accuracy 84.7 | 519 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score 56.23 | 517 |
| Mathematical Reasoning | MathVista | Score 74.9 | 474 |
| Optical Character Recognition | OCRBench | -- | 433 |
| Multimodal Understanding | MMStar | Accuracy 61.8 | 407 |
| Mathematical Reasoning | MathVista | Accuracy 74.9 | 382 |
| Object Hallucination | POPE Popular | F1 Score 82.6 | 372 |
| Visual Mathematical Reasoning | MathVista | Accuracy 74.9 | 366 |
| Object Hallucination | POPE Adversarial | Accuracy 82.8 | 353 |
Showing 10 of 207 rows
