VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

About

Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with those of fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow thinking, we introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances the state of the art on MathVista and MathVerse to 80.4% and 63.5%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MathVision, MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with OpenAI-o1. Our empirical results show the effectiveness of our approaches.
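The abstract names the vanishing advantages problem, Selective Sample Replay, and Forced Rethinking without giving their details, so the following is only a minimal sketch under stated assumptions. It uses the standard group-relative advantage from GRPO (reward minus the group mean, divided by the group standard deviation) to show how advantages collapse to zero when every rollout in a group earns the same reward. The `ReplayBuffer` class, its rule of keeping only non-zero-advantage samples, its sampling in proportion to absolute advantage, and the trigger phrase in `force_rethink` are all hypothetical illustrations, not the authors' implementation.

```python
import random
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt's rollouts.

    Each reward is centered on the group mean and scaled by the group
    standard deviation (population std is an assumption of this sketch).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


class ReplayBuffer:
    """Hypothetical buffer that stores only samples with non-zero advantage."""

    def __init__(self, seed=0):
        self.items = []  # list of (sample_id, advantage) pairs
        self.rng = random.Random(seed)

    def add_group(self, sample_ids, advantages, tol=1e-8):
        # Skip samples whose advantage vanished; they carry no gradient signal.
        for sample_id, adv in zip(sample_ids, advantages):
            if abs(adv) > tol:
                self.items.append((sample_id, adv))

    def sample(self, k):
        # Assumed rule: draw with probability proportional to |advantage|.
        if not self.items:
            return []
        weights = [abs(adv) for _, adv in self.items]
        return self.rng.choices(self.items, weights=weights, k=k)


def force_rethink(rollout_text, trigger=" Wait, let me double-check."):
    """Append a rethinking trigger to a rollout (the phrase is hypothetical)."""
    return rollout_text + trigger


# A group with mixed outcomes yields informative advantages.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
# A group where every rollout is correct yields all-zero advantages.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])

buffer = ReplayBuffer(seed=0)
buffer.add_group(["a0", "a1", "a2", "a3"], mixed)
buffer.add_group(["b0", "b1", "b2", "b3"], uniform)  # nothing is stored
replayed = buffer.sample(2)
```

In this sketch the uniform group contributes nothing to the buffer, so later batches can be topped up from stored samples that still carry a learning signal.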

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy 84.19 | 1455 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score 56.19 | 431 |
| Visual Question Answering | ChartQA | Accuracy 84.7 | 371 |
| Multimodal Understanding | MMStar | Accuracy 61.8 | 324 |
| Object Hallucination | POPE Adversarial | Accuracy 82.8 | 288 |
| Object Hallucination | POPE (Random) | F1 Score 83.2 | 285 |
| Visual Mathematical Reasoning | MathVista | Accuracy 74.9 | 278 |
| Object Hallucination | POPE Popular | F1 Score 82.6 | 273 |
| Mathematical Reasoning | MathVista | Accuracy 74.9 | 257 |
| Visual Question Answering | AI2D | Accuracy 80.8 | 249 |
Showing 10 of 154 rows
