VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

About

Recently, slow-thinking systems such as GPT-o1 and DeepSeek-R1 have demonstrated great potential for solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with those of fast-thinking models: for instance, GPT-o1's performance on MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow thinking, we introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances the state of the art on MathVista and MathVerse, achieving 80.4% and 63.5%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MathVision, MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with OpenAI-o1. Our empirical results show the effectiveness of our approaches.
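
The abstract names the two techniques without giving their details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the usual GRPO-style advantage (each reward normalized by its group's mean and standard deviation), under which a group whose rollouts all receive the same reward yields all-zero advantages and therefore no learning signal. The `ReplayBuffer` class, its `alpha` exponent, the |advantage|-weighted sampling rule, and the trigger string in `append_rethink_trigger` are all hypothetical names and choices introduced for illustration.

```python
import random
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: reward normalized within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class ReplayBuffer:
    """Hypothetical buffer that keeps only rollouts with non-zero advantage."""

    def __init__(self, alpha=1.0, seed=0):
        self.items = []  # list of (rollout, advantage) pairs
        self.alpha = alpha  # assumed knob: how strongly to favor large |advantage|
        self.rng = random.Random(seed)

    def add_group(self, rollouts, rewards):
        # Groups with identical rewards produce zero advantages and are skipped.
        for rollout, adv in zip(rollouts, group_advantages(rewards)):
            if abs(adv) > 1e-8:
                self.items.append((rollout, adv))

    def sample(self, k):
        # Draw k stored rollouts with probability proportional to |advantage|^alpha.
        if not self.items:
            return []
        weights = [abs(adv) ** self.alpha for _, adv in self.items]
        return self.rng.choices(self.items, weights=weights, k=k)


def append_rethink_trigger(rollout_text, trigger=" Wait, let me re-check this."):
    """Append an (assumed) rethinking trigger so generation can continue from it."""
    return rollout_text + trigger
```

In this sketch, a training step would top up its batch with `buffer.sample(k)` whenever too many fresh groups have uniform rewards, and would pass a fraction of rollouts through `append_rethink_trigger` before letting the model continue generating. How often to do either is a design choice the abstract does not specify.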

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 84.19	2019
Visual Question Answering	ChartQA	Accuracy 84.7	519
Multimodal Reasoning	MM-Vet	MM-Vet Score 56.23	517
Mathematical Reasoning	MathVista	Score 74.9	474
Optical Character Recognition	OCRBench	--	433
Multimodal Understanding	MMStar	Accuracy 61.8	407
Mathematical Reasoning	MathVista	Accuracy 74.9	382
Object Hallucination	POPE Popular	F1 Score 82.6	372
Visual Mathematical Reasoning	MathVista	Accuracy 74.9	366
Object Hallucination	POPE Adversarial	Accuracy 82.8	353

Showing 10 of 207 rows

