All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

About

Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). Despite this promise, the underlying mechanisms that drive the effectiveness of RL models, as well as their limitations, remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL-trained and base models: the former engage in deeper yet narrower reasoning, while base models, though less refined along any individual path, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, which leads to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks. Project page: https://xytian1008.github.io/MUPO/
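As background for the GRPO discussion above, the following is a minimal sketch of the group-relative advantage that GRPO is commonly described as using: each sampled response to a prompt is scored relative to the other responses in the same group. The function name, the `eps` constant, the example rewards, and the choice of population standard deviation are assumptions made for illustration (implementations differ on these details). It is not taken from the paper and does not reproduce MUPO, whose details are not given here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Assumption: population standard deviation and a small eps for stability;
    actual implementations may use a different estimator or constant.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled responses: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Responses above the group mean get a positive advantage, those below a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Because the advantage depends only on how a response compares with others in its own group, a group whose responses all receive the same reward yields zero advantage for every response in this sketch.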

Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Peter Tu, Jing Zhang • 2026

Related benchmarks

Task	Dataset	Result
Multimodal Capability Evaluation	MM-Vet	--	429
Mathematical Reasoning	WeMath	Accuracy 44.1	317
Mathematical Reasoning	MathVerse	--	266
Hallucination and Visual Reasoning Evaluation	HallusionBench	--	61
Mathematical Reasoning	Math Benchmarks Average	Accuracy (ACC) 58.8	47
Mathematical Reasoning	MathVista	--	37
Mathematical Reasoning	MathVision	Top-1 Accuracy 31.3	27
Mathematical Reasoning	LogicVista	--	27
Mathematical Reasoning	Geometry3K	Accuracy 52.1	26
General-purpose Multimodal Understanding	MMStar	--	13

Showing 10 of 12 rows
