Does RLVR Extend Reasoning Boundaries? Investigating Capability Expansion in Vision-Language Models
About
Recent studies posit that Reinforcement Learning with Verifiable Rewards (RLVR) primarily amplifies behaviors already present in the pre-training distribution rather than inducing new capabilities, but these insights are predominantly limited to language-only domains, leaving the dynamics of visual-centric spatial reasoning under-explored. To examine the impact of RLVR on the capability boundaries of Vision-Language Models (VLMs), we introduce **Ariadne**, a controlled framework based on synthetic maze navigation in which reasoning difficulty is precisely regulated by path length and number of turns. We demonstrate that applying RLVR extends the spatial reasoning boundary: the optimized policy succeeds on problems where the base VLM consistently attains 0% accuracy even under increasing pass@k sampling budgets, indicating that it navigates search spaces that were effectively unreachable under the base distribution. Furthermore, although the model is trained exclusively on synthetic mazes, it improves on two real-world navigation benchmarks (MapBench and ReasonMap) in a zero-shot setting. These out-of-domain gains suggest genuine expansion of spatial reasoning capability rather than mere sampling efficiency.
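The pass@k sampling budgets mentioned above are typically computed with the standard unbiased estimator used for sampled generation benchmarks: given `n` attempts of which `c` are correct, pass@k is the probability that at least one of `k` randomly drawn attempts succeeds. A minimal sketch (the function name and signature are illustrative, not the repository's actual API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples drawn per problem
    c: number of correct samples among the n
    k: sampling budget being evaluated

    Returns 1 - C(n - c, k) / C(n, k), the probability that a
    random subset of k samples contains at least one success.
    """
    if n - c < k:
        # Every size-k subset must include a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that when `c = 0` (as for the base policy on the hardest mazes), pass@k stays at 0 for every budget `k <= n`, which is the failure mode the abstract refers to.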
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| High-level Planning | ReasonMap L (long questions) | Weighted Accuracy | 0.0747 | 3 |
| Navigation | MapBench | Avg Path Length (Google Map) | 1.3 | 3 |
| High-level Planning | ReasonMap S (short questions) | Weighted Accuracy | 14.5 | 3 |