
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

About

Although reinforcement learning (RL) has emerged as a promising approach for improving vision-language models (VLMs) and multimodal large language models (MLLMs), current methods rely heavily on manually curated datasets and costly human verification, which limits scalable self-improvement in multimodal systems. To address this challenge, we propose Vision-Zero, a label-free, domain-agnostic multi-agent self-play framework for self-evolving VLMs through competitive visual games generated from arbitrary image inputs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.
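The Iterative-SPO alternation described above can be sketched as a simple phase schedule. This is a minimal, illustrative sketch only: the function name, parameters, and fixed switching rule below are assumptions for illustration, not the paper's actual implementation (which alternates between the two phases during training, possibly with an adaptive criterion).

```python
# Hypothetical sketch of the Iterative-SPO training schedule: alternate
# between label-free self-play on "Who Is the Spy"-style games and RLVR
# (reinforcement learning with verifiable rewards). All names here are
# illustrative assumptions, not taken from the released code.

def iterative_spo_schedule(n_iterations=3, self_play_steps=2, rlvr_steps=2):
    """Return the alternating phase schedule as a list of phase labels."""
    schedule = []
    for _ in range(n_iterations):
        # Phase 1: agents play the spy game on sampled images; the game
        # trajectories themselves become training data (no annotation).
        schedule.extend(["self_play"] * self_play_steps)
        # Phase 2: RLVR consolidates gains with verifiable rewards,
        # mitigating the plateau seen in self-play-only training.
        schedule.extend(["rlvr"] * rlvr_steps)
    return schedule
```

In the released framework the two phases share one policy; the sketch only makes the alternation pattern concrete.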

Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao (2025)

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | – | – | 1455 |
| Multimodal Evaluation | MME | Score | 2460 | 658 |
| Visual Question Answering | ChartQA | Accuracy | 86.3 | 371 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 56.98 | 317 |
| Visual Mathematical Reasoning | MathVista | Accuracy | 70.2 | 278 |
| Visual Question Answering | AI2D | Accuracy | 84.8 | 249 |
| Science Question Answering | ScienceQA (test) | Average Accuracy | 88.5 | 245 |
| Mathematical Multimodal Reasoning | MathVerse | Accuracy | 46.8 | 221 |
| Visual Mathematical Reasoning | MathVision | Accuracy | 26.12 | 186 |
| Multimodal Math Reasoning | MathVision | Accuracy | 27.6 | 183 |

Showing 10 of 66 rows.
