
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

About

Although reinforcement learning (RL) has emerged as a promising approach for improving vision-language models (VLMs) and multimodal large language models (MLLMs), current methods rely heavily on manually curated datasets and costly human verification, which limits scalable self-improvement in multimodal systems. To address this challenge, we propose Vision-Zero, a label-free, domain-agnostic multi-agent self-play framework for self-evolving VLMs through competitive visual games generated from arbitrary image inputs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.
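The Iterative-SPO alternation described above can be sketched as a simple phase schedule. This is a minimal, illustrative sketch only: the function name, parameters, and fixed switching rule below are assumptions for illustration, not the paper's actual implementation (which alternates between the two phases during training, possibly with an adaptive criterion).

```python
# Hypothetical sketch of the Iterative-SPO training schedule: alternate
# between label-free self-play on "Who Is the Spy"-style games and RLVR
# (reinforcement learning with verifiable rewards). All names here are
# illustrative assumptions, not taken from the released code.

def iterative_spo_schedule(n_iterations=3, self_play_steps=2, rlvr_steps=2):
    """Return the alternating phase schedule as a list of phase labels."""
    schedule = []
    for _ in range(n_iterations):
        # Phase 1: agents play the spy game on sampled images; the game
        # trajectories themselves become training data (no annotation).
        schedule.extend(["self_play"] * self_play_steps)
        # Phase 2: RLVR consolidates gains with verifiable rewards,
        # mitigating the plateau seen in self-play-only training.
        schedule.extend(["rlvr"] * rlvr_steps)
    return schedule
```

In the released framework the two phases share one policy; the sketch only makes the alternation pattern concrete.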

Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao (2025)

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | – | – | 1455 |
| Multimodal Evaluation | MME | Score | 2460 | 658 |
| Visual Question Answering | ChartQA | Accuracy | 86.3 | 371 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 56.98 | 317 |
| Visual Mathematical Reasoning | MathVista | Accuracy | 70.2 | 278 |
| Visual Question Answering | AI2D | Accuracy | 84.8 | 249 |
| Science Question Answering | ScienceQA (test) | Average Accuracy | 88.5 | 245 |
| Mathematical Multimodal Reasoning | MathVerse | Accuracy | 46.8 | 221 |
| Visual Mathematical Reasoning | MathVision | Accuracy | 26.12 | 186 |
| Multimodal Math Reasoning | MathVision | Accuracy | 27.6 | 183 |

Showing 10 of 66 rows.
