
V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation

About

Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that enables self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 points and general vision-centric tasks by +2.6 points, demonstrating the potential of self-improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V-Zero.
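The abstract compresses the whole training recipe into a few sentences; a short sketch may help make the moving parts concrete. The Python below is a hypothetical illustration, not the authors' code: the helper names (`majority_vote`, `group_relative_advantages`, `dual_track_reward`) are ours, the GRPO step is reduced to its standard group-normalized advantage, and the reward shape is one plausible reading of "contrasts intuitive guesses with reasoned results". The actual implementation is in the linked repository.

```python
# Hypothetical sketch of V-Zero's core training signals; names and reward
# shapes are illustrative assumptions, not the authors' implementation.
from collections import Counter
from statistics import mean, pstdev


def majority_vote(answers):
    """Pseudo-label = the most frequent final answer among the Solver's
    own sampled responses, plus its vote share as a confidence proxy."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against the mean and
    standard deviation of its own sampling group (no value network)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dual_track_reward(intuitive_answer, reasoned_answer, pseudo_label):
    """One plausible reading of the Questioner's dual-track reward: a
    question scores highly when deliberate reasoning reaches the
    pseudo-label but a direct intuitive guess does not, i.e. the
    question is answerable yet non-trivial."""
    reasoned_ok = reasoned_answer == pseudo_label
    intuitive_ok = intuitive_answer == pseudo_label
    return 1.0 if reasoned_ok and not intuitive_ok else 0.0


# Toy demonstration with strings standing in for model outputs.
samples = ["7", "7", "5", "7"]                    # Solver's sampled answers
pseudo, share = majority_vote(samples)            # ("7", 0.75)
solver_rewards = [1.0 if a == pseudo else 0.0 for a in samples]
print(group_relative_advantages(solver_rewards))  # ~[0.58, 0.58, -1.73, 0.58]
print(dual_track_reward("5", "7", pseudo))        # 1.0: reasoning beats intuition
```

In the full loop, scalar rewards like these would drive GRPO policy-gradient updates for both roles at each iteration, so harder questions and more reliable pseudo-labels can reinforce each other.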

Han Wang, Yi Yang, Jingyuan Hu, Minfeng Zhu, Wei Chen • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multimodal Understanding | MMMU | Accuracy | 58.6 | 275 |
| Mathematical Reasoning | MathVista mini (test) | Accuracy | 69.2 | 67 |
| Visual Perception | MMStar | Accuracy | 65.7 | 20 |
| Logical Reasoning | LogicVista | Accuracy | 48.6 | 19 |
| Mathematical Reasoning | MathVerse Vision Only | Accuracy | 43.9 | 14 |
| Mathematical Reasoning | MathVision mini (test) | Accuracy | 0.27 | 8 |
