V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation

About

Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that facilitates self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 and general vision-centric by +2.6, demonstrating the potential of self-improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V-Zero

Han Wang, Yi Yang, Jingyuan Hu, Minfeng Zhu, Wei Chen• 2026

Related benchmarks

Task	Dataset	Result
Multimodal Understanding	MMMU	Accuracy58.6	437
Long Video Understanding	MLVU	Accuracy64.6	205
Logical reasoning	LogicVista	Accuracy48.6	113
Temporal Grounding	ActivityNet	Recall@0.337.21	102
Long Video Understanding	LongVideoBench	Accuracy60.6	97
Long Video Understanding	Video-MME Long	Accuracy56	92
Video Reasoning	VideoMMMU	Accuracy64.33	89
Mathematical Reasoning	MathVista mini (test)	Accuracy69.2	75
Video Reasoning	VideoMathQA	Accuracy24.52	61
Video Reasoning	MMVU	Accuracy68.96	57

Showing 10 of 20 rows

Other info

Follow for update

@wizwand_team Discord