
V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation

About

Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that facilitates self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 and general vision-centric tasks by +2.6, demonstrating the potential of self-improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V-Zero
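
To make the Solver side of the loop concrete, the following is a minimal sketch of two ingredients the abstract names: a pseudo-label obtained by majority voting over sampled responses, and group-relative advantages in the style of GRPO. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 0/1 agreement reward, and the `eps` constant are hypothetical, and the Questioner's dual-track reward is not modeled because the abstract does not specify it.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote(answers):
    """Return (pseudo_label, agreement_ratio) for a list of sampled answers.

    Ties are broken by first occurrence, which is an arbitrary choice here.
    """
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    This is the group-relative normalization commonly used with GRPO;
    eps is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four Solver responses to one generated question.
samples = ["12", "12", "15", "12"]
label, agreement = majority_vote(samples)

# Assumed reward: 1 if a response matches the pseudo-label, else 0.
rewards = [1.0 if a == label else 0.0 for a in samples]
advantages = group_relative_advantages(rewards)
```

In this toy group the responses that agree with the majority receive a positive advantage and the dissenting response a negative one. When every sample agrees, all advantages are zero, so such a question contributes no learning signal to the Solver under this simplified reward.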

Han Wang, Yi Yang, Jingyuan Hu, Minfeng Zhu, Wei Chen • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multimodal Understanding | MMMU | Accuracy 58.6 | 437 |
| Long Video Understanding | MLVU | Accuracy 64.6 | 205 |
| Logical Reasoning | LogicVista | Accuracy 48.6 | 113 |
| Temporal Grounding | ActivityNet | Recall@0.3 37.21 | 102 |
| Long Video Understanding | LongVideoBench | Accuracy 60.6 | 97 |
| Long Video Understanding | Video-MME Long | Accuracy 56 | 92 |
| Video Reasoning | VideoMMMU | Accuracy 64.33 | 89 |
| Mathematical Reasoning | MathVista mini (test) | Accuracy 69.2 | 75 |
| Video Reasoning | VideoMathQA | Accuracy 24.52 | 61 |
| Video Reasoning | MMVU | Accuracy 68.96 | 57 |
Showing 10 of 20 rows
