RISE: Reliable Improvement in Self-Evolving Vision-Language Models

About

Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to obtain, especially for reasoning-intensive multimodal tasks where questions, answers, and feedback signals must be carefully designed. This motivates self-evolving learning, where a model improves itself through a dual-role closed loop: a questioner autonomously poses questions and a solver learns to solve them. However, we observe that current VLM self-evolving methods still face three major challenges: coarse-grained role alternation delays the interaction between question generation and solver adaptation; generated questions can progressively degrade in quality; and question types may collapse toward a narrow distribution. These issues limit the efficiency and reliability of self-evolution. Thus, we propose RISE, a reliable self-evolving framework for vision-language models. RISE is built on three complementary designs: fine-grained role alternation, which shortens the feedback loop between the questioner and the solver to improve efficiency; a quality supervisor, which improves question validity and pseudo-label reliability; and skill-aware dynamic balancing, which mitigates mode collapse and maintains broad skill coverage during evolution. Together, these components enable more reliable and effective self-evolution from unlabeled images. Experiments on two VLM backbones across seven benchmarks show that RISE consistently improves the base models, yielding broad and sustained gains. Our code is publicly available at https://github.com/AMAP-ML/RISE.
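The abstract does not say how skill-aware dynamic balancing is computed. As an illustration only, one simple way to push back against collapse toward a narrow question distribution is inverse-frequency sampling over question skills. The sketch below assumes this scheme; the function name, the skill labels, and the `smoothing` parameter are hypothetical and are not taken from the RISE code.

```python
from collections import Counter


def skill_sampling_weights(counts, skills, smoothing=1.0):
    """Return sampling probabilities that favor under-represented skills.

    Hypothetical helper: inverse-frequency weighting is an assumption,
    not the balancing rule described by the RISE authors.
    """
    # A skill that has been generated less often gets a larger raw weight.
    raw = {s: 1.0 / (counts.get(s, 0) + smoothing) for s in skills}
    total = sum(raw.values())
    # Normalize so the weights form a probability distribution.
    return {s: w / total for s, w in raw.items()}


# Illustrative skill labels and counts of questions generated so far.
skills = ["counting", "chart_reading", "geometry"]
counts = Counter({"counting": 8, "chart_reading": 1, "geometry": 0})
weights = skill_sampling_weights(counts, skills)
```

With these counts, the never-generated skill receives the largest probability and the over-represented one the smallest, so the next batch of questions is nudged back toward broad coverage.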

Chaoran Xu, Yingmao Miao, Pengfei Zhang, Hao Dou, Lei Sun, Xiangxiang Chu • 2026

Related benchmarks

Task	Dataset	Result
Multimodal Understanding	MM-Vet	--	664
Visual Question Answering	RealworldQA	Accuracy 70.85	327
Mathematical Multimodal Reasoning	MathVista	Accuracy 68.7	276
Mathematical Reasoning	MathVerse	--	266
Multimodal Math Reasoning	MathVision	Accuracy 36.1	263
Mathematical Multimodal Reasoning	MathVerse	Accuracy 49.11	259
Chart Understanding	ChartQA	Accuracy 83.64	159
Multimodal Understanding	MMMU	Accuracy 59.63	107
Multimodal Understanding	MMMU, MMVet, RealWQA, ChartQA, MathVerse, MathVision, MathVista Aggregate	Weighted Average Score 57.37	8
Mathematical Reasoning	MathVision	Gain Score 8.78	6

Showing 10 of 15 rows
