RISE: Reliable Improvement in Self-Evolving Vision-Language Models

About

Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to obtain, especially for reasoning-intensive multimodal tasks where questions, answers, and feedback signals must be carefully designed. This motivates self-evolving learning, where a model improves itself through a dual-role closed loop: a questioner autonomously poses questions and a solver learns to solve them. However, we observe that current VLM self-evolving methods still face three major challenges: coarse-grained role alternation delays the interaction between question generation and solver adaptation; generated questions can progressively degrade in quality; and question types may collapse toward a narrow distribution. These issues limit the efficiency and reliability of self-evolution. Thus, we propose RISE, a reliable self-evolving framework for vision-language models. RISE is built on three complementary designs: fine-grained role alternation, which shortens the feedback loop between the questioner and the solver to improve efficiency; a quality supervisor, which improves question validity and pseudo-label reliability; and skill-aware dynamic balancing, which mitigates mode collapse and maintains broad skill coverage during evolution. Together, these components enable more reliable and effective self-evolution from unlabeled images. Experiments on two VLM backbones across seven benchmarks show that RISE consistently improves the base models, yielding broad and sustained gains. Our code is publicly available at https://github.com/AMAP-ML/RISE.
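The abstract does not specify how skill-aware dynamic balancing is computed. As a rough illustration of the general idea only, the sketch below assumes one simple scheme: skills that have been generated less often so far receive a higher sampling probability for the next question. The function name, the skill labels, the counts, and the inverse-frequency rule are all hypothetical and are not taken from the RISE code.

```python
from collections import Counter


def skill_sampling_weights(skill_counts, skills, smoothing=1.0):
    """Return one probability per skill, favoring rarely generated skills.

    Assumption for illustration: weight is proportional to
    1 / (count + smoothing). This is not the rule used by RISE.
    """
    inverse = {s: 1.0 / (skill_counts.get(s, 0) + smoothing) for s in skills}
    total = sum(inverse.values())
    return {s: w / total for s, w in inverse.items()}


# Hypothetical skill labels and generation counts.
skills = ["counting", "chart_reading", "geometry"]
counts = Counter({"counting": 9, "chart_reading": 1})

weights = skill_sampling_weights(counts, skills)
for skill in skills:
    print(f"{skill}: {weights[skill]:.4f}")
```

Under this assumed rule, a skill that has never been generated ("geometry" here) gets the largest weight, which counteracts the collapse toward a narrow question distribution described above. The smoothing term keeps the weight finite when a count is zero.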

Chaoran Xu, Yingmao Miao, Pengfei Zhang, Hao Dou, Lei Sun, Xiangxiang Chu • 2026

Related benchmarks

Task | Dataset | Result | Rank
Multimodal Understanding | MM-Vet | -- | 631
Visual Question Answering | RealworldQA | Accuracy 70.85 | 259
Mathematical Multimodal Reasoning | MathVerse | Accuracy 49.11 | 259
Mathematical Multimodal Reasoning | MathVista | Accuracy 68.7 | 258
Multimodal Math Reasoning | MathVision | Accuracy 36.1 | 246
Mathematical Reasoning | MathVerse | -- | 183
Chart Understanding | ChartQA | Accuracy 83.64 | 159
Multimodal Understanding | MMMU | Accuracy 59.63 | 76
Multimodal Understanding | MMMU, MMVet, RealWQA, ChartQA, MathVerse, MathVision, MathVista Aggregate | Weighted Average Score 57.37 | 8
Mathematical Reasoning | MathVision | Gain Score 8.78 | 6
