Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training

About

Post-training with explicit reasoning traces is a common way to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained while challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs rely too heavily on linguistic priors and neglect visual cues. To address these shortcomings, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy that reuses partially correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model's focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, i.e., supervised fine-tuning and preference learning, and effectively enhances multimodal reasoning performance across various MLLMs and tasks, e.g., bringing up to +13.66% average performance gains for Qwen2.5-VL-3B-Instruct.
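The abstract does not define the vision-aware attention score, so the following is only an illustrative sketch of one way such a quantity could be computed, not the authors' formulation. It assumes the score is the share of attention mass that generated tokens place on image-token positions, averaged over generated tokens. The function name `vision_attention_score`, its arguments, and the toy inputs are all hypothetical.

```python
from typing import Sequence


def vision_attention_score(
    attention: Sequence[Sequence[float]],
    is_visual: Sequence[bool],
) -> float:
    """Mean share of attention mass that falls on visual-token positions.

    ASSUMPTION: this definition is illustrative and is not taken from the paper.
    attention[i][j] is a hypothetical attention weight from generated token i
    to context position j; is_visual[j] marks the image-token positions.
    """
    if not attention:
        raise ValueError("attention must contain at least one row")
    shares = []
    for row in attention:
        if len(row) != len(is_visual):
            raise ValueError("each row must match the length of is_visual")
        total = sum(row)
        if total <= 0:
            raise ValueError("each row must have positive total weight")
        # Attention mass this generated token places on image tokens.
        visual = sum(w for w, v in zip(row, is_visual) if v)
        shares.append(visual / total)
    # Average the per-token shares over all generated tokens.
    return sum(shares) / len(shares)


# Toy example: two generated tokens attending over four context positions,
# the first two of which are image tokens.
toy_attention = [
    [0.4, 0.2, 0.3, 0.1],
    [0.1, 0.1, 0.5, 0.3],
]
toy_mask = [True, True, False, False]
score = vision_attention_score(toy_attention, toy_mask)
```

Under this assumed definition, a low score would indicate a trace that leans mostly on text context. A trace that attends heavily to the image would score close to 1.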

Qihuang Zhong, Liang Ding, Wenjie Xuan, Juhua Liu, Bo Du, Dacheng Tao • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Mathematical Reasoning	MathVerse	Accuracy	21.76	266
Multimodal Reasoning	MMMU	Accuracy	17.64	220
Chart Understanding and Reasoning	ChartQA	Accuracy	73.68	143
Visual Reasoning	BLINK	Accuracy	51.23	116
Multimodal Medical Reasoning	VQA-RAD	Accuracy (%)	78.49	48
Multimodal Reasoning	ScienceQA	Average Accuracy	87.58	45
Multimodal Reasoning	Medical and Mathematical Multimodal Reasoning (SLAKE, VQA-RAD, Geo3K)	Overall Performance	68.87	36
Multimodal Reasoning	SLAKE	Accuracy	87.61	30
Multimodal Reasoning	Geo3K	Accuracy	45.59	21

