
Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training

About

Post-training with explicit reasoning traces is a common way to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained while challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs rely too heavily on linguistic priors and neglect visual cues. To this end, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy that reuses partially correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model's focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, namely supervised fine-tuning and preference learning, and effectively enhances multimodal reasoning performance across various MLLMs and tasks, e.g., bringing up to +13.66% average performance gains for Qwen2.5-VL-3B-Instruct.
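The abstract does not define the vision-aware attention score, so the following is only an illustrative sketch of one plausible proxy: the share of attention mass that each generated token places on visual (image) tokens, averaged over a reasoning trace. The function names `visual_attention_share` and `trace_visual_score`, the input layout, and the averaging rule are all assumptions made for illustration and are not taken from VISTA.

```python
from typing import Sequence


def visual_attention_share(attn_weights: Sequence[float],
                           is_visual: Sequence[bool]) -> float:
    """Fraction of one attention distribution that falls on visual tokens.

    Hypothetical helper: attn_weights holds one generated token's attention
    over the context, and is_visual marks which context positions are
    image tokens.
    """
    if len(attn_weights) != len(is_visual):
        raise ValueError("attn_weights and is_visual must have the same length")
    total = sum(attn_weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    visual = sum(w for w, v in zip(attn_weights, is_visual) if v)
    return visual / total


def trace_visual_score(per_token_attn: Sequence[Sequence[float]],
                       is_visual: Sequence[bool]) -> float:
    """Mean visual attention share over all generated tokens of a trace.

    Assumed aggregation (a plain average); the paper may aggregate differently.
    """
    if not per_token_attn:
        raise ValueError("per_token_attn must contain at least one row")
    shares = [visual_attention_share(row, is_visual) for row in per_token_attn]
    return sum(shares) / len(shares)


if __name__ == "__main__":
    # Two generated tokens attending over three context positions,
    # of which the first two are image tokens (toy values).
    attn = [[0.5, 0.25, 0.25],
            [0.25, 0.25, 0.5]]
    mask = [True, True, False]
    print(trace_visual_score(attn, mask))  # 0.625
```

Under this assumed definition, a trace with a low score leans mostly on text tokens, which is the kind of language prior bias the abstract describes; a training pipeline could use such a score to filter or rank self-generated traces.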

Qihuang Zhong, Liang Ding, Wenjie Xuan, Juhua Liu, Bo Du, Dacheng Tao • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multimodal Reasoning | MMMU | Accuracy | 17.64 | 208 |
| Mathematical Reasoning | MathVerse | Accuracy | 21.76 | 183 |
| Visual Reasoning | BLINK | Accuracy | 51.23 | 107 |
| Chart Understanding and Reasoning | ChartQA | Accuracy | 73.68 | 87 |
| Multimodal Reasoning | ScienceQA | Average Accuracy | 87.58 | 45 |
| Multimodal Reasoning | Medical and Mathematical Multimodal Reasoning (SLAKE, VQA-Rad, Geo3K) | Overall Performance | 68.87 | 36 |
| Multimodal Medical Reasoning | VQA-RAD | Accuracy (%) | 78.49 | 36 |
| Multimodal Reasoning | Slake | Accuracy | 87.61 | 18 |
| Multimodal Reasoning | Geo3K | Accuracy | 45.59 | 18 |