Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training
About
Post-training with explicit reasoning traces is a common way to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. The self-improvement paradigm has therefore emerged, in which MLLMs generate their own reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained while challenging yet crucial samples are under-trained; and 2) language prior bias, where MLLMs rely too heavily on linguistic priors and neglect visual cues. To address these shortcomings, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy that reuses partially correct reasoning traces for efficient data collection, and then designs a vision-aware attention score that quantifies the model's focus on visual information. Extensive experiments show that VISTA applies to various post-training scenarios, namely supervised fine-tuning and preference learning, and effectively improves multimodal reasoning performance across various MLLMs and tasks, e.g., bringing average performance gains of up to +13.66% for Qwen2.5-VL-3B-Instruct.
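The two components named above can be illustrated with a small sketch. This is not VISTA's actual formulation, which the abstract does not specify. It assumes that a vision-aware score can be approximated by the share of attention mass falling on image tokens, and that prefix resampling means keeping the leading steps of a failed trace and regenerating the rest. The names `vision_attention_share`, `resample_from_prefixes`, `continue_fn`, and `is_correct` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence


def vision_attention_share(attn_row: Sequence[float],
                           is_image_token: Sequence[bool]) -> float:
    """Share of one query token's attention mass that lands on image tokens.

    Assumption: this ratio is only a stand-in for a vision-aware score.
    """
    if len(attn_row) != len(is_image_token):
        raise ValueError("attention row and token mask must have equal length")
    total = sum(attn_row)
    if total <= 0:
        return 0.0
    visual = sum(w for w, is_img in zip(attn_row, is_image_token) if is_img)
    return visual / total


def resample_from_prefixes(
    steps: Sequence[str],
    continue_fn: Callable[[List[str]], List[str]],
    is_correct: Callable[[List[str]], bool],
    tries_per_prefix: int = 2,
) -> Optional[List[str]]:
    """Reuse the leading steps of a failed trace and regenerate the remainder.

    Assumption: prefixes are tried from longest to shortest, so as much of
    the original trace as possible is kept. `continue_fn` stands in for the
    model's sampler and `is_correct` for an answer checker.
    """
    for keep in range(len(steps) - 1, -1, -1):
        prefix = list(steps[:keep])
        for _ in range(tries_per_prefix):
            candidate = prefix + list(continue_fn(prefix))
            if is_correct(candidate):
                return candidate
    return None  # no prefix led to a correct trace
```

In this sketch, a hard sample that never yields a correct trace from scratch can still contribute training data if some prefix of a failed attempt can be completed correctly, which is one way to read the "data imbalance" motivation above.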
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Reasoning | MMMU | Accuracy | 17.64 | 208 |
| Mathematical Reasoning | MathVerse | Accuracy | 21.76 | 183 |
| Visual Reasoning | BLINK | Accuracy | 51.23 | 107 |
| Chart Understanding and Reasoning | ChartQA | Accuracy | 73.68 | 87 |
| Multimodal Reasoning | ScienceQA | Average Accuracy | 87.58 | 45 |
| Multimodal Reasoning | Medical and Mathematical Multimodal Reasoning SLAKE, VQA-Rad, Geo3K | Overall Performance | 68.87 | 36 |
| Multimodal Medical Reasoning | VQA-RAD | Accuracy (%) | 78.49 | 36 |
| Multimodal Reasoning | Slake | Accuracy | 87.61 | 18 |
| Multimodal Reasoning | Geo3K | Accuracy | 45.59 | 18 |