Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
About
Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly their Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset; and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach enhances the multimodal reasoning abilities of both InternVL2-8B and InternVL2-76B. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10$\times$ larger InternVL2-76B. We hope this study will inspire further advancements in MLLMs. Code, data, and models are released.
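The abstract does not spell out the MPO objective, so the following is only a generic illustration of what a preference-optimization loss over (chosen, rejected) response pairs can look like. It uses the standard Direct Preference Optimization (DPO) loss as a stand-in, not the authors' MPO formulation. The function names, the `beta` default, and the example log-probabilities are all assumptions made for this sketch.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed temperature; not taken from the paper
) -> float:
    """Standard DPO loss for one preference pair (illustrative stand-in).

    Each argument is the summed token log-probability of a response under
    the trainable policy or a frozen reference model.
    """
    # Implicit rewards: how much the policy has moved away from the reference.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


# Hypothetical log-probabilities for one preference pair.
loss_good = dpo_loss(-1.0, -5.0, -2.0, -2.0)  # policy prefers the chosen answer
loss_bad = dpo_loss(-5.0, -1.0, -2.0, -2.0)   # policy prefers the rejected answer
print(loss_good < loss_bad)
```

The loss shrinks as the policy assigns relatively more probability to the chosen response than the reference model does, which is the basic mechanism any pairwise preference objective relies on.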
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GeoQA (test) | Accuracy | 52.2 | 31 |
| Mathematical Reasoning | MathVista Math | ALL Accuracy | 75.93 | 19 |
| Mathematical Reasoning | MMStar Math | Accuracy | 70 | 19 |
| Human Preference Alignment | MM-AlignBench 1.0 (test) | Win Rate | 61.5 | 18 |
| Remote Sensing Visual Question Answering | XLRS-Bench | Average Score | 0.462 | 17 |
| Visual Reasoning | V* cross-domain (test) | Accuracy | 72.25 | 15 |
| Visual Reasoning | HR-Bench (test) | Accuracy | 59.69 | 15 |
| Visual Reasoning | VisualProbe (VP) cross-domain (test) | Accuracy | 0.2102 | 15 |