Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
About
Existing open-source multimodal large language models (MLLMs) generally follow a training process of pre-training followed by supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset; and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach enhances the multimodal reasoning abilities of both InternVL2-8B and InternVL2-76B. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10$\times$ larger InternVL2-76B. We hope this study will inspire further advancements in MLLMs. Code, data, and model are released.
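The abstract does not spell out the MPO objective, so the following is only a point of reference: a minimal sketch of a generic pairwise preference loss in the style of Direct Preference Optimization (DPO), one common form of preference optimization. It is not presented as the authors' MPO loss. The function name `dpo_loss`, its arguments, and the default `beta` are assumptions made for illustration; the inputs are summed log-probabilities of a chosen and a rejected response under the trained policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO-style loss for a single (chosen, rejected) response pair.

    Illustrative only: the names and the default beta are assumptions,
    not values taken from the paper.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss falls as the policy shifts probability toward the chosen response.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy matches the reference, the margin is zero and the loss equals log 2; a policy that raises the chosen response relative to the rejected one gets a lower loss.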
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Multimodal Reasoning | MathVerse | Accuracy 43.7 | 259 |
| Mathematical Multimodal Reasoning | MathVista | Accuracy 76.6 | 258 |
| Multimodal Math Reasoning | MathVision | Accuracy 36.2 | 246 |
| Multimodal Math Reasoning | WeMath | Accuracy 37.6 | 211 |
| Mathematical Reasoning | DynaMath | Accuracy 21.2 | 127 |
| Video Quality Assessment | YouTube-UGC | -- | 110 |
| NACE industry classification | MONETA | Accuracy 62.1 | 84 |
| Multimodal Logical Reasoning | LogicVista | Accuracy 50.8 | 63 |
| Mathematical Reasoning | GeoQA (test) | Accuracy 52.2 | 31 |
| Mathematical Reasoning | MathVista Math | ALL Accuracy 75.93 | 19 |