Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
About
Recently, Vision-Language Models (VLMs) have achieved remarkable progress in multimodal tasks, and multimodal instruction data serves as the foundation for enhancing VLM capabilities. Despite the availability of several open-source multimodal datasets, limitations in the scale and quality of open-source instruction data hinder the performance of VLMs trained on these datasets, leading to a significant gap compared to models trained on closed-source data. To address this challenge, we introduce Infinity-MM, a large-scale multimodal instruction dataset. We collected the available multimodal instruction datasets and performed unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy. Furthermore, to enable large-scale expansion of instruction data and support the continuous acquisition of high-quality data, we propose a synthetic instruction generation method based on a tagging system and open-source VLMs. By establishing correspondences between different types of images and associated instruction types, this method can provide essential guidance during data synthesis. Leveraging this high-quality data, we have trained a 2-billion-parameter Vision-Language Model, Aquila-VL-2B, which achieves state-of-the-art (SOTA) performance among models of similar scale. The data is available at: https://huggingface.co/datasets/BAAI/Infinity-MM.
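The correspondence between image types and instruction types described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tag names, the instruction types, and the helpers `pick_instruction_types` and `build_prompt` are hypothetical and do not reproduce the actual tagging system or prompts behind Infinity-MM.

```python
from collections import Counter

# Hypothetical mapping from image tags to suitable instruction types.
# The real tagging system is not specified here; these entries are illustrative.
TAG_TO_INSTRUCTION_TYPES = {
    "chart": ["data extraction", "trend analysis"],
    "document": ["ocr", "summarization"],
    "natural scene": ["detailed description", "object counting"],
}


def pick_instruction_types(image_tags, top_k=2):
    """Return up to top_k instruction types most often linked to the given tags."""
    counts = Counter()
    for tag in image_tags:
        # Unknown tags contribute nothing rather than raising an error.
        for instruction_type in TAG_TO_INSTRUCTION_TYPES.get(tag, []):
            counts[instruction_type] += 1
    return [instruction_type for instruction_type, _ in counts.most_common(top_k)]


def build_prompt(instruction_type):
    """Build a text prompt that would be sent, with the image, to an open-source VLM."""
    return (
        f"Given the image, write one question of type '{instruction_type}' "
        "and then answer it."
    )


if __name__ == "__main__":
    for instruction_type in pick_instruction_types(["chart"]):
        print(build_prompt(instruction_type))
```

In this sketch the mapping acts as the guidance step: an image tagged as a chart only yields chart-appropriate instruction types, so the generating model is not asked for, say, OCR on a landscape photo.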
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Reasoning | MMMU (val) | Accuracy | 47.4 | 114 |
| Information Visual Question Answering | InfoVQA (test) | ANLS | 58.3 | 92 |
| Mathematical Reasoning | MathVista mini | Accuracy | 59 | 72 |
| Visual Question Answering | TextVQA v1.0 (val) | Accuracy | 76.4 | 69 |
| Video Understanding | Video-MME without subtitles | -- | -- | 67 |
| Diagram Understanding | AI2D 1.0 (test) | Accuracy | 75 | 58 |
| Mathematical Reasoning | MathVerse mini | Accuracy | 26.2 | 50 |
| Document Visual Question Answering | DocVQA v1.0 (test) | ANLS | 85 | 49 |
| OCR Evaluation | OCRBench 1.0 | Score | 772 | 33 |
| Chart Understanding | ChartQA 1.0 (test) | Avg Relaxed Accuracy | 76.5 | 33 |