Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

About

Recently, Vision-Language Models (VLMs) have achieved remarkable progress in multimodal tasks, and multimodal instruction data serves as the foundation for enhancing VLM capabilities. Despite the availability of several open-source multimodal datasets, limitations in the scale and quality of open-source instruction data hinder the performance of VLMs trained on these datasets, leading to a significant gap compared to models trained on closed-source data. To address this challenge, we introduce Infinity-MM, a large-scale multimodal instruction dataset. We collected the available multimodal instruction datasets and performed unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy. Furthermore, to enable large-scale expansion of instruction data and support the continuous acquisition of high-quality data, we propose a synthetic instruction generation method based on a tagging system and open-source VLMs. By establishing correspondences between different types of images and associated instruction types, this method can provide essential guidance during data synthesis. Leveraging this high-quality data, we have trained a 2-billion-parameter Vision-Language Model, Aquila-VL-2B, which achieves state-of-the-art (SOTA) performance among models of similar scale. The data is available at: https://huggingface.co/datasets/BAAI/Infinity-MM.

Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, Zhenchong Hu, Bo-Wen Zhang, Jijie Li, Dong Liang, Yingli Zhao, Songjing Wang, Yulong Ao, Yiming Ju, Huanhuan Ma, Xiaotong Li, Haiwen Diao, Yufeng Cui, Xinlong Wang, Yaoqi Liu, Fangxiang Feng, Guang Liu (2024)

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multimodal Reasoning | MMMU (val) | Accuracy 47.4 | 114 |
| Information Visual Question Answering | InfoVQA (test) | ANLS 58.3 | 92 |
| Mathematical Reasoning | MathVista mini | Accuracy 59 | 72 |
| Visual Question Answering | TextVQA v1.0 (val) | Accuracy 76.4 | 69 |
| Video Understanding | Video-MME without subtitles | -- | 67 |
| Diagram Understanding | AI2D 1.0 (test) | Accuracy 75 | 58 |
| Mathematical Reasoning | MathVerse mini | Accuracy 26.2 | 50 |
| Document Visual Question Answering | DocVQA v1.0 (test) | ANLS 85 | 49 |
| OCR Evaluation | OCRBench 1.0 | Score 772 | 33 |
| Chart Understanding | ChartQA 1.0 (test) | Avg Relaxed Accuracy 76.5 | 33 |
Showing 10 of 31 rows
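
The InfoVQA and DocVQA rows report ANLS (Average Normalized Levenshtein Similarity). As a reading aid, here is a minimal sketch of how that metric is commonly computed, assuming the usual threshold of 0.5 and case-insensitive comparison. The function names `levenshtein` and `anls` are illustrative and are not taken from the paper or from any benchmark's official scorer. The sketch returns a value in [0, 1], whereas the table appears to show scores on a 0-100 scale.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best similarity against any reference.

    Similarity is 1 - normalized edit distance, and is set to 0 when the
    normalized distance reaches the threshold (assumed to be 0.5 here).
    """
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `anls(["helo"], [["hello"]])` gives partial credit for a near miss, while an answer that differs from every reference by half or more of its characters scores zero.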
