MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
About
Vision-language supervised fine-tuning effectively enhances the performance of Vision Large Language Models (VLLMs). However, existing visual instruction tuning datasets have the following limitations: (1) Instruction annotation quality: although existing VLLMs exhibit strong performance, the instructions they generate may still contain inaccuracies, such as hallucinations. (2) Instruction and image diversity: the limited range of instruction types and the lack of diversity in image data can limit the model's ability to produce outputs that are varied and close to real-world scenarios. To address these challenges, we construct MMInstruct, a high-quality and diverse visual instruction tuning dataset consisting of 973K instructions from 24 domains and covering four instruction types: judgement, multiple-choice, long visual question answering, and short visual question answering. To build MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction, enabling semi-automatic, low-cost, multi-domain instruction generation at 1/6 the cost of purely manual construction. Through extensive validation and ablation experiments, we demonstrate that MMInstruct significantly improves VLLM performance; for example, a model fine-tuned on MMInstruct achieves new state-of-the-art results on 10 out of 12 benchmarks. The code and data are available at https://github.com/yuecao0119/MMInstruct.
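The sketch below illustrates one plausible shape of the semi-automatic pipeline described above (caption an image with GPT-4V, expand the caption into the four instruction types with GPT-3.5, then queue everything for manual correction). It is not the released MMInstruct code: the functions `call_gpt4v` and `call_gpt35` are hypothetical placeholders for the actual model endpoints, and the prompts and record schema are illustrative assumptions.

```python
# Minimal sketch of a semi-automatic instruction generation flow.
# NOT the authors' implementation; prompts, schema, and the two
# `call_*` helpers below are assumed placeholders for illustration.
from dataclasses import dataclass

INSTRUCTION_TYPES = ["judgement", "multiple_choice", "long_vqa", "short_vqa"]


@dataclass
class InstructionSample:
    image_path: str
    domain: str                       # one of the 24 domains
    instruction_type: str             # one of INSTRUCTION_TYPES
    question: str
    answer: str
    needs_manual_review: bool = True  # samples are checked by annotators


def call_gpt4v(image_path: str, prompt: str) -> str:
    """Placeholder for a GPT-4V call returning a detailed image caption."""
    return f"<detailed caption for {image_path}>"


def call_gpt35(prompt: str) -> str:
    """Placeholder for a GPT-3.5 call turning a caption into Q/A text."""
    return "Q: <question>\nA: <answer>"


def generate_samples(image_path: str, domain: str) -> list[InstructionSample]:
    """Caption the image once, then expand into all four instruction types."""
    caption = call_gpt4v(image_path, "Describe this image in detail.")
    samples = []
    for itype in INSTRUCTION_TYPES:
        qa_text = call_gpt35(
            f"Image caption: {caption}\n"
            f"Write one {itype.replace('_', ' ')} question and its answer."
        )
        question, _, answer = qa_text.partition("\nA: ")
        samples.append(
            InstructionSample(
                image_path=image_path,
                domain=domain,
                instruction_type=itype,
                question=question.removeprefix("Q: "),
                answer=answer,
            )
        )
    return samples  # handed off for manual correction before release


if __name__ == "__main__":
    for s in generate_samples("example.jpg", "sports"):
        print(s.instruction_type, "|", s.question, "|", s.answer)
```

In this sketch the image is captioned once and the caption is reused across all four instruction types, which keeps the number of expensive vision-model calls low; the manual-review flag reflects the human correction stage the dataset description mentions.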
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 80 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 60.9 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 55.8 | 1043 |
| Visual Question Answering | GQA | Accuracy | 62.8 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 86.9 | 935 |
| Multimodal Understanding | MMBench | Accuracy | 72.1 | 367 |
| Multimodal Understanding | SEED-Bench | Accuracy | 64.7 | 203 |
| Multimodal Perception and Cognition | MME | Overall Score | 1630 | 103 |
| Multimodal Understanding (Chinese) | MMBench Chinese | Accuracy | 68 | 47 |
| Visual Question Answering | ScienceQA (SQAI) | Accuracy | 74.2 | 35 |