
MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets

About

Multimodal large language models are typically trained in two stages: first pre-training on image-text pairs, and then fine-tuning with supervised vision-language instruction data. Recent studies have shown that large language models can achieve satisfactory results even with a limited amount of high-quality instruction-following data. In this paper, we introduce MM-LIMA, which is fine-tuned on a small dataset of only 200 examples, roughly 6% of the instruction-following data used to align MiniGPT-4. To achieve this, we first propose several metrics to assess the quality of multimodal instruction data. Based on these metrics, we present an effective and trainable data selector that automatically identifies and filters low-quality vision-language data. Using this method, MM-LIMA outperforms the original MiniGPT-4 on various evaluations. Overall, our findings demonstrate that a smaller amount of high-quality instruction-tuning data is sufficient to enable multimodal large language models to generate better outputs. Our code is available at https://github.com/waltonfuture/InstructionGPT-4.
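The abstract describes the core idea at a high level: score every instruction example with quality metrics and keep only a small top-scoring subset for fine-tuning. The sketch below illustrates that selection step under stated assumptions; the data structure, the `quality_fn` placeholder, and the top-k selection are illustrative choices, not the paper's actual trainable selector or its metrics.

```python
# Hypothetical sketch of quality-based data selection for multimodal instruction tuning.
# The scoring function and the budget of 200 examples mirror the abstract's description,
# but the concrete metric here is a placeholder, not the paper's learned selector.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InstructionExample:
    image_path: str
    instruction: str
    response: str


def select_high_quality(
    examples: List[InstructionExample],
    quality_fn: Callable[[InstructionExample], float],
    budget: int = 200,
) -> List[InstructionExample]:
    """Score each example with a scalar quality metric and keep the top `budget` items.

    `quality_fn` stands in for the paper's trainable data selector; in practice it could
    combine signals such as image-text similarity, response length, or model-based ratings.
    """
    ranked = sorted(examples, key=quality_fn, reverse=True)
    return ranked[:budget]


# Usage with a trivial placeholder metric (response length):
# subset = select_high_quality(all_examples, quality_fn=lambda ex: len(ex.response), budget=200)
```

The design choice being illustrated is simply that the fine-tuning set is chosen by ranking under a quality score rather than by random sampling; how that score is learned is the paper's contribution and is not reproduced here.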

Lai Wei, Xiaozhe Li, Zihao Jiang, Weiran Huang, Lichao Sun • 2023

Related benchmarks

Task                                  | Dataset                        | Result                      | Rank
Text-based Visual Question Answering | TextVQA                        | Accuracy: 20.6              | 807
Multimodal Evaluation                | MME                            | --                          | 658
Multimodal Reasoning                 | MMBench                        | MMBench Accuracy (en): 31.4 | 15
Visual Question Answering            | STVQA, VizWiz, DocVQA, TextVQA | STVQA Score: 14.55          | 3
Multimodal Model Evaluation          | MME                            | Existence Score: 73.33      | 3
