MANTIS: Interleaved Multi-Image Instruction Tuning
About
Large multimodal models (LMMs) have shown strong results on single-image vision-language tasks, but their ability to solve multi-image vision-language tasks remains limited. Existing LMMs such as OpenFlamingo, Emu2, and Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text examples from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. To this end, we meticulously construct Mantis-Instruct, a dataset of 721K multi-image instruction examples, and use it to train a family of Mantis models. Instruction tuning equips Mantis with multi-image skills such as co-reference, comparison, reasoning, and temporal understanding. We evaluate Mantis on 8 multi-image benchmarks and 6 single-image benchmarks. Mantis-Idefics2 achieves SoTA results on all the multi-image benchmarks and beats the strongest multi-image baseline, Idefics2-8B, by an average of 13 absolute points. Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image examples, 200x more data than Mantis-Instruct. Mantis performs equally well on held-in and held-out benchmarks, demonstrating its generalization ability. On single-image benchmarks, Mantis also maintains strong performance on par with CogVLM and Emu2. Our results show that multi-image abilities are not necessarily gained through massive pre-training; they can be acquired through low-cost instruction tuning. The training and evaluation of Mantis have paved the road for future work on improving LMMs' multi-image abilities.
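The abstract describes instruction data that interleaves several images with text in a single example. A minimal sketch of what one such interleaved example might look like (the field names, placeholder token, and schema here are illustrative assumptions, not the actual Mantis-Instruct format):

```python
# Illustrative sketch only: the "<image>" token, field names, and structure
# are assumptions for demonstration, not the real Mantis-Instruct schema.
IMAGE_TOKEN = "<image>"  # placeholder marking where each image appears in the text

def build_example(image_paths, question, answer):
    """Build one multi-image instruction example with interleaved placeholders."""
    # One placeholder per image, interleaved ahead of the question text.
    prompt = " ".join(f"{IMAGE_TOKEN} (image {i + 1})" for i in range(len(image_paths)))
    prompt += f"\n{question}"
    example = {
        "images": list(image_paths),
        "conversation": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ],
    }
    # Sanity check: placeholders and images must stay in one-to-one correspondence.
    assert example["conversation"][0]["content"].count(IMAGE_TOKEN) == len(image_paths)
    return example

# A two-image comparison example, one of the skills listed above.
ex = build_example(
    ["left.jpg", "right.jpg"],
    "Which of the two images shows more people?",
    "The second image shows more people.",
)
```

At training time, each placeholder would be replaced by the corresponding image's visual tokens, so the model learns to ground references like "the second image" across the interleaved sequence.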
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 59.2 | 1285 |
| Video Understanding | MVBench | Accuracy | 51.4 | 425 |
| Science Question Answering | ScienceQA (SQA) | Accuracy | 56.8 | 273 |
| Visual Question Answering | OK-VQA | Accuracy | 55.4 | 260 |
| Visual Question Answering | AI2D | Accuracy | 46.8 | 249 |
| Video Understanding | VideoMME | -- | -- | 222 |
| 3D Question Answering | ScanQA (val) | -- | -- | 217 |
| Visual Question Answering | VQAv2 | Accuracy | 74.9 | 177 |
| Visual Perception | BLINK | -- | -- | 122 |
| Multimodal Understanding | SEED-Bench Image | Accuracy | 59.3 | 121 |