
MANTIS: Interleaved Multi-Image Instruction Tuning

About

Large multimodal models (LMMs) have achieved strong results on single-image vision-language tasks, but their ability to solve multi-image vision-language tasks remains limited. Existing LMMs such as OpenFlamingo, Emu2, and Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text examples from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. We therefore meticulously construct Mantis-Instruct, a dataset of 721K multi-image instruction examples, and use it to train a family of Mantis models. This instruction tuning equips Mantis with multi-image skills such as co-reference, comparison, reasoning, and temporal understanding. We evaluate Mantis on 8 multi-image benchmarks and 6 single-image benchmarks. Mantis-Idefics2 achieves SoTA results on all the multi-image benchmarks and beats the strongest multi-image baseline, Idefics2-8B, by an average of 13 absolute points. Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image examples, roughly 200x more data than Mantis-Instruct contains. Mantis performs equally well on held-in and held-out benchmarks, demonstrating its generalization ability. On single-image benchmarks, Mantis maintains strong performance on par with CogVLM and Emu2. Our results show that multi-image abilities need not be gained through massive pre-training; they can be acquired through low-cost instruction tuning. The training and evaluation of Mantis paves the road for future work on improving LMMs' multi-image abilities.
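To make "interleaved multi-image instruction data" concrete, here is a minimal, hypothetical sketch of what one such training sample could look like. The field names (`images`, `instruction`, `response`, `skill`) and the `<image>` placeholder convention are illustrative assumptions, not the actual Mantis-Instruct schema.

```python
# Hypothetical sketch of one interleaved multi-image instruction sample.
# All field names and file paths below are illustrative assumptions,
# not the real Mantis-Instruct format.
sample = {
    "images": ["room_before.jpg", "room_after.jpg"],  # placeholder paths
    # "<image>" tokens mark where each image is interleaved into the text
    "instruction": "Compare <image> and <image>. What changed between them?",
    "response": "The second image shows the same room after renovation.",
    "skill": "comparison",  # e.g. co-reference, comparison, reasoning, temporal
}

def validate(sample: dict) -> bool:
    """Check that the number of <image> placeholders matches the image list."""
    return sample["instruction"].count("<image>") == len(sample["images"])

print(validate(sample))
```

A check like `validate` is useful when assembling interleaved data, since each placeholder in the text must line up with exactly one image at training time.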

Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen• 2024

Related benchmarks

Task                        Dataset           Result          Rank
Visual Question Answering   TextVQA           Accuracy 59.2   1285
Video Understanding         MVBench           Accuracy 51.4   425
Science Question Answering  ScienceQA (SQA)   Accuracy 56.8   273
Visual Question Answering   OK-VQA            Accuracy 55.4   260
Visual Question Answering   AI2D              Accuracy 46.8   249
Video Understanding         VideoMME          --              222
3D Question Answering       ScanQA (val)      --              217
Visual Question Answering   VQAv2             Accuracy 74.9   177
Visual Perception           BLINK             --              122
Multimodal Understanding    SEED-Bench Image  Accuracy 59.3   121

Showing 10 of 61 rows.
