MANTIS: Interleaved Multi-Image Instruction Tuning
About
Large multimodal models (LMMs) have shown strong results on single-image vision-language tasks, but their ability to solve multi-image vision-language tasks remains limited. Existing LMMs such as OpenFlamingo, Emu2, and Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text examples from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. To this end, we meticulously construct Mantis-Instruct, a dataset of 721K multi-image instruction examples, and use it to train a family of Mantis models. Instruction tuning equips Mantis with multi-image skills such as co-reference, comparison, reasoning, and temporal understanding. We evaluate Mantis on 8 multi-image benchmarks and 6 single-image benchmarks. Mantis-Idefics2 achieves SoTA results on all the multi-image benchmarks and beats the strongest multi-image baseline, Idefics2-8B, by an average of 13 absolute points. Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image examples, 200x more data than Mantis-Instruct. Mantis performs equally well on held-in and held-out benchmarks, demonstrating its generalization ability. On single-image benchmarks, Mantis also maintains strong performance on par with CogVLM and Emu2. Our results show that multi-image abilities are not necessarily gained through massive pre-training; they can be acquired through low-cost instruction tuning. The training and evaluation of Mantis have paved the road for future work on improving LMMs' multi-image abilities.
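The abstract describes instruction data that interleaves several images with text in a single example. A minimal sketch of what one such interleaved example might look like (the field names, placeholder token, and schema here are illustrative assumptions, not the actual Mantis-Instruct format):

```python
# Illustrative sketch only: the "<image>" token, field names, and structure
# are assumptions for demonstration, not the real Mantis-Instruct schema.
IMAGE_TOKEN = "<image>"  # placeholder marking where each image appears in the text

def build_example(image_paths, question, answer):
    """Build one multi-image instruction example with interleaved placeholders."""
    # One placeholder per image, interleaved ahead of the question text.
    prompt = " ".join(f"{IMAGE_TOKEN} (image {i + 1})" for i in range(len(image_paths)))
    prompt += f"\n{question}"
    example = {
        "images": list(image_paths),
        "conversation": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ],
    }
    # Sanity check: placeholders and images must stay in one-to-one correspondence.
    assert example["conversation"][0]["content"].count(IMAGE_TOKEN) == len(image_paths)
    return example

# A two-image comparison example, one of the skills listed above.
ex = build_example(
    ["left.jpg", "right.jpg"],
    "Which of the two images shows more people?",
    "The second image shows more people.",
)
```

At training time, each placeholder would be replaced by the corresponding image's visual tokens, so the model learns to ground references like "the second image" across the interleaved sequence.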
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 59.2 | 1285 |
| Video Understanding | MVBench | Accuracy | 51.4 | 425 |
| Science Question Answering | ScienceQA (SQA) | Accuracy | 56.8 | 273 |
| Visual Question Answering | OK-VQA | Accuracy | 55.4 | 260 |
| Visual Question Answering | AI2D | Accuracy | 46.8 | 249 |
| Video Understanding | VideoMME | -- | -- | 222 |
| 3D Question Answering | ScanQA (val) | -- | -- | 217 |
| Visual Question Answering | VQAv2 | Accuracy | 74.9 | 177 |
| Visual Perception | BLINK | -- | -- | 122 |
| Multimodal Understanding | SEED-Bench Image | Accuracy | 59.3 | 121 |