mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
About
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model mPLUG-Owl3, which enhances long image-sequence understanding in scenarios involving retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks that efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results show that mPLUG-Owl3 achieves state-of-the-art performance among models of similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long-visual-sequence evaluation, Distractor Resistance, to assess a model's ability to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
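To make the hyper attention idea concrete, here is a minimal, single-head numpy sketch of one way such a block could fuse modalities: text queries attend jointly over text and visual tokens, and a gated residual writes the result back into the text stream. All names (`hyper_attention`, the weight matrices, the scalar `gate`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hyper_attention(text, vision, Wq, Wk, Wv, gate):
    """Single-head sketch: queries come from text tokens only, while
    keys/values are drawn from the concatenated text + visual tokens,
    so vision is folded into a language-guided attention space."""
    q = text @ Wq                                   # (T, d) text queries
    kv = np.concatenate([text, vision], axis=0)     # joint key/value pool
    k, v = kv @ Wk, kv @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T, T+V) weights
    out = attn @ v
    return text + gate * out                        # gated residual update

d = 16
text = rng.standard_normal((4, d))    # 4 text tokens
vision = rng.standard_normal((9, d))  # 9 visual tokens (e.g. one image)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
fused = hyper_attention(text, vision, Wq, Wk, Wv, gate=0.5)
print(fused.shape)  # sequence length stays (4, d)
```

Note that the text sequence length is unchanged after fusion, which is what lets such a block scale to long interleaved inputs: visual tokens enter only through keys and values, never lengthening the language sequence.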
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 88.2 | 1455 |
| Visual Question Answering | TextVQA | Accuracy | 69 | 1285 |
| Visual Question Answering | GQA | Accuracy | 65 | 1249 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 40.1 | 431 |
| Video Understanding | MVBench | Accuracy | 59.5 | 425 |
| Visual Question Answering | OK-VQA | Accuracy | 60.1 | 260 |
| Long Video Understanding | LongVideoBench | Score | 52.1 | 248 |
| Diagram Understanding | AI2D | Accuracy | 73.4 | 247 |