mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

About

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou (2024)

Related benchmarks

Task                              Dataset         Result             Rank
Object Hallucination Evaluation   POPE            Accuracy 88.2      1455
Visual Question Answering         TextVQA         Accuracy 69        1285
Visual Question Answering         GQA             Accuracy 65        1249
Visual Question Answering         GQA             Accuracy 65        505
Multimodal Reasoning              MM-Vet          MM-Vet Score 40.1  431
Video Understanding               MVBench         Accuracy 59.5      425
Multimodal Capability Evaluation  MM-Vet          Score 40.1         345
Visual Question Answering         OK-VQA          Accuracy 60.1      260
Long Video Understanding          LongVideoBench  Score 52.1         248
Diagram Understanding             AI2D            Accuracy 73.4      247
Showing 10 of 118 rows