mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

About

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy 69 | 1117 |
| Visual Question Answering | GQA | Accuracy 65 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy 88.2 | 935 |
| Visual Question Answering | GQA | Accuracy 65 | 374 |
| Multimodal Capability Evaluation | MM-Vet | Score 40.1 | 282 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score 40.1 | 281 |
| Video Understanding | MVBench | Accuracy 59.5 | 247 |
| Visual Question Answering | OK-VQA | Accuracy 60.1 | 224 |
| Video Understanding | VideoMME | -- | 192 |
| Multimodal Model Evaluation | MMBench | Accuracy 77.6 | 180 |
Showing 10 of 88 rows
