mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

About

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou• 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy88.2	2019
Visual Question Answering	TextVQA	Accuracy69	1453
Visual Question Answering	GQA	Accuracy65	1425
Multimodal Understanding	MMBench	--	847
Video Understanding	MVBench	Accuracy59.5	563
Visual Question Answering	GQA	Accuracy65	524
Multimodal Reasoning	MM-Vet	MM-Vet Score40.1	517
Multimodal Capability Evaluation	MM-Vet	Score40.1	393
Diagram Question Answering	AI2D	AI2D Accuracy73.4	387
Diagram Understanding	AI2D	Accuracy73.4	317

Showing 10 of 140 rows

...

Other info

Code

Follow for update

@wizwand_team Discord