mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
About
Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction-following abilities across various open-ended tasks. However, previous methods have focused primarily on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance on both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 generalizes to both text tasks and multi-modal tasks, achieving state-of-the-art performance with a single generic model. Notably, mPLUG-Owl2 is the first MLLM to demonstrate the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.
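The modality-adaptive module described above routes text and visual tokens through modality-specific parameters (e.g., separate normalization) while the rest of the decoder is shared. The paper's exact layer layout is not reproduced here; the following is a minimal NumPy sketch of the general idea, assuming a per-token modality id and hypothetical modality-specific LayerNorm parameters:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the feature dimension, then scale and shift.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def modality_adaptive_norm(tokens, modality_ids, params):
    """Apply a modality-specific LayerNorm to each token (illustrative sketch).

    tokens:       (seq_len, d) array of token embeddings
    modality_ids: (seq_len,) array, 0 = text token, 1 = visual token
    params:       dict mapping modality id -> (gamma, beta)
    """
    out = np.empty_like(tokens)
    for m, (gamma, beta) in params.items():
        mask = modality_ids == m
        out[mask] = layer_norm(tokens[mask], gamma, beta)
    return out

d = 4
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, d))
modality_ids = np.array([1, 1, 1, 0, 0, 0])  # 3 visual, then 3 text tokens
params = {
    0: (np.ones(d), np.zeros(d)),        # text: unit scale (hypothetical)
    1: (2.0 * np.ones(d), np.zeros(d)),  # visual: distinct scale (hypothetical)
}
out = modality_adaptive_norm(tokens, modality_ids, params)
print(out.shape)  # (6, 4)
```

Under this sketch, all tokens flow through the same sequence (the shared decoder acting as a universal interface), while normalization statistics and scales stay modality-specific, which is the mechanism the abstract credits with preserving modality-specific features.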
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 79.4 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 58.2 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 54.5 | 1043 |
| Visual Question Answering | GQA | Accuracy | 56.11 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 86.2 | 935 |
| Language Understanding | MMLU | Accuracy | 53.4 | 756 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 79.4 | 664 |
| Multimodal Evaluation | MME | Score | 1450 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 53.9 | 496 |
| Video Question Answering | MSRVTT-QA | Accuracy | 46.7 | 481 |