mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
About
Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction-following abilities across various open-ended tasks. However, previous methods have focused primarily on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance on both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 generalizes to both text tasks and multi-modal tasks, achieving state-of-the-art performance with a single generic model. Notably, mPLUG-Owl2 is the first MLLM to demonstrate the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.
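The modality-adaptive module described above routes text and visual tokens through modality-specific parameters (e.g., separate normalization) while the rest of the decoder is shared. The paper's exact layer layout is not reproduced here; the following is a minimal NumPy sketch of the general idea, assuming a per-token modality id and hypothetical modality-specific LayerNorm parameters:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the feature dimension, then scale and shift.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def modality_adaptive_norm(tokens, modality_ids, params):
    """Apply a modality-specific LayerNorm to each token (illustrative sketch).

    tokens:       (seq_len, d) array of token embeddings
    modality_ids: (seq_len,) array, 0 = text token, 1 = visual token
    params:       dict mapping modality id -> (gamma, beta)
    """
    out = np.empty_like(tokens)
    for m, (gamma, beta) in params.items():
        mask = modality_ids == m
        out[mask] = layer_norm(tokens[mask], gamma, beta)
    return out

d = 4
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, d))
modality_ids = np.array([1, 1, 1, 0, 0, 0])  # 3 visual, then 3 text tokens
params = {
    0: (np.ones(d), np.zeros(d)),        # text: unit scale (hypothetical)
    1: (2.0 * np.ones(d), np.zeros(d)),  # visual: distinct scale (hypothetical)
}
out = modality_adaptive_norm(tokens, modality_ids, params)
print(out.shape)  # (6, 4)
```

Under this sketch, all tokens flow through the same sequence (the shared decoder acting as a universal interface), while normalization statistics and scales stay modality-specific, which is the mechanism the abstract credits with preserving modality-specific features.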
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 79.4 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 58.2 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 54.5 | 1043 |
| Visual Question Answering | GQA | Accuracy | 56.11 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 86.2 | 935 |
| Language Understanding | MMLU | Accuracy | 53.4 | 756 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 79.4 | 664 |
| Multimodal Evaluation | MME | Score | 1450 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 53.9 | 496 |
| Video Question Answering | MSRVTT-QA | Accuracy | 46.7 | 481 |