MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

About

Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language (VL) tasks. However, a generalist MLLM typically underperforms a specialist MLLM on most VL tasks, which can be attributed to task interference. In this paper, we propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM. MoME is composed of two key components: a mixture of vision experts (MoVE) and a mixture of language experts (MoLE). MoVE adaptively modulates the features transformed from various vision encoders and has strong compatibility with the transformation architecture. MoLE incorporates sparsely gated experts into LLMs to achieve painless improvements with roughly unchanged inference costs. In response to task interference, MoME specializes in both the vision and language modalities to adapt to task discrepancies. Extensive experiments show that MoME significantly improves the performance of generalist MLLMs across various VL tasks. The source code is released at https://github.com/JiuTian-VL/MoME

Leyang Shen, Gongwei Chen, Rui Shao, Weili Guan, Liqiang Nie • 2024
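
The abstract says MoLE adds sparsely gated experts while keeping inference cost roughly unchanged. The sketch below illustrates the general idea of top-k sparse gating only: a gate scores every expert for a token, but only the highest-scoring experts are actually evaluated. It is not the authors' implementation. The names `sparse_moe`, `gate_weights`, and `experts` are hypothetical, and each expert is assumed to be a plain linear map for brevity.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def sparse_moe(token, gate_weights, experts, top_k=1):
    """Route one token through its top_k experts (hypothetical helper).

    token:        input vector of length dim
    gate_weights: num_experts x dim matrix producing one logit per expert
    experts:      list of dim x dim matrices (assumed linear experts)
    """
    probs = softmax(matvec(gate_weights, token))
    # Keep only the top_k experts by gate probability.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)

    out = [0.0] * len(token)
    for i in chosen:
        expert_out = matvec(experts[i], token)  # unchosen experts are never run
        weight = probs[i] / norm                # renormalise over chosen experts
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, chosen


if __name__ == "__main__":
    random.seed(0)
    dim, num_experts = 4, 3
    gate_weights = [[random.uniform(-1, 1) for _ in range(dim)]
                    for _ in range(num_experts)]
    experts = [[[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
               for _ in range(num_experts)]
    token = [0.5, -1.0, 0.25, 2.0]
    output, chosen = sparse_moe(token, gate_weights, experts, top_k=1)
    print("chosen experts:", chosen)
    print("output:", output)
```

With `top_k=1`, per-token compute is one gate projection plus one expert, regardless of how many experts exist, which is why adding experts in this style leaves inference cost roughly flat.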

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VizWiz | Accuracy | 81 | 1820 |
| Visual Question Answering | TextVQA | Accuracy | 53.2 | 1453 |
| Visual Question Answering | GQA | Accuracy | 81.2 | 1425 |
| Visual Question Answering | ChartQA | Accuracy | 57.2 | 519 |
| Visual Question Answering | ScienceQA | Accuracy | 80.4 | 446 |
| Visual Question Answering | VQA v2 | Accuracy | 75.7 | 333 |
| Multimodal Evaluation | MM-Vet | -- | -- | 196 |
| Multimodal Evaluation | MMStar | Accuracy | 48.1 | 139 |
| Visual Question Answering | DocVQA | ANLS | 50.8 | 59 |
| Visual Question Answering | MRAG-Bench | Overall Accuracy | 66.78 | 14 |
