Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

About

Multimodal Large Language Models (MLLMs) have showcased impressive skills in tasks related to visual understanding and reasoning. Yet, their widespread application faces obstacles due to the high computational demands during both the training and inference phases, restricting their use to a limited audience within the research and user communities. In this paper, we investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha, which is designed to create synergy among various aspects: visual representation, language models, and optimization strategies. We show that without increasing the volume of training data, our Mipha-3B outperforms the state-of-the-art large MLLMs, especially LLaVA-1.5-13B, on multiple benchmarks. Through detailed discussion, we provide insights and guidelines for developing strong MSLMs that rival the capabilities of MLLMs. Our code is available at https://github.com/zhuyiche/llava-phi.

Minjie Zhu, Yichen Zhu, Xin Liu, Ning Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Zhicai Ou, Feifei Feng, Jian Tang • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy: 87.3 | 2019 |
| Visual Question Answering | VizWiz | Accuracy: 45.7 | 1820 |
| Visual Question Answering | TextVQA | Accuracy: 56.6 | 1453 |
| Visual Question Answering | GQA | Accuracy: 63.9 | 1425 |
| Multimodal Evaluation | MME | -- | 727 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy: 81.3 | 712 |
| Multimodal Understanding | MM-Vet | MM-Vet Score: 32.1 | 631 |
| Multimodal Understanding | SEED-Bench | -- | 516 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy: 32.5 | 363 |
| Science Question Answering | ScienceQA IMG | Accuracy: 70.9 | 335 |
Showing 10 of 37 rows
