
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

About

In recent years, multimodal large language models (MLLMs) have achieved remarkable success in various fields. However, as the foundation models for many downstream tasks, current MLLMs are built on the well-known Transformer network, whose quadratic computational complexity makes it less efficient. To improve the efficiency of such foundation models, we propose Cobra, an MLLM with linear computational complexity. Specifically, Cobra integrates the efficient Mamba language model into the visual modality. Moreover, we explore and study various modal fusion schemes to create an effective multi-modal Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely competitive performance with current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and is faster thanks to its linear sequential modeling; (2) interestingly, results on challenging closed-set prediction benchmarks show that Cobra performs well in overcoming visual illusions and judging spatial relationships; (3) notably, Cobra even achieves performance comparable to LLaVA with about 43% of the parameters. We will make all code of Cobra open-source and hope that the proposed method can facilitate future research on complexity problems in MLLMs. Our project page is available at: https://sites.google.com/view/cobravlm.
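
To make the quadratic-versus-linear contrast in the abstract concrete, here is a minimal sketch. It is not Cobra's or Mamba's implementation: it uses a plain scalar linear recurrence with fixed, hypothetical parameters `a`, `b`, and `c`, whereas Mamba uses input-dependent (selective) parameters and a hardware-aware scan. The function names are illustrative assumptions. The point is only that a recurrent state update does a constant amount of work per token, while full self-attention scores every token against every other token.

```python
def attention_pair_count(n_tokens: int) -> int:
    """Number of query-key scores computed by full self-attention.

    Every token attends to every token, so the count grows as n^2.
    """
    return n_tokens * n_tokens


def ssm_scan(inputs, a=0.9, b=1.0, c=1.0):
    """Run a scalar state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    a, b, c are fixed illustrative constants (an assumption of this
    sketch). Returns the outputs and the number of state updates,
    which equals the sequence length.
    """
    h = 0.0
    outputs = []
    steps = 0
    for x in inputs:
        h = a * h + b * x      # one constant-cost state update per token
        outputs.append(c * h)
        steps += 1
    return outputs, steps


if __name__ == "__main__":
    for n in (8, 64, 512):
        _, steps = ssm_scan([1.0] * n)
        print(n, "tokens:", attention_pair_count(n), "attention scores vs",
              steps, "recurrent updates")
```

Doubling the sequence length doubles the number of recurrent updates but quadruples the number of attention scores, which is the scaling argument the abstract relies on.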

Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang • 2024

Related benchmarks

Task                             Dataset          Result                   Rank
Object Hallucination Evaluation  POPE             Accuracy 88.2            2019
Visual Question Answering        TextVQA          Accuracy 57.9            1453
Visual Question Answering        VQA v2           Accuracy 76.9            1429
Visual Question Answering        GQA              Accuracy 59.9            1425
Optical Character Recognition    OCRBench         --                       433
Multimodal Understanding         MMStar           Accuracy 34.7            407
Visual Question Answering        TextVQA (val)    VQA Score 59.5           365
Visual Question Answering        GQA (test-dev)   Accuracy 63.9            236
Visual Question Answering        VQA 2.0 (val)    Accuracy (Overall) 79.2  183
Multimodal Sentiment Analysis    MOSEI            --                       183

(10 of 24 rows shown.)
