MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices
About
We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It combines a set of mobile-oriented architectural designs and techniques: language models at the 1.4B and 2.7B parameter scale trained from scratch, a multimodal vision model pre-trained in the CLIP fashion, and cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks, where our models perform on par with a few much larger models. More importantly, we measure inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, obtaining state-of-the-art speeds of 21.5 and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.
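The design above has three parts: a CLIP-style vision encoder, an efficient projector that compresses visual tokens into the language model's embedding space, and a small decoder-only language model. The PyTorch sketch below illustrates how such a pipeline typically composes; the module names, dimensions, and the simple depthwise-convolution downsampling are illustrative assumptions, not MobileVLM's actual projector (the paper introduces its own lightweight design).

```python
import torch
import torch.nn as nn

class EfficientProjector(nn.Module):
    """Illustrative projector: maps vision features into the LLM embedding
    space while reducing the number of visual tokens 4x. This is a
    hypothetical stand-in, not MobileVLM's actual projector."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)
        # Depthwise conv with stride 2 halves the patch grid in each
        # spatial dimension, i.e. 4x fewer visual tokens for the LLM.
        self.down = nn.Conv2d(llm_dim, llm_dim, kernel_size=3, stride=2,
                              padding=1, groups=llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, vision_dim) from a CLIP-style encoder
        b, n, _ = x.shape
        h = w = int(n ** 0.5)                 # assume a square patch grid
        x = self.proj(x)                      # (b, n, llm_dim)
        x = x.transpose(1, 2).reshape(b, -1, h, w)
        x = self.down(x)                      # (b, llm_dim, h/2, w/2)
        return x.flatten(2).transpose(1, 2)   # (b, n/4, llm_dim)

# Smoke test: a CLIP ViT-L/14 encoder at 336px yields a 24x24 patch grid.
proj = EfficientProjector(vision_dim=1024, llm_dim=2048)
feats = torch.randn(1, 576, 1024)
print(proj(feats).shape)  # torch.Size([1, 144, 2048])
```

Downstream, the projected visual tokens are typically concatenated with the text token embeddings and fed to the language model as a single sequence, with the projector serving as the bridge that aligns the two modalities' feature spaces.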
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 59 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 47.5 | 1117 |
| Visual Question Answering | GQA | Accuracy | 59 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 84.9 | 935 |
| Multimodal Evaluation | MME | Score | 1.30e+3 | 557 |
| Visual Question Answering | GQA | Accuracy | 59.03 | 374 |
| Multimodal Understanding | MMBench | -- | -- | 367 |
| Visual Question Answering | TextVQA (val) | VQA Score | 47.5 | 309 |
| Science Question Answering | ScienceQA IMG | Accuracy | 61.2 | 256 |
| Science Question Answering | ScienceQA | -- | -- | 229 |
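For context on the throughput figures quoted in the abstract (21.5 tokens/s on the Snapdragon 888 CPU and 65.3 tokens/s on the Jetson Orin GPU): decoding speed is conventionally reported as generated tokens divided by wall-clock decode time. Below is a minimal, generic timing sketch; `generate_fn` is a placeholder for whatever on-device generation call is being benchmarked, not an API from the MobileVLM repo.

```python
import time

def decode_tokens_per_second(generate_fn, prompt, max_new_tokens=256,
                             warmup_runs=1, timed_runs=3):
    """Generic tokens/s measurement for an autoregressive decoder.
    `generate_fn(prompt, max_new_tokens=...)` is a hypothetical callable
    that returns the list of generated token ids."""
    # Warm-up runs amortize one-time costs (cache fills, GPU clock ramp-up).
    for _ in range(warmup_runs):
        generate_fn(prompt, max_new_tokens=max_new_tokens)
    speeds = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        tokens = generate_fn(prompt, max_new_tokens=max_new_tokens)
        elapsed = time.perf_counter() - start
        speeds.append(len(tokens) / elapsed)
    return sum(speeds) / len(speeds)  # average tokens per second
```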