MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
About
We introduce MobileVLM V2, a family of significantly improved vision language models built upon MobileVLM, which demonstrates that a careful orchestration of novel architectural design, an improved training scheme tailored for mobile VLMs, and rich, high-quality dataset curation can substantially improve VLM performance. Specifically, MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale. Notably, our 3B model outperforms a large variety of VLMs at the 7B+ scale. Our models will be released at https://github.com/Meituan-AutoML/MobileVLM.
Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen • 2024
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 77.3 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 57.5 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 48.8 | 1043 |
| Visual Question Answering | GQA | Accuracy | 62.6 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 86.1 | 935 |
| Multimodal Evaluation | MME | Score | 78 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 62.3 | 496 |
| Multimodal Understanding | MMBench | Accuracy | 69.2 | 367 |
| Science Question Answering | ScienceQA IMG | Accuracy | 70 | 256 |
| Science Question Answering | ScienceQA | -- | -- | 229 |
(Showing 10 of 44 benchmark rows.)