
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

About

We introduce MobileVLM V2, a family of significantly improved vision language models built upon MobileVLM, which proves that a delicate orchestration of novel architectural design, an improved training scheme tailored for mobile VLMs, and rich, high-quality dataset curation can substantially benefit VLMs' performance. Specifically, MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale. Notably, our 3B model outperforms a large variety of VLMs at the 7B+ scale. Our models will be released at https://github.com/Meituan-AutoML/MobileVLM.

Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Question Answering | VizWiz | Accuracy | 48.8 | 1525
Object Hallucination Evaluation | POPE | Accuracy | 87.6 | 1455
Visual Question Answering | VQA v2 | Accuracy | 77.3 | 1362
Visual Question Answering | TextVQA | Accuracy | 57.5 | 1285
Visual Question Answering | GQA | Accuracy | 62.6 | 1249
Text-based Visual Question Answering | TextVQA | Accuracy | 62.3 | 807
Multimodal Evaluation | MME | Score | 78 | 658
Multimodal Understanding | MMBench | Accuracy | 69.2 | 637
Science Question Answering | ScienceQA | -- | -- | 502
Multimodal Reasoning | MM-Vet | MM-Vet Score | 34.4 | 431
(Showing 10 of 74 rows.)
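For context on the Accuracy columns above: VQA-style benchmarks such as VQA v2 and VizWiz score each predicted answer against the ten human-provided ground-truth answers. Below is a simplified sketch of that metric; the official evaluator additionally normalizes answers and averages the score over all subsets of nine annotators, so this is an approximation, not the exact evaluation code used for the leaderboard numbers.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: an answer is fully correct if at least
    3 of the 10 human annotators gave it, partially correct otherwise.
    Score = min(#matching annotators / 3, 1).
    """
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# Example: 2 of 10 annotators agree with the prediction -> score 2/3.
score = vqa_accuracy("cat", ["cat", "cat"] + ["dog"] * 8)
```

A benchmark's reported Accuracy (e.g. 77.3 on VQA v2) is the mean of this per-question score over the test set, expressed as a percentage.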
