MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
About
We introduce MobileVLM V2, a family of significantly improved vision language models built upon MobileVLM, which demonstrates that a careful orchestration of novel architectural design, an improved training scheme tailored for mobile VLMs, and rich, high-quality dataset curation can substantially improve VLM performance. Specifically, MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale. Notably, our 3B model outperforms a large variety of VLMs at the 7B+ scale. Our models will be released at https://github.com/Meituan-AutoML/MobileVLM.
Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, Chunhua Shen • 2024
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 77.3 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 57.5 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 48.8 | 1043 |
| Visual Question Answering | GQA | Accuracy | 62.6 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 86.1 | 935 |
| Multimodal Evaluation | MME | Score | 78 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 62.3 | 496 |
| Multimodal Understanding | MMBench | Accuracy | 69.2 | 367 |
| Science Question Answering | ScienceQA IMG | Accuracy | 70 | 256 |
| Science Question Answering | ScienceQA | -- | -- | 229 |
(Showing 10 of 44 benchmark rows.)