Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
About
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model designed for efficient deployment on consumer GPU servers. Our work directly addresses a pivotal industry issue: the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we developed a 1B-scale language model from the ground up and employed the LLaVA paradigm for modal alignment. The result, Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks shows that, despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
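For readers unfamiliar with the LLaVA paradigm mentioned above, the sketch below illustrates the core modal-alignment step: patch features from a frozen vision encoder are mapped by a small projector into the language model's embedding space and concatenated with the text embeddings. This is a minimal illustration only; the two-layer MLP projector (the common LLaVA-1.5 choice), the `VisionProjector` class, and all dimensions are assumptions for exposition, not Xmodel-VLM's actual configuration.

```python
# Minimal sketch of LLaVA-style modal alignment (illustrative, not
# Xmodel-VLM's exact architecture or dimensions).
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Two-layer MLP projector mapping vision features to the LLM space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.mlp(vision_feats)


# Hypothetical sizes: a CLIP-like encoder (1024-d patch features) feeding
# a ~1B-parameter LLM with a 2048-d hidden size.
vision_dim, llm_dim = 1024, 2048
projector = VisionProjector(vision_dim, llm_dim)

vision_feats = torch.randn(1, 576, vision_dim)  # e.g. 24x24 patch tokens
text_embeds = torch.randn(1, 32, llm_dim)       # embedded prompt tokens

# Image tokens are prepended to the text tokens, and the combined sequence
# is fed to the decoder-only language model as usual.
inputs_embeds = torch.cat([projector(vision_feats), text_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([1, 608, 2048])
```

In the LLaVA recipe this projector is first trained alone on image-caption pairs to align the two modalities, after which projector and language model are fine-tuned jointly on visual instruction data.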
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 39.9 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 41.7 | 1043 |
| Visual Question Answering | GQA | Accuracy | 58.3 | 963 |
| Object Hallucination Evaluation | POPE | -- | -- | 935 |
| Multimodal Evaluation | MME | Score | 1250 | 557 |
| Science Question Answering | ScienceQA IMG | Accuracy | 53.3 | 256 |
| Multimodal Model Evaluation | MMBench | Accuracy | 52.0 | 180 |
| Multimodal Evaluation | MM-Vet | Accuracy | 21.8 | 122 |
| Multimodal Model Evaluation | MMBench Chinese | Accuracy | 45.7 | 121 |
| Zero-shot Language Modeling | Prominent Language Benchmarks (ARC, BoolQ, HellaSwag, OpenBookQA, PIQA, SciQ, TriviaQA, Winogrande) | ARC-Challenge Accuracy | 28.16 | 5 |