
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

About

We introduce Xmodel-VLM, a cutting-edge multimodal vision language model designed for efficient deployment on consumer GPU servers. Our work directly confronts a pivotal industry issue: the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks shows that, despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
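
The LLaVA-style modal alignment mentioned above boils down to a small projector that maps frozen vision-encoder features into the language model's embedding space, so the 1B LLM can attend to image patches as if they were ordinary text tokens. The sketch below illustrates the idea in PyTorch; the dimensions, the two-layer MLP projector, and the module name VisionLanguageConnector are illustrative assumptions rather than the actual Xmodel-VLM implementation (see the GitHub repository for that).

```python
# Minimal sketch of LLaVA-style modal alignment.
# Dimensions, module names, and the two-layer MLP projector are
# assumptions for illustration, not the Xmodel-VLM source code.
import torch
import torch.nn as nn


class VisionLanguageConnector(nn.Module):
    """Projects vision-encoder patch features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a CLIP-style encoder
        return self.proj(patch_features)


# Usage: prepend projected image tokens to the text embeddings fed to the LLM.
connector = VisionLanguageConnector()
image_feats = torch.randn(2, 576, 1024)   # e.g. 24x24 patches from the vision tower
image_tokens = connector(image_feats)     # (2, 576, 2048)
text_tokens = torch.randn(2, 32, 2048)    # embeddings of the text prompt
llm_inputs = torch.cat([image_tokens, text_tokens], dim=1)
print(llm_inputs.shape)                   # torch.Size([2, 608, 2048])
```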

Wanting Xu, Yang Liu, Langping He, Xucheng Huang, Ling Jiang • 2024

Related benchmarks

Task | Dataset | Result | Rank
---- | ------- | ------ | ----
Visual Question Answering | TextVQA | Accuracy 39.9 | 1117
Visual Question Answering | VizWiz | Accuracy 41.7 | 1043
Visual Question Answering | GQA | Accuracy 58.3 | 963
Object Hallucination Evaluation | POPE | -- | 935
Multimodal Evaluation | MME | Score 1250 | 557
Science Question Answering | ScienceQA IMG | Accuracy 53.3 | 256
Multimodal Model Evaluation | MMBench | Accuracy 52 | 180
Multimodal Evaluation | MM-Vet | Accuracy 21.8 | 122
Multimodal Model Evaluation | MMBench Chinese | Accuracy 45.7 | 121
Zero-shot Language Modeling | Prominent Language Benchmarks (ARC, BoolQ, HellaSwag, OpenBookQA, PIQA, SciQ, TriviaQA, Winogrande) | ARC-Challenge Acc 28.16 | 5

Other info

Code: https://github.com/XiaoduoAILab/XmodelVLM
