
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

About

We introduce Xmodel-VLM, a cutting-edge multimodal vision language model designed for efficient deployment on consumer GPU servers. Our work addresses a key industry problem: the prohibitive serving costs that hinder broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks shows that, despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
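As a rough illustration of the LLaVA-style modal alignment mentioned above, the sketch below wires a frozen vision encoder to a small decoder-only language model through an MLP projector. The module names, dimensions, and projector design are assumptions chosen for clarity; they are not taken from the Xmodel-VLM codebase (see the GitHub repository for the actual implementation).

```python
# Minimal sketch of LLaVA-style vision-language alignment (illustrative only).
# Assumes a frozen CLIP-like vision encoder and a ~1B-parameter decoder-only LM;
# dimensions and module names are hypothetical, not Xmodel-VLM's exact code.
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Maps vision-encoder patch features into the language-model embedding space."""

    def __init__(self, vision_dim: int = 1024, lm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim)
        return self.proj(patch_feats)  # (batch, num_patches, lm_dim)


class TinyVLM(nn.Module):
    """Prepends projected image tokens to text embeddings before the language model."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1024, lm_dim: int = 2048):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g. a CLIP ViT, kept frozen
        self.projector = VisionProjector(vision_dim, lm_dim)
        self.language_model = language_model    # small decoder-only LM consuming embeddings

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                   # vision tower stays frozen
            patch_feats = self.vision_encoder(images)
        image_tokens = self.projector(patch_feats)
        # Concatenate image tokens and text embeddings along the sequence axis,
        # then let the language model attend over the combined sequence.
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs)
```

In this setup only the projector (and, in later stages, the language model) is trained, which keeps alignment cheap relative to training a large end-to-end multimodal model.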

Wanting Xu, Yang Liu, Langping He, Xucheng Huang, Ling Jiang • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Question Answering | VizWiz | Accuracy | 41.7 | 1525
Object Hallucination Evaluation | POPE | -- | -- | 1455
Visual Question Answering | TextVQA | Accuracy | 39.9 | 1285
Visual Question Answering | GQA | Accuracy | 58.3 | 1249
Multimodal Evaluation | MME | Score | 1250 | 658
Science Question Answering | ScienceQA IMG | Accuracy | 53.3 | 294
Multimodal Model Evaluation | MMBench | Accuracy | 52 | 180
Multimodal Evaluation | MM-Vet | -- | -- | 180
Multimodal Model Evaluation | MMBench Chinese | Accuracy | 45.7 | 154
Zero-shot Language Modeling | Prominent Language Benchmarks (ARC, BoolQ, HellaSwag, OpenBookQA, PIQA, SciQ, TriviaQA, Winogrande) | ARC-Challenge Acc | 28.16 | 5

Other info

Code: https://github.com/XiaoduoAILab/XmodelVLM
