Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
About
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model designed for efficient deployment on consumer GPU servers. Our work directly addresses a pivotal industry issue: the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we developed a 1B-scale language model from the ground up and employed the LLaVA paradigm for modal alignment. The result, Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks shows that, despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
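For readers unfamiliar with the LLaVA paradigm mentioned above, the sketch below illustrates the core modal-alignment step: patch features from a frozen vision encoder are mapped by a small projector into the language model's embedding space and concatenated with the text embeddings. This is a minimal illustration only; the two-layer MLP projector (the common LLaVA-1.5 choice), the `VisionProjector` class, and all dimensions are assumptions for exposition, not Xmodel-VLM's actual configuration.

```python
# Minimal sketch of LLaVA-style modal alignment (illustrative, not
# Xmodel-VLM's exact architecture or dimensions).
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Two-layer MLP projector mapping vision features to the LLM space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.mlp(vision_feats)


# Hypothetical sizes: a CLIP-like encoder (1024-d patch features) feeding
# a ~1B-parameter LLM with a 2048-d hidden size.
vision_dim, llm_dim = 1024, 2048
projector = VisionProjector(vision_dim, llm_dim)

vision_feats = torch.randn(1, 576, vision_dim)  # e.g. 24x24 patch tokens
text_embeds = torch.randn(1, 32, llm_dim)       # embedded prompt tokens

# Image tokens are prepended to the text tokens, and the combined sequence
# is fed to the decoder-only language model as usual.
inputs_embeds = torch.cat([projector(vision_feats), text_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([1, 608, 2048])
```

In the LLaVA recipe this projector is first trained alone on image-caption pairs to align the two modalities, after which projector and language model are fine-tuned jointly on visual instruction data.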
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 39.9 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 41.7 | 1043 |
| Visual Question Answering | GQA | Accuracy | 58.3 | 963 |
| Object Hallucination Evaluation | POPE | -- | -- | 935 |
| Multimodal Evaluation | MME | Score | 1250 | 557 |
| Science Question Answering | ScienceQA IMG | Accuracy | 53.3 | 256 |
| Multimodal Model Evaluation | MMBench | Accuracy | 52.0 | 180 |
| Multimodal Evaluation | MM-Vet | Accuracy | 21.8 | 122 |
| Multimodal Model Evaluation | MMBench Chinese | Accuracy | 45.7 | 121 |
| Zero-shot Language Modeling | Prominent Language Benchmarks (ARC, BoolQ, HellaSwag, OpenBookQA, PIQA, SciQ, TriviaQA, Winogrande) | ARC-Challenge Accuracy | 28.16 | 5 |