jina-vlm: Small Multilingual Vision Language Model
About
We present jina-vlm, a token-efficient 2.4B-parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language decoder and uses image tiling and attention pooling for token-efficient processing of arbitrary-resolution images. To understand the contribution of different training data categories, we conduct a leave-one-out data mixture ablation study, systematically removing task, domain, modality, and language categories, to diagnose which data types are necessary versus redundant and whether task benefits transfer across domains. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao • 2025
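The leave-one-out ablation described in the abstract can be illustrated with a minimal sketch: for each data category, build the training mixture that omits only that category, then compare the resulting model against the full-mixture baseline. The category names below are hypothetical placeholders (the abstract names only the category types, not the individual categories), and the helper `leave_one_out_mixtures` is an illustration, not part of the released code.

```python
from typing import Dict, List

# Hypothetical category names, for illustration only;
# the abstract does not enumerate the actual categories.
CATEGORIES: List[str] = ["task_a", "domain_b", "modality_c", "language_d"]


def leave_one_out_mixtures(categories: List[str]) -> Dict[str, List[str]]:
    """Map each held-out category to the mixture that excludes only it."""
    return {
        held_out: [c for c in categories if c != held_out]
        for held_out in categories
    }


mixtures = leave_one_out_mixtures(CATEGORIES)
for held_out, mixture in mixtures.items():
    # Each mixture would drive one ablation training run.
    print(f"without {held_out}: train on {mixture}")
```

A category whose removal barely changes benchmark scores would be a candidate for "redundant"; one whose removal causes a clear drop would count as "necessary" in the sense the abstract uses.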
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MathVista | Accuracy | 59.5 | 382 |
| Text-based Visual Question Answering | TextVQA (val) | Accuracy | 83.2 | 276 |
| Mathematical Reasoning | WeMath | Accuracy | 17.1 | 225 |
| Multimodal Reasoning | MMMU | Accuracy | 45.6 | 208 |
| Mathematical Reasoning | MathVision | Accuracy | 19.2 | 168 |
| Document Visual Question Answering | DocVQA (val) | Accuracy | 90.6 | 166 |
| Logical Reasoning | LogicVista | Accuracy | 33.3 | 113 |
| Visual Question Answering | InfoVQA (val) | Accuracy | 71.6 | 91 |
| Visual Question Answering | AI2D (test) | Accuracy | 82 | 82 |
| Visual Question Answering | OCRBench | Score | 778 | 53 |