
jina-vlm: Small Multilingual Vision Language Model

About

We present jina-vlm, a token-efficient 2.4B parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language decoder and uses image tiling and attention pooling for token-efficient processing of arbitrary-resolution images. To understand the contribution of different training data categories, we conduct a leave-one-out data mixture ablation study, systematically removing task, domain, modality, and language categories, to diagnose which data types are necessary versus redundant and whether task benefits transfer across domains. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.

Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao • 2025
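
The leave-one-out ablation mentioned in the abstract can be illustrated with a minimal sketch: for each data category, build a training mixture that contains every category except that one. This is not the authors' code. The function name `leave_one_out_mixtures` and all category and dataset names below are hypothetical placeholders chosen for illustration.

```python
from typing import Dict, List


def leave_one_out_mixtures(categories: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return one mixture per category, each omitting that category's datasets."""
    mixtures: Dict[str, List[str]] = {}
    for held_out in categories:
        # Keep every dataset whose category is not the held-out one.
        mixtures[held_out] = [
            dataset
            for name, datasets in categories.items()
            if name != held_out
            for dataset in datasets
        ]
    return mixtures


# Hypothetical category and dataset names, for illustration only.
example_categories = {
    "ocr": ["ocr_set_a"],
    "charts": ["chart_set_a", "chart_set_b"],
    "multilingual": ["multilingual_set_a"],
}

for held_out, mixture in leave_one_out_mixtures(example_categories).items():
    print(f"without {held_out}: {mixture}")
```

Training one model per mixture and comparing each against a model trained on the full mixture would then indicate how much each category contributes, under the assumption that all other training settings are held fixed.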

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MathVista | Accuracy | 59.5 | 382 |
| Text-based Visual Question Answering | TextVQA (val) | Accuracy | 83.2 | 276 |
| Mathematical Reasoning | WeMath | Accuracy | 17.1 | 225 |
| Multimodal Reasoning | MMMU | Accuracy | 45.6 | 208 |
| Mathematical Reasoning | MathVision | Accuracy | 19.2 | 168 |
| Document Visual Question Answering | DocVQA (val) | Accuracy | 90.6 | 166 |
| Logical Reasoning | LogicVista | Accuracy | 33.3 | 113 |
| Visual Question Answering | InfoVQA (val) | Accuracy | 71.6 | 91 |
| Visual Question Answering | AI2D (test) | Accuracy | 82 | 82 |
| Visual Question Answering | OCRBench | Score | 778 | 53 |
Showing 10 of 20 rows
