Jina-VLM: Small Multilingual Vision Language Model

About

We present Jina-VLM, a 2.4B-parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. It couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
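The abstract's attention-pooling connector can be illustrated with a minimal sketch: a fixed set of learned query vectors cross-attends over however many vision patch tokens the encoder produces, so the language backbone always receives the same number of connector tokens regardless of image resolution. All dimensions, weight names, and the single-head formulation below are hypothetical illustrations, not the released model's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(patch_embeds, queries, w_k, w_v):
    """Pool a variable number of vision patch tokens down to a fixed
    number of connector tokens via single-head cross-attention.

    patch_embeds: (n_patches, d_vision) -- varies with image resolution
    queries:      (n_out, d_model)      -- learned, fixed-size
    """
    k = patch_embeds @ w_k                           # (n_patches, d_model)
    v = patch_embeds @ w_v                           # (n_patches, d_model)
    scores = queries @ k.T / np.sqrt(queries.shape[-1])
    attn = softmax(scores, axis=-1)                  # (n_out, n_patches)
    return attn @ v                                  # (n_out, d_model)

# Hypothetical sizes for demonstration only.
rng = np.random.default_rng(0)
d_vision, d_model, n_out = 32, 16, 4
w_k = rng.normal(size=(d_vision, d_model))
w_v = rng.normal(size=(d_vision, d_model))
queries = rng.normal(size=(n_out, d_model))

# Whatever the patch count (i.e. image resolution), the output is
# always n_out tokens of width d_model.
for n_patches in (49, 196, 729):
    patches = rng.normal(size=(n_patches, d_vision))
    pooled = attention_pool(patches, queries, w_k, w_v)
    assert pooled.shape == (n_out, d_model)
```

This is why the connector is token-efficient: a high-resolution image may yield hundreds of patch tokens, but the downstream language model's sequence cost stays constant.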

Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-based Visual Question Answering | TextVQA (val) | Accuracy | 83.2 | 262 |
| Mathematical Reasoning | MathVista | Accuracy | 59.5 | 257 |
| Mathematical Reasoning | WeMath | Accuracy | 17.1 | 161 |
| Document Visual Question Answering | DocVQA (val) | Accuracy | 90.6 | 157 |
| Mathematical Reasoning | MathVision | Accuracy | 19.2 | 144 |
| Multimodal Reasoning | MMMU | Accuracy | 45.6 | 130 |
| Visual Question Answering | InfoVQA (val) | Accuracy | 71.6 | 91 |
| Logical Reasoning | LogicVista | Accuracy | 33.3 | 84 |
| Visual Question Answering | AI2D (test) | Accuracy | 82 | 73 |
| Multilingual Text-centric Visual Question Answering | MTVQA | Average Score | 25.6 | 37 |

Showing 10 of 20 rows.
