
Jina-VLM: Small Multilingual Vision Language Model

About

We present Jina-VLM, a 2.4B-parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. It achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
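The abstract does not spell out the connector's internals, but the core idea of attention pooling is to let a small, fixed set of learned query vectors attend over a variable number of image patch tokens, compressing them to a constant token budget before they reach the language backbone. The sketch below is illustrative only (not the actual Jina-VLM code); the shapes, query count, and single-head formulation are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(patch_tokens, queries):
    """Single-head attention pooling.

    patch_tokens: (n_patches, d) -- varies with image resolution
    queries:      (n_queries, d) -- fixed, learned at training time
    returns:      (n_queries, d) -- constant-size output for the LLM
    """
    d = patch_tokens.shape[-1]
    scores = queries @ patch_tokens.T / np.sqrt(d)   # (n_queries, n_patches)
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ patch_tokens                    # (n_queries, d)

rng = np.random.default_rng(0)
d = 64
patches = rng.standard_normal((196, d))  # e.g. a 14x14 ViT patch grid
queries = rng.standard_normal((32, d))   # hypothetical fixed query budget
pooled = attention_pool(patches, queries)
print(pooled.shape)  # (32, 64)
```

Because the output size depends only on the number of queries, not the number of patches, the same connector handles images of any resolution at a fixed token cost to the language model.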

Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Text-based Visual Question Answering | TextVQA (val) | Accuracy | 83.2 | 146
Mathematical Reasoning | MathVista | Accuracy | 59.5 | 97
Mathematical Reasoning | WeMath | Accuracy | 17.1 | 75
Document Visual Question Answering | DocVQA (val) | Accuracy | 90.6 | 66
Visual Question Answering | AI2D (test) | Accuracy | 82.0 | 54
Multimodal Reasoning | MMMU | Accuracy | 45.6 | 44
Visual Question Answering | InfoVQA (val) | Accuracy | 71.6 | 41
Mathematical Reasoning | MathVision | Accuracy | 19.2 | 38
Multilingual Text-centric Visual Question Answering | MTVQA | Average Score | 25.6 | 37
Visual Question Answering | ChartQA (val) | Accuracy | 81.9 | 25

(Showing 10 of 20 rows.)
