jina-vlm: Small Multilingual Vision Language Model

About

We present jina-vlm, a token-efficient 2.4B-parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language decoder and uses image tiling and attention pooling for token-efficient processing of arbitrary-resolution images. To understand the contribution of different training data categories, we conduct a leave-one-out data mixture ablation study, systematically removing task, domain, modality, and language categories, to diagnose which data types are necessary versus redundant and whether task benefits transfer across domains. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
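As a rough illustration of the leave-one-out idea described in the abstract, the sketch below enumerates one ablation run per category, each training on everything except the held-out category. It is not the authors' pipeline: the `MIXTURE` layout, the dataset names, and the `leave_one_out` helper are all hypothetical placeholders, and only two of the four category axes are shown to keep it short.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical data mixture: axis -> category -> dataset names.
# These names are placeholders, not the datasets used in the paper.
MIXTURE: Dict[str, Dict[str, List[str]]] = {
    "task": {"vqa": ["vqa_a", "vqa_b"], "ocr": ["ocr_a"]},
    "language": {"en": ["vqa_a", "ocr_a"], "de": ["vqa_b"]},
}


def leave_one_out(
    mixture: Dict[str, Dict[str, List[str]]],
) -> Iterator[Tuple[str, str, List[str]]]:
    """Yield (axis, held_out_category, remaining_datasets) for every category."""
    # The full mixture is the union of all datasets across all axes.
    all_datasets = sorted(
        {name for cats in mixture.values() for names in cats.values() for name in names}
    )
    for axis, categories in mixture.items():
        for category, held_out in categories.items():
            removed = set(held_out)
            # One ablation run: train on everything except this category.
            remaining = [name for name in all_datasets if name not in removed]
            yield axis, category, remaining


if __name__ == "__main__":
    for axis, category, remaining in leave_one_out(MIXTURE):
        print(f"without {axis}={category}: train on {remaining}")
```

Comparing each run's benchmark scores against a baseline trained on the full mixture would then indicate whether the held-out category is necessary (scores drop) or redundant (scores hold).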

Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MathVista	Accuracy 59.5	382
Text-based Visual Question Answering	TextVQA (val)	Accuracy 83.2	276
Mathematical Reasoning	WeMath	Accuracy 17.1	225
Multimodal Reasoning	MMMU	Accuracy 45.6	208
Mathematical Reasoning	MathVision	Accuracy 19.2	168
Document Visual Question Answering	DocVQA (val)	Accuracy 90.6	166
Logical Reasoning	LogicVista	Accuracy 33.3	113
Visual Question Answering	InfoVQA (val)	Accuracy 71.6	91
Visual Question Answering	AI2D (test)	Accuracy 82	82
Visual Question Answering	OCRBench	Score 778	53

Showing 10 of 20 rows
