Zamba2-VL Technical Report

About

We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space layers with a small number of shared transformer blocks. Across a broad range of image understanding, reasoning, OCR, grounding, and counting benchmarks, Zamba2-VL is competitive with leading Transformer-based open-weight VLMs of comparable scale, including the Molmo2, Qwen3-VL, and InternVL3.5 families, and substantially outperforms prior SSM-based and hybrid VLMs such as VL-Mamba, Cobra, and mmMamba. Inheriting the near-linear prefill compute and small, near-constant recurrent state of its Zamba2 backbone, Zamba2-VL delivers roughly an order of magnitude lower time-to-first-token (TTFT) than these Transformer baselines at matched parameter scale, with the efficiency gap most pronounced at the smaller 1.2B and 2.7B scales most relevant to on-device and edge deployment. We release three models -- 1.2B, 2.7B, and 7B -- together with inference code at https://huggingface.co/collections/Zyphra/zamba2-vl.
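The claim about a small, near-constant recurrent state can be illustrated with a back-of-the-envelope memory estimate. The sketch below compares how a Transformer's key/value cache grows with sequence length against a fixed-size state-space state plus a few attention invocations. Every dimension in it (layer counts, head counts, head and state sizes) is a hypothetical placeholder chosen for illustration, not a Zamba2-VL or baseline configuration, and the function names are likewise made up for this example.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for a key/value cache: two tensors (K and V) per attention layer,
    each of shape seq_len x n_kv_heads x head_dim. Grows linearly with seq_len."""
    return 2 * n_attn_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, n_heads, head_dim, state_dim, bytes_per_elem=2):
    """Bytes for a recurrent state-space state: one n_heads x head_dim x state_dim
    tensor per layer. Does not depend on sequence length."""
    return n_ssm_layers * n_heads * head_dim * state_dim * bytes_per_elem


def hybrid_bytes(seq_len, n_ssm_layers, n_heads, head_dim, state_dim,
                 n_attn_invocations, n_kv_heads, attn_head_dim):
    """Fixed SSM state plus a KV cache for a small number of attention invocations."""
    return (ssm_state_bytes(n_ssm_layers, n_heads, head_dim, state_dim)
            + kv_cache_bytes(seq_len, n_attn_invocations, n_kv_heads, attn_head_dim))


if __name__ == "__main__":
    # Hypothetical dimensions, for illustration only.
    for seq_len in (1_024, 8_192, 65_536):
        transformer = kv_cache_bytes(seq_len, n_attn_layers=32, n_kv_heads=8, head_dim=128)
        hybrid = hybrid_bytes(seq_len, n_ssm_layers=32, n_heads=64, head_dim=64,
                              state_dim=64, n_attn_invocations=4, n_kv_heads=8,
                              attn_head_dim=128)
        print(f"{seq_len:>6} tokens: transformer {transformer / 2**20:.0f} MiB, "
              f"hybrid {hybrid / 2**20:.0f} MiB")
```

Under these assumed dimensions, the hybrid's footprint is dominated by the fixed state at short contexts and grows only through its few attention invocations, which is why the state is described as near-constant rather than strictly constant.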

Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge • 2026

Related benchmarks

Task	Dataset	Result
Visual Question Answering	TextVQA (val)	VQA Score 81	371
Document Visual Question Answering	DocVQA (test)	ANLS 92.9	292
Text-based Visual Question Answering	TextVQA (val)	--	276
Visual Question Answering	GQA (test-dev)	Accuracy 60.2	236
Multimodal Understanding	MMMU (val)	MMMU Score 43.8	211
Visual Question Answering	VQA 2.0 (val)	Accuracy (Overall) 82.8	183
Diagram Understanding	AI2D (test)	Accuracy 90.6	154
Object Hallucination Evaluation	POPE (test)	Accuracy 88.6	123
Chart Understanding	ChartQA (test)	Accuracy 85.3	119
Visual Question Answering	AI2D (test)	Accuracy 90.6	82

