Zamba2-VL Technical Report

About

We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space layers with a small number of shared transformer blocks. Across a broad range of image understanding, reasoning, OCR, grounding, and counting benchmarks, Zamba2-VL is competitive with leading Transformer-based open-weight VLMs of comparable scale, including the Molmo2, Qwen3-VL, and InternVL3.5 families, and substantially outperforms prior SSM-based and hybrid VLMs such as VL-Mamba, Cobra, and mmMamba. Inheriting the near-linear prefill compute and small, near-constant recurrent state of its Zamba2 backbone, Zamba2-VL delivers roughly an order of magnitude lower time-to-first-token (TTFT) than these Transformer baselines at matched parameter scale, with the efficiency gap most pronounced at the smaller 1.2B and 2.7B scales most relevant to on-device and edge deployment. We release three models -- 1.2B, 2.7B, and 7B -- together with inference code at https://huggingface.co/collections/Zyphra/zamba2-vl.

Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | TextVQA (val) | VQA Score: 81 | 365 |
| Document Visual Question Answering | DocVQA (test) | ANLS: 92.9 | 292 |
| Text-based Visual Question Answering | TextVQA (val) | -- | 276 |
| Visual Question Answering | GQA (test-dev) | Accuracy: 60.2 | 236 |
| Multimodal Understanding | MMMU (val) | MMMU Score: 43.8 | 199 |
| Visual Question Answering | VQA 2.0 (val) | Accuracy (Overall): 82.8 | 183 |
| Diagram Understanding | AI2D (test) | Accuracy: 90.6 | 154 |
| Chart Understanding | ChartQA (test) | Accuracy: 85.3 | 113 |
| Object Hallucination Evaluation | POPE (test) | Accuracy: 88.6 | 107 |
| Visual Question Answering | AI2D (test) | Accuracy: 90.6 | 82 |
Showing 10 of 18 rows
