Zamba2-VL Technical Report
About
We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space layers with a small number of shared transformer blocks. Across a broad range of image understanding, reasoning, OCR, grounding, and counting benchmarks, Zamba2-VL is competitive with leading Transformer-based open-weight VLMs of comparable scale, including the Molmo2, Qwen3-VL, and InternVL3.5 families, and substantially outperforms prior SSM-based and hybrid VLMs such as VL-Mamba, Cobra, and mmMamba. Inheriting the near-linear prefill compute and small, near-constant recurrent state of its Zamba2 backbone, Zamba2-VL delivers roughly an order of magnitude lower time-to-first-token (TTFT) than these Transformer baselines at matched parameter scale. The efficiency gap is most pronounced at the 1.2B and 2.7B scales, which are the ones most relevant to on-device and edge deployment. We release three models (1.2B, 2.7B, and 7B) together with inference code at https://huggingface.co/collections/Zyphra/zamba2-vl.
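The scaling argument behind the TTFT claim can be illustrated with a rough per-layer cost model. The sketch below is not taken from the report: the function names, the cost formulas (which ignore projections, the vision encoder, and the shared transformer blocks), and the example sizes `d_model=2048` and `d_state=64` are all illustrative assumptions. It only shows why attention prefill grows quadratically with prompt length while a state-space scan grows linearly, and why a recurrent state stays fixed in size while a KV cache grows.

```python
def attention_prefill_cost(n_tokens: int, d_model: int) -> int:
    """Approximate multiply-accumulates for one self-attention layer.

    Counts only the QK^T score matrix and the attention-weighted sum
    over values, each about n_tokens^2 * d_model operations.
    """
    return 2 * n_tokens * n_tokens * d_model


def ssm_prefill_cost(n_tokens: int, d_model: int, d_state: int) -> int:
    """Approximate multiply-accumulates for one state-space layer.

    Assumes each token updates and reads a (d_model x d_state) state once.
    """
    return 2 * n_tokens * d_model * d_state


def kv_cache_entries(n_tokens: int, d_model: int) -> int:
    """Keys plus values stored per attention layer; grows with the prompt."""
    return 2 * n_tokens * d_model


def ssm_state_entries(d_model: int, d_state: int) -> int:
    """Recurrent state per state-space layer; independent of prompt length."""
    return d_model * d_state


if __name__ == "__main__":
    d_model, d_state = 2048, 64  # hypothetical sizes, not Zamba2's config
    for n in (1_024, 4_096, 16_384):
        ratio = attention_prefill_cost(n, d_model) / ssm_prefill_cost(n, d_model, d_state)
        print(f"n={n}: attention/SSM cost ratio = {ratio:.0f}, "
              f"KV entries = {kv_cache_entries(n, d_model)}, "
              f"SSM state entries = {ssm_state_entries(d_model, d_state)}")
```

Under these assumptions the cost ratio is simply `n_tokens / d_state`, so the gap widens as image-heavy prompts get longer. A hybrid such as Zamba2 still contains a few attention blocks, which is why the abstract says "near-linear" rather than linear.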
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA (val) | VQA Score | 81 | 365 |
| Document Visual Question Answering | DocVQA (test) | ANLS | 92.9 | 292 |
| Text-based Visual Question Answering | TextVQA (val) | -- | -- | 276 |
| Visual Question Answering | GQA (test-dev) | Accuracy | 60.2 | 236 |
| Multimodal Understanding | MMMU (val) | MMMU Score | 43.8 | 199 |
| Visual Question Answering | VQA 2.0 (val) | Accuracy (Overall) | 82.8 | 183 |
| Diagram Understanding | AI2D (test) | Accuracy | 90.6 | 154 |
| Chart Understanding | ChartQA (test) | Accuracy | 85.3 | 113 |
| Object Hallucination Evaluation | POPE (test) | Accuracy | 88.6 | 107 |
| Visual Question Answering | AI2D (test) | Accuracy | 90.6 | 82 |