ZAYA1-VL-8B Technical Report

About

We present ZAYA1-VL-8B, a compact mixture-of-experts vision-language model built upon our in-house language model, ZAYA1-8B. Despite its compact size, ZAYA1-VL achieves performance competitive with leading base models such as Molmo2-4B and InternVL3.5-4B, while surpassing models including Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B across a range of image understanding, reasoning, and counting benchmarks. The architecture incorporates two key innovations: (1) vision-specific LoRA adapters integrated into the LLM to increase modality-specific capacity without increasing the number of experts, and (2) bidirectional attention over image tokens within the LLM to enhance visual understanding. We detail the full training pipeline including data composition at each stage, sequence packing, and the attention masking scheme. The model comprises 9.2B total parameters, with 1.4B active parameters including the vision encoder, and is publicly available at https://huggingface.co/Zyphra/ZAYA1-VL.
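The abstract does not spell out the masking scheme, so the following is only a minimal sketch of how bidirectional attention over image tokens might be combined with causal text attention and sequence packing. It assumes that tokens attend only within their own packed segment, that text tokens attend causally, and that tokens belonging to the same image attend to each other in both directions. The function name `build_attention_mask` and the `segment_ids` / `image_ids` encodings are hypothetical and are not taken from the report.

```python
def build_attention_mask(segment_ids, image_ids):
    """Return mask[i][j] == True when query token i may attend to key token j.

    segment_ids: packed-sequence index for each token (assumed encoding).
    image_ids:   0 for text tokens, a positive id shared by all tokens of
                 the same image (assumed encoding).
    """
    n = len(segment_ids)
    if len(image_ids) != n:
        raise ValueError("segment_ids and image_ids must have equal length")

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Sequence packing: never attend across packed segments.
            if segment_ids[i] != segment_ids[j]:
                continue
            # Standard causal rule: attend to self and earlier positions.
            causal = j <= i
            # Assumed bidirectional rule: tokens of the same image see each other.
            same_image = image_ids[i] != 0 and image_ids[i] == image_ids[j]
            mask[i][j] = causal or same_image
    return mask


# Two packed sequences: [text, img, img, text] and [text, img, img].
segment_ids = [0, 0, 0, 0, 1, 1, 1]
image_ids = [0, 1, 1, 0, 0, 2, 2]
mask = build_attention_mask(segment_ids, image_ids)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, each image span shows up as a fully filled square on the diagonal, text rows stay lower-triangular, and the two packed sequences never overlap. A real implementation would typically build the same pattern as a tensor and pass it to the attention kernel rather than looping in Python.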

Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge • 2026

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Visual Question Answering	VQA 2.0 (val)	Accuracy (Overall)	80	183
OCR & Document Understanding	OCRBench	Score	79.8	77
Multimodal Reasoning	SEED-Bench Image	Score	72.7	60
Counting	CountBenchQA	Accuracy	88.1	45
Perception and Reasoning	RealworldQA	Score	65	37
Cognition and Reasoning	MMMU (val)	Score	46	28
Math and Reasoning	MathVista mini	Overall Score	64	26
Document and chart understanding	ChartQA (test)	Accuracy	82.2	22
Document Understanding, OCR & Charts	TextVQA (val)	Score	74.4	18
Document Understanding, OCR & Charts	DocVQA (test)	Score	92.5	16

Showing 10 of 13 rows
