MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
About
Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
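The hybrid-tokenizer data flow described above — one shared vision encoder feeding a continuous adapter for understanding and a discrete (quantizing) adapter for generation — can be sketched in a toy form. This is a minimal illustrative sketch, not the paper's actual implementation: all names, dimensions, and the nearest-neighbor quantization details are assumptions made for clarity.

```python
import math
import random

# Toy sketch of Manzano's hybrid tokenizer data flow.
# All names/dimensions here are illustrative assumptions, not the paper's API.

random.seed(0)

DIM = 4            # toy feature dimension (assumption)
CODEBOOK_SIZE = 8  # toy codebook size (assumption)

def shared_encoder(image_patches):
    """Stand-in for the single shared vision encoder: one feature per patch."""
    return [[sum(p) / len(p) + i * 0.01 for i in range(DIM)]
            for p in image_patches]

def continuous_adapter(features):
    """Understanding path: pass features through as continuous embeddings
    (here a trivial identity projection)."""
    return features

# Toy codebook for the generation path.
codebook = [[random.uniform(-1, 1) for _ in range(DIM)]
            for _ in range(CODEBOOK_SIZE)]

def discrete_adapter(features):
    """Generation path: quantize each feature to the id of its nearest
    codebook entry, yielding discrete image tokens."""
    def nearest(f):
        return min(range(CODEBOOK_SIZE),
                   key=lambda k: math.dist(f, codebook[k]))
    return [nearest(f) for f in features]

# One image as a list of toy patches.
patches = [[0.1, 0.2], [0.9, 0.8], [0.5, 0.5]]
feats = shared_encoder(patches)
cont_embeddings = continuous_adapter(feats)   # -> LLM, for image-to-text
discrete_tokens = discrete_adapter(feats)     # -> LLM / diffusion decoder, for text-to-image
```

The key design point the sketch mirrors is that both paths share the same encoder output, so the continuous embeddings and discrete tokens live in a common semantic space; only the lightweight adapters differ.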
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-based Visual Question Answering | TextVQA | Accuracy | 84.3 | 496 |
| Text-to-Image Generation | GenEval | Overall Score | 85 | 467 |
| Mathematical Reasoning | MathVista | Score | 73.3 | 322 |
| OCR Evaluation | OCRBench | Score | 86.3 | 296 |
| Multi-discipline Multimodal Understanding | MMMU | -- | -- | 266 |
| Visual Question Answering | ChartQA | -- | -- | 239 |
| Multimodal Understanding | SEED-Bench | -- | -- | 203 |
| Diagram Understanding | AI2D (test) | Accuracy | 86 | 107 |
| Multimodal Understanding | MMBench EN | Overall Score | 83.4 | 39 |
| Reasoning-based Text-to-Image Generation | WISE | Overall Score | 54 | 33 |