
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

About

Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
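The hybrid tokenizer described above can be illustrated with a toy sketch: one shared encoder feeds two small adapters, one returning continuous embeddings (for understanding) and one returning discrete token ids (for generation). This is not the authors' implementation. The class name, the dimensions, the random linear layers, and the nearest-neighbour codebook lookup are all assumptions chosen for illustration; the abstract does not say how the discrete tokens are actually produced.

```python
import random


def linear(x, w):
    """Multiply a feature vector x (length d_in) by a weight matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class HybridTokenizerSketch:
    """Hypothetical toy model: shared encoder plus two lightweight adapters."""

    def __init__(self, patch_dim=4, feat_dim=8, llm_dim=6, codebook_size=16, seed=0):
        rng = random.Random(seed)
        self.encoder_w = rand_matrix(patch_dim, feat_dim, rng)    # shared vision encoder (stand-in)
        self.cont_w = rand_matrix(feat_dim, llm_dim, rng)         # continuous adapter
        self.disc_w = rand_matrix(feat_dim, llm_dim, rng)         # discrete adapter projection
        self.codebook = rand_matrix(codebook_size, llm_dim, rng)  # assumed codebook, not from the source

    def encode(self, patches):
        """Shared features used by both branches."""
        return [linear(p, self.encoder_w) for p in patches]

    def continuous_embeddings(self, patches):
        """Understanding branch: one continuous vector per patch."""
        return [linear(f, self.cont_w) for f in self.encode(patches)]

    def discrete_tokens(self, patches):
        """Generation branch: index of the nearest codebook entry per patch."""
        tokens = []
        for f in self.encode(patches):
            z = linear(f, self.disc_w)
            dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in self.codebook]
            tokens.append(dists.index(min(dists)))
        return tokens


tok = HybridTokenizerSketch()
patches = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, -0.2, 0.0]]
emb = tok.continuous_embeddings(patches)  # fed to the LLM for image-to-text
ids = tok.discrete_tokens(patches)        # predicted by the LLM for text-to-image
```

Both branches start from the same `encode` output, which is the point of the design: the continuous and discrete representations come from the same features rather than from two unrelated tokenizers. In the described system, a separate diffusion decoder would then turn the predicted image tokens into pixels; that stage is omitted here.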

Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen • 2025

Related benchmarks

Task                                       Dataset      Metric         Result  Rank
Text-based Visual Question Answering       TextVQA      Accuracy       84.3    496
Text-to-Image Generation                   GenEval      Overall Score  85      467
Mathematical Reasoning                     MathVista    Score          73.3    322
OCR Evaluation                             OCRBench     Score          86.3    296
Multi-discipline Multimodal Understanding  MMMU         --             --      266
Visual Question Answering                  ChartQA      --             --      239
Multimodal Understanding                   SEED-Bench   --             --      203
Diagram Understanding                      AI2D (test)  Accuracy       86      107
Multi-modal Understanding                  MMBench EN   Overall Score  83.4    39
Reasoning-based text-to-image generation   WISE         Overall Score  54      33
