MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
About
Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
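The hybrid-tokenizer data flow described above — one shared vision encoder feeding a continuous adapter for understanding and a discrete (quantizing) adapter for generation — can be sketched in a toy form. This is a minimal illustrative sketch, not the paper's actual implementation: all names, dimensions, and the nearest-neighbor quantization details are assumptions made for clarity.

```python
import math
import random

# Toy sketch of Manzano's hybrid tokenizer data flow.
# All names/dimensions here are illustrative assumptions, not the paper's API.

random.seed(0)

DIM = 4            # toy feature dimension (assumption)
CODEBOOK_SIZE = 8  # toy codebook size (assumption)

def shared_encoder(image_patches):
    """Stand-in for the single shared vision encoder: one feature per patch."""
    return [[sum(p) / len(p) + i * 0.01 for i in range(DIM)]
            for p in image_patches]

def continuous_adapter(features):
    """Understanding path: pass features through as continuous embeddings
    (here a trivial identity projection)."""
    return features

# Toy codebook for the generation path.
codebook = [[random.uniform(-1, 1) for _ in range(DIM)]
            for _ in range(CODEBOOK_SIZE)]

def discrete_adapter(features):
    """Generation path: quantize each feature to the id of its nearest
    codebook entry, yielding discrete image tokens."""
    def nearest(f):
        return min(range(CODEBOOK_SIZE),
                   key=lambda k: math.dist(f, codebook[k]))
    return [nearest(f) for f in features]

# One image as a list of toy patches.
patches = [[0.1, 0.2], [0.9, 0.8], [0.5, 0.5]]
feats = shared_encoder(patches)
cont_embeddings = continuous_adapter(feats)   # -> LLM, for image-to-text
discrete_tokens = discrete_adapter(feats)     # -> LLM / diffusion decoder, for text-to-image
```

The key design point the sketch mirrors is that both paths share the same encoder output, so the continuous embeddings and discrete tokens live in a common semantic space; only the lightweight adapters differ.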
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-based Visual Question Answering | TextVQA | Accuracy | 84.3 | 496 |
| Text-to-Image Generation | GenEval | Overall Score | 85 | 467 |
| Mathematical Reasoning | MathVista | Score | 73.3 | 322 |
| OCR Evaluation | OCRBench | Score | 86.3 | 296 |
| Multi-discipline Multimodal Understanding | MMMU | -- | -- | 266 |
| Visual Question Answering | ChartQA | -- | -- | 239 |
| Multimodal Understanding | SEED-Bench | -- | -- | 203 |
| Diagram Understanding | AI2D (test) | Accuracy | 86 | 107 |
| Multimodal Understanding | MMBench EN | Overall Score | 83.4 | 39 |
| Reasoning-based Text-to-Image Generation | WISE | Overall Score | 54 | 33 |