UniTok: A Unified Tokenizer for Visual Generation and Understanding

About

Visual generative and understanding models typically rely on distinct tokenizers to process images, presenting a key challenge for unifying them within a single framework. Recent studies attempt to address this by connecting the training of VQVAE (for autoregressive generation) and CLIP (for understanding) to build a unified tokenizer. However, directly combining these training objectives has been observed to cause severe loss conflicts. In this paper, we show that reconstruction and semantic supervision do not inherently conflict. Instead, the underlying bottleneck stems from the limited representational capacity of the discrete token space. Building on these insights, we introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism that effectively scales up the vocabulary size and bottleneck dimension. In terms of final performance, UniTok sets a new record of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet. In addition, UniTok can be seamlessly integrated into MLLMs to unlock native visual generation capability without compromising understanding performance. We also show that UniTok favors CFG-free generation, reducing gFID from 14.6 to 2.5 on the ImageNet 256$\times$256 benchmark. GitHub: https://github.com/FoundationVision/UniTok.
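
The abstract names multi-codebook quantization but does not spell out the mechanism. The following is a minimal sketch of one common reading of the idea, not the authors' implementation: it assumes the latent vector is split into equal chunks and each chunk is matched to the nearest entry of its own sub-codebook. Under that assumption, n sub-codebooks of size K can express K^n distinct combinations, which is how the effective vocabulary could grow without one very large codebook. All function names, sizes, and the random codebooks are hypothetical and chosen only for illustration.

```python
import random


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def multi_codebook_quantize(latent, codebooks):
    """Quantize each chunk of `latent` with its own sub-codebook.

    Assumption (not stated in the abstract): the latent is split into
    len(codebooks) equal-sized chunks, one per sub-codebook.
    Returns (list of chosen indices, concatenated quantized vector).
    """
    n = len(codebooks)
    if n == 0 or len(latent) % n != 0:
        raise ValueError("latent length must be divisible by the number of codebooks")
    chunk = len(latent) // n
    indices, quantized = [], []
    for i, codebook in enumerate(codebooks):
        part = latent[i * chunk:(i + 1) * chunk]
        idx = nearest_code(part, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


def make_codebooks(num_codebooks, codebook_size, chunk_dim, seed=0):
    """Build random sub-codebooks (hypothetical stand-ins for learned ones)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(chunk_dim)] for _ in range(codebook_size)]
        for _ in range(num_codebooks)
    ]


# Illustrative sizes only: 4 sub-codebooks of 16 entries over an 8-dim latent.
codebooks = make_codebooks(num_codebooks=4, codebook_size=16, chunk_dim=2)
latent = [0.1 * i for i in range(8)]
indices, quantized = multi_codebook_quantize(latent, codebooks)

# With 4 sub-codebooks of 16 entries, the number of index combinations is 16**4.
effective_vocab = 16 ** 4
```

Each image token is then represented by a tuple of small indices rather than one index into a single large table, so every nearest-neighbor lookup stays cheap while the combined code space grows exponentially with the number of sub-codebooks.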

Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, Xiaojuan Qi • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | -- | 2019 |
| Visual Question Answering | GQA | Accuracy: 61.1 | 1425 |
| Text-based Visual Question Answering | TextVQA | Accuracy: 51.6 | 962 |
| Multimodal Understanding | MMBench | -- | 847 |
| Text-to-Image Generation | GenEval | Overall Score: 59 | 704 |
| Multimodal Understanding | MM-Vet | MM-Vet Score: 33.9 | 631 |
| Text-to-Image Generation | GenEval | Overall Score: 59 | 517 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | Inception Score (IS): 216.7 | 493 |
| Text-to-Image Generation | DPG-Bench | Overall Score: 81.18 | 451 |
| Text-to-Image Generation | GenEval | GenEval Score: 59 | 442 |
Showing 10 of 80 rows