
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

About

The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level visual appearance, making it well-suited for visual generation but lacking high-level semantic representations for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives creates conflicts, leading to degraded performance in both reconstruction fidelity and semantic accuracy. Instead of forcing a single codebook to capture both visual appearance and semantics, DualToken disentangles them by introducing separate codebooks for high-level semantics and low-level visual details. As a result, DualToken achieves 0.25 rFID and 82.0% zero-shot accuracy on ImageNet, and demonstrates strong effectiveness in downstream MLLM tasks for both understanding and generation. Specifically, our method surpasses VILA-U by 5.8 points on average across ten visual understanding benchmarks and delivers a 13% improvement on GenAI-Bench. Notably, incorporating dual visual tokens outperforms using a single token type on both understanding and generation tasks. We hope our research offers a new perspective on leveraging dual visual vocabularies for building unified vision-language models. Project page is available at https://songweii.github.io/dualtoken-project-page.

Wei Song, Yuran Wang, Zijia Song, Yadong Li, Zenan Zhou, Long Chen, Jianhua Xu, Jiaqi Wang, Kaicheng Yu • 2025
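
The idea of keeping separate codebooks for high-level semantics and low-level visual detail can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes plain nearest-neighbour vector quantization, random stand-in features, and hypothetical names (`quantize`, `dual_tokenize`, `semantic_codebook`, `pixel_codebook`), none of which come from the abstract. It only shows how each image patch could end up with two token ids, one per vocabulary.

```python
import numpy as np


def quantize(features, codebook):
    """Return (ids, quantized): the nearest codebook entry for each feature.

    features: (N, D) array, codebook: (K, D) array.
    """
    # Squared Euclidean distance between every feature and every code: (N, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = dists.argmin(axis=1)
    return ids, codebook[ids]


def dual_tokenize(low_feats, high_feats, pixel_codebook, semantic_codebook):
    """Quantize low-level and high-level features with separate codebooks.

    Hypothetical helper: returns an (N, 2) integer array whose columns are
    the semantic token id and the pixel token id of each patch.
    """
    semantic_ids, _ = quantize(high_feats, semantic_codebook)
    pixel_ids, _ = quantize(low_feats, pixel_codebook)
    return np.stack([semantic_ids, pixel_ids], axis=1)


# Random stand-ins for encoder features (assumption: 16 patches, 8 dims).
rng = np.random.default_rng(0)
num_patches, dim = 16, 8
semantic_codebook = rng.normal(size=(32, dim))  # assumed vocabulary size 32
pixel_codebook = rng.normal(size=(64, dim))     # assumed vocabulary size 64
high_feats = rng.normal(size=(num_patches, dim))
low_feats = rng.normal(size=(num_patches, dim))

tokens = dual_tokenize(low_feats, high_feats, pixel_codebook, semantic_codebook)
print(tokens.shape)  # (16, 2)
```

In this sketch the two vocabularies never compete for the same codebook entries, which is the property the abstract contrasts with forcing a single codebook to serve both reconstruction and semantic objectives.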

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 44.3 | 631 |
| Mathematical Reasoning | MathVista | Score | 57.6 | 474 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 45.8 | 363 |
| Text-to-Image Generation | MJHQ-30K | Overall FID | 7.88 | 239 |
| Multimodal Understanding | MMMU | MMMU Score | 47.4 | 232 |
| Visual Understanding | MM-Vet | MM-Vet Score | 40.5 | 167 |
| Vision Understanding | MMBench | Accuracy | 74.9 | 141 |
| Multimodal Understanding | MME | Score | 1.63e+3 | 125 |
| Image Reconstruction | ImageNet1K (val) | FID | 0.25 | 124 |
| Image Reconstruction | ImageNet-1k 256 x 256 (val) | rFID | 0.54 | 112 |
Showing 10 of 22 rows
