
WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

About

Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between these tasks, as a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves a win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further enhance understanding capability, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, enabling them to inherit strong discriminative power while maintaining flexibility. Across 10 challenging benchmarks, WinTok delivers consistent improvements in reconstruction, understanding, and generation. Trained on only 50M open-source data, WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41, despite using substantially less training data. Code is released at https://github.com/markywg/WinTok.
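The two ideas named in the abstract, appending learnable semantic tokens to the pixel tokens and distilling only those semantic tokens toward a pretrained teacher, can be illustrated with a toy example. The following is a minimal sketch and not the authors' implementation: the function names, the token counts, the shared embedding dimension, and the cosine-based loss are all assumptions made for illustration. The abstract does not specify the loss or the architecture, and a real system would likely need a projection layer between student and teacher dimensions.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def build_hybrid_sequence(pixel_tokens, semantic_tokens):
    """Append the semantic tokens to the pixel tokens (hypothetical layout)."""
    return list(pixel_tokens) + list(semantic_tokens)


def semantic_distillation_loss(semantic_tokens, teacher_embeddings):
    """Mean (1 - cosine similarity) over semantic tokens only.

    Assumption: the pixel tokens receive no distillation signal, which is
    one possible reading of "asymmetric" in the abstract.
    """
    if len(semantic_tokens) != len(teacher_embeddings):
        raise ValueError("need one teacher embedding per semantic token")
    losses = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(semantic_tokens, teacher_embeddings)
    ]
    return sum(losses) / len(losses)


# Toy data: 16 pixel tokens and 4 semantic tokens of dimension 8 (all hypothetical).
rng = random.Random(0)
dim = 8
pixel_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
semantic_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
# Stand-in for frozen embeddings from a visual foundation model.
teacher_embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4)]

sequence = build_hybrid_sequence(pixel_tokens, semantic_tokens)
loss = semantic_distillation_loss(semantic_tokens, teacher_embeddings)
print(len(sequence), round(loss, 4))
```

In this sketch the pixel tokens never enter the distillation term, so a reconstruction objective on them would not compete directly with the semantic guidance. Whether WinTok realizes the decoupling in this way should be checked against the released code.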

Yiwei Guo, Shaobin Zhuang, Zhipeng Huang, Canmiao Fu, Chen Li, Jing Lyu, Yali Wang • 2026

Related benchmarks

Task                       Dataset                      Metric            Result  Rank
Multimodal Understanding   MMBench                      --                --      847
Multimodal Understanding   MM-Vet                       MM-Vet Score      34.6    631
Text-to-Image Generation   DPG-Bench                    Overall Score     83.36   451
Visual Question Answering  TextVQA                      TextVQA Accuracy  55.2    210
Visual Question Answering  GQA                          GQA Score         62.4    139
Image Reconstruction       ImageNet1K (val)             FID               0.41    124
Image Reconstruction       ImageNet-1k 256 x 256 (val)  rFID              0.41    112
Multimodal Understanding   POPE                         POPE Score        0.865   112
Visual Generation          GenEval                      Two Obj. Acc      88      43
Image Reconstruction       MS-COCO 2017 (val)           rFID              4.24    33
Showing 10 of 12 rows
