WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens
About
Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between these tasks, because a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further enhance understanding capability, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, enabling them to inherit strong discriminative power while maintaining flexibility. Across 10 challenging benchmarks, WinTok delivers consistent improvements in reconstruction, understanding, and generation. Despite using substantially less training data (only 50M open-source samples), WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41. Code is released at https://github.com/markywg/WinTok.
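The distillation idea in the abstract can be illustrated with a small, hedged sketch. It is not the released WinTok code. It assumes that the token sequence is laid out as pixel tokens followed by semantic tokens, that the teacher embeddings already have the same dimension as the semantic tokens, and that the loss is a cosine distance. None of these details are stated in the abstract, and the function names are hypothetical. The one property taken from the text is that only the semantic tokens are guided by the teacher, so the pixel tokens never enter the loss.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_distillation_loss(tokens, num_pixel_tokens, teacher_embeddings):
    """Mean (1 - cosine similarity) over the semantic tokens only.

    Assumption (not from the paper): tokens = pixel tokens + semantic tokens,
    in that order, and each semantic token is paired with one teacher embedding.
    """
    semantic_tokens = tokens[num_pixel_tokens:]  # pixel tokens are skipped
    if len(semantic_tokens) != len(teacher_embeddings):
        raise ValueError("need one teacher embedding per semantic token")
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(semantic_tokens, teacher_embeddings)
    ]
    return sum(distances) / len(distances)


# Toy example with 2-dimensional tokens.
pixel_tokens = [[0.3, -1.2], [0.9, 0.4]]
semantic_tokens = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[2.0, 0.0], [0.0, -1.0]]  # stand-in for foundation-model embeddings

loss = semantic_distillation_loss(
    pixel_tokens + semantic_tokens, len(pixel_tokens), teacher
)
print(loss)
```

In this toy case the first semantic token is aligned with its teacher embedding and the second points in the opposite direction, so the two distances are 0 and 2. Changing the pixel tokens leaves the value unchanged, which is the decoupling the abstract describes.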
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMBench | -- | -- | 847 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 34.6 | 631 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 83.36 | 451 |
| Visual Question Answering | TextVQA | TextVQA Accuracy | 55.2 | 210 |
| Visual Question Answering | GQA | GQA Score | 62.4 | 139 |
| Image Reconstruction | ImageNet1K (val) | FID | 0.41 | 124 |
| Image Reconstruction | ImageNet-1k 256 x 256 (val) | rFID | 0.41 | 112 |
| Multimodal Understanding | POPE | POPE Score | 0.865 | 112 |
| Visual Generation | GenEval | Two Obj. Acc | 88 | 43 |
| Image Reconstruction | MS-COCO 2017 (val) | rFID | 4.24 | 33 |