WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

About

Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between these tasks, because a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further enhance understanding capability, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, enabling them to inherit strong discriminative power while maintaining flexibility. Across 10 challenging benchmarks, WinTok delivers consistent improvements in reconstruction, understanding, and generation. Trained on only 50M open-source samples, WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41, despite using substantially less training data. Code is released at https://github.com/markywg/WinTok.
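The two ideas in the abstract, appending learnable semantic tokens to the pixel tokens and distilling only the semantic tokens from a pretrained teacher, can be illustrated with a small sketch. This is not the released WinTok code: the function names `build_hybrid_sequence` and `cosine_distillation_loss` are hypothetical, and the cosine form of the loss is an assumption, since the abstract does not state which distillation objective is used.

```python
import math


def build_hybrid_sequence(pixel_tokens, semantic_tokens):
    """Append semantic tokens to pixel tokens to form one token sequence.

    Hypothetical helper: the abstract only says semantic tokens
    "supplement" pixel tokens; plain concatenation is an assumption.
    """
    return list(pixel_tokens) + list(semantic_tokens)


def cosine_distillation_loss(student_tokens, teacher_tokens):
    """Mean (1 - cosine similarity) over paired token vectors.

    Assumption: a cosine objective stands in for the paper's
    distillation loss, which the abstract does not specify.
    """
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher need the same number of tokens")
    total = 0.0
    for s, t in zip(student_tokens, teacher_tokens):
        dot = sum(a * b for a, b in zip(s, t))
        norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / max(norm, 1e-12)  # guard against zero vectors
    return total / len(student_tokens)


# Toy example: 4 pixel tokens and 2 semantic tokens, each 2-dimensional.
pixel = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
semantic = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[2.0, 0.0], [0.0, 1.0]]  # stand-in for foundation-model embeddings

sequence = build_hybrid_sequence(pixel, semantic)
# Only the semantic tokens are compared with the teacher; the pixel tokens
# are left to a reconstruction objective that is not sketched here.
loss = cosine_distillation_loss(semantic, teacher)
```

In this sketch the semantic tokens already point in the same directions as the teacher embeddings, so the loss is zero up to floating-point error; the cosine form ignores vector magnitude, which is one reason it is a common choice when student and teacher features have different scales.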

Yiwei Guo, Shaobin Zhuang, Zhipeng Huang, Canmiao Fu, Chen Li, Jing Lyu, Yali Wang • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Multimodal Understanding	MMBench	--	--	887
Multimodal Understanding	MM-Vet	MM-Vet Score	34.6	664
Text-to-Image Generation	DPG-Bench	Overall Score	83.36	510
Visual Question Answering	TextVQA	TextVQA Accuracy	55.2	210
Visual Question Answering	GQA	GQA Score	62.4	152
Image Reconstruction	ImageNet-1k 256 x 256 (val)	rFID	0.41	144
Image Reconstruction	ImageNet1K (val)	FID	0.41	124
Multimodal Understanding	POPE	POPE Score	0.865	116
Visual Generation	GenEval	Two Obj. Acc	88	43
Multimodal Perception	MME-P	MME-P Score	1.55e+3	35

Showing 10 of 12 rows
