TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

About

Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations. Our empirical results demonstrate that TokLIP achieves exceptional data efficiency, empowering visual tokens with high-level semantic understanding while enhancing low-level generative capacity, making it well-suited for autoregressive Transformers in both comprehension and generation tasks. The code and models are available at https://github.com/TencentARC/TokLIP.
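The two-stage design described above (a discrete VQ tokenizer followed by a token encoder that yields continuous semantics) can be illustrated with a toy sketch. This is not the TokLIP implementation: the codebook, the projection matrix `proj`, and the function names are hypothetical, and mean pooling plus a linear projection stands in for the ViT-based token encoder. It only shows how the same discrete token ids can feed an autoregressive model while also being mapped to a continuous feature for comprehension.

```python
import math
import random


def quantize(patches, codebook):
    """Vector quantization: map each patch to the index of its nearest codebook entry."""
    ids = []
    for p in patches:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in codebook]
        ids.append(dists.index(min(dists)))
    return ids


def encode_tokens(ids, codebook, proj):
    """Toy stand-in for a token encoder (assumption: the real one is a ViT).

    Embeds the discrete ids, mean-pools them, applies a linear projection,
    and L2-normalizes the result into a continuous feature vector.
    """
    dim = len(codebook[0])
    pooled = [sum(codebook[i][d] for i in ids) / len(ids) for d in range(dim)]
    feat = [sum(w * x for w, x in zip(row, pooled)) for row in proj]
    norm = math.sqrt(sum(v * v for v in feat)) or 1.0
    return [v / norm for v in feat]


# Hypothetical toy sizes: 8 codebook entries of dimension 4, 3-d output feature.
rng = random.Random(0)
codebook = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
proj = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
patches = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]

token_ids = quantize(patches, codebook)                       # discrete tokens
semantic_feature = encode_tokens(token_ids, codebook, proj)   # continuous feature
```

In this sketch, `token_ids` plays the role of the standard VQ tokens, and `semantic_feature` plays the role of the continuous representation that would be aligned with CLIP-level semantics.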

Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Visual Question Answering	GQA	Accuracy 59.5	1445
Multimodal Evaluation	MME	--	902
Multimodal Understanding	MMBench	--	887
Multimodal Understanding	MM-Vet	MM-Vet Score 29.8	664
Multimodal Understanding	SEED-Bench	--	571
Multi-discipline Multimodal Understanding	MMMU	Accuracy 43.1	422
Multimodal Understanding	MME	--	207
Visual Question Answering	GQA	Score 59.5	193
Visual Understanding	MM-Vet	MM-Vet Score 29.8	190

Showing 10 of 28 rows
