Language-Guided Image Tokenization for Generation

About

Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide a compact, high-level semantic representation. By conditioning the tokenization process on descriptive text captions, TexTok simplifies semantic learning, allowing more learning capacity and token space to be allocated to capture fine-grained visual details, leading to enhanced reconstruction quality and higher compression rates. Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on ImageNet-256 and -512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3% and 34.3% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve a 93.5x inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively. Furthermore, we demonstrate TexTok's superiority on the text-to-image generation task, effectively utilizing the off-the-shelf text captions in tokenization. Project page is at: https://kaiwenzha.github.io/textok/.
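To make the compression claim concrete, the following minimal sketch compares the token count of a grid-based latent pipeline with the 32-token setting mentioned above, and shows how a relative FID improvement percentage is computed. It is not taken from the paper. The helper names `grid_token_count` and `relative_improvement` are hypothetical, the 8x downsampling and 2x2 patching describe a typical DiT-style configuration and are assumptions here, and the FID values in the last line are made-up illustration numbers. The token ratio is also not the reported 93.5x speedup, since inference cost does not scale linearly with token count.

```python
def grid_token_count(image_size: int, downsample: int, patch_size: int = 1) -> int:
    """Number of tokens a 2D grid tokenizer produces for a square image."""
    side = image_size // (downsample * patch_size)
    return side * side


def relative_improvement(baseline_fid: float, new_fid: float) -> float:
    """Percentage reduction in FID relative to a baseline (lower FID is better)."""
    return 100.0 * (baseline_fid - new_fid) / baseline_fid


# Assumed DiT-style setup: 8x latent downsampling followed by 2x2 patches.
grid_tokens_512 = grid_token_count(512, downsample=8, patch_size=2)
compact_tokens = 32  # the token budget quoted in the abstract

print(grid_tokens_512)                   # 1024
print(grid_tokens_512 / compact_tokens)  # 32.0

# Hypothetical FID values, for illustration only.
print(relative_improvement(2.0, 1.5))    # 25.0
```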

Kaiwen Zha, Lijun Yu, Alireza Fathi, David A. Ross, Cordelia Schmid, Dina Katabi, Xiuye Gu • 2024

Related benchmarks

Task	Dataset	Result
Class-conditional Image Generation	ImageNet 256x256 (train)	IS 303.1	367
Image Reconstruction	ImageNet (val)	rFID 0.73	143
Conditional Image Generation	ImageNet 512x512 (val)	gFID 1.8	92
Image Reconstruction	ImageNet 256x256 (val)	--	53
Conditional Image Generation	ImageNet 256x256 (val)	Inception Score 303.1	45
Video Reconstruction	WebVid 10M	PSNR 27.42	45
Image Reconstruction	ImageNet-1K 1.0 (val)	rFID 0.73	35
Frame Reconstruction	COCO (val)	PSNR 29.29	12
