Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Language-Guided Image Tokenization for Generation

About

Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide a compact, high-level semantic representation. By conditioning the tokenization process on descriptive text captions, TexTok simplifies semantic learning, allowing more learning capacity and token space to be allocated to capture fine-grained visual details, leading to enhanced reconstruction quality and higher compression rates. Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on ImageNet-256 and -512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3% and 34.3% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve a 93.5x inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively. Furthermore, we demonstrate TexTok's superiority on the text-to-image generation task, effectively utilizing the off-the-shelf text captions in tokenization. Project page is at: https://kaiwenzha.github.io/textok/.

Kaiwen Zha, Lijun Yu, Alireza Fathi, David A. Ross, Cordelia Schmid, Dina Katabi, Xiuye Gu• 2024

Related benchmarks

TaskDatasetResultRank
Class-conditional Image GenerationImageNet 256x256 (train)
IS303.1
367
Image ReconstructionImageNet (val)
rFID0.73
143
Conditional Image GenerationImageNet 512x512 (val)
gFID1.8
92
Image ReconstructionImageNet 256x256 (val)--
53
Conditional Image GenerationImageNet 256x256 (val)
Inception Score303.1
45
Video ReconstructionWebVid 10M
PSNR27.42
45
Image ReconstructionImageNet-1K 1.0 (val)
rFID0.73
35
Frame ReconstructionCOCO (val)
PSNR29.29
12
Showing 8 of 8 rows

Other info

Follow for update