Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%

About

Image quantization, as exemplified by VQGAN, encodes images into discrete tokens drawn from a codebook of predefined size. Recent advancements, particularly with LLAMA 3, reveal that enlarging the codebook significantly enhances model performance. However, VQGAN and its derivatives, such as VQGAN-FC (Factorized Codes) and VQGAN-EMA, still struggle to expand the codebook size and to improve codebook utilization. For instance, VQGAN-FC is restricted to learning a codebook with a maximum size of 16,384, with a typically low utilization rate of less than 12% on ImageNet. In this work, we propose a novel image quantization model named VQGAN-LC (Large Codebook), which extends the codebook size to 100,000 while achieving a utilization rate exceeding 99%. Unlike previous methods that optimize each codebook entry, our approach begins with a codebook initialized with 100,000 features extracted by a pre-trained vision encoder. Optimization then focuses on training a projector that aligns the entire codebook with the feature distributions of the encoder in VQGAN-LC. We demonstrate the superior performance of our model over its counterparts across a variety of tasks, including image reconstruction, image classification, auto-regressive image generation using GPT, and image creation with diffusion- and flow-based generative models. Code and models are available at https://github.com/zh460045050/VQGAN-LC.
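
The quantization idea described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the single linear projector, the toy sizes, the random stand-in features, and the function names (project, quantize, utilization) are all assumptions made for illustration. The codebook is kept fixed, a projector maps it into the encoder's latent space, each latent is assigned to its nearest projected code, and the utilization rate is the fraction of codes selected at least once.

```python
import numpy as np

def project(codebook, weight, bias):
    """Map fixed codebook entries into the latent space (assumed linear projector)."""
    return codebook @ weight + bias

def quantize(latents, projected_codebook):
    """Return, for each latent vector, the index of its nearest projected code."""
    # Squared Euclidean distance between every latent and every code: shape (N, K).
    dist = (
        (latents ** 2).sum(axis=1, keepdims=True)
        - 2.0 * latents @ projected_codebook.T
        + (projected_codebook ** 2).sum(axis=1)
    )
    return dist.argmin(axis=1)

def utilization(indices, codebook_size):
    """Fraction of codebook entries that were selected at least once."""
    return np.unique(indices).size / codebook_size

rng = np.random.default_rng(0)
K, feat_dim, latent_dim = 1000, 32, 8  # toy sizes; the paper reports K = 100,000

# Stand-in for features from a pre-trained vision encoder (random here, an assumption).
codebook = rng.normal(size=(K, feat_dim))
# Untrained projector parameters; in training, only these would be optimized.
weight = rng.normal(size=(feat_dim, latent_dim)) / np.sqrt(feat_dim)
bias = np.zeros(latent_dim)
# Stand-in for VQGAN encoder outputs, one row per image patch.
latents = rng.normal(size=(5000, latent_dim))

codes = project(codebook, weight, bias)
indices = quantize(latents, codes)
rate = utilization(indices, K)
print(f"utilization rate: {rate:.2%}")
```

Because only the projector parameters would receive gradients in this setup, every code moves whenever the projector is updated, which is one plausible reading of why utilization stays high as the codebook grows.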

Lei Zhu, Fangyun Wei, Yanye Lu, Dong Chen • 2024

Related benchmarks

Task                    Dataset               Result                      Rank
Image Reconstruction    ImageNet (val)        rFID 2.62                   143
Image Reconstruction    ImageNet1K (val)      FID 1.29                    124
Image Generation        ImageNet-1k (val)     FID 8.36                    106
Image Reconstruction    FFHQ (val)            PSNR 26.1                   66
Image Reconstruction    ImageNet 50k (val)    rFID 2.62                   47
Image Generation        ImageNet 1k (train)   FID 4.81                    38
Visual Reconstruction   ImageNet-1k (val)     rFID 1.29                   16
Discriminative Ranking  Amazon Beauty         AUC 64.46                   15
Discriminative Ranking  Industrial Dataset    AUC 0.7071                  15
Generative Retrieval    Industrial Dataset    Reconstruction Loss 0.0033  14
Showing 10 of 17 rows
