Vector-quantized Image Modeling with Improved VQGAN
About
Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded by a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN, from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN in turn improves vector-quantized image modeling tasks, including unconditional and class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at \(256\times256\) resolution, we achieve an Inception Score (IS) of 175.1 and a Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised pretraining, we further evaluate the pretrained Transformer by averaging intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained VIM-L raises linear-probe accuracy from 60.3% (iGPT-L) to 73.2% at a similar model size. VIM-L also outperforms iGPT-XL, which is trained with extra web image data and a larger model size.
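To make the tokenization step concrete, the sketch below shows the general idea of vector quantization followed by rasterization: each cell of an encoder feature grid is mapped to the index of its nearest codebook entry, and the indices are flattened row by row into a token sequence for next-token prediction. It illustrates the technique only and is not the ViT-VQGAN implementation. The function names, the 2-D toy codebook, and the feature values are all hypothetical, and nearest-neighbor lookup by squared Euclidean distance is an assumption made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def rasterize_tokens(feature_grid, codebook):
    """Quantize every grid cell, then flatten row by row (raster order)."""
    return [quantize(cell, codebook) for row in feature_grid for cell in row]


# Hypothetical toy codebook: 3 entries of dimension 2.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical 2x2 grid of encoder outputs (one vector per image patch).
feature_grid = [
    [[0.1, -0.1], [0.9, 0.2]],
    [[0.2, 0.8], [0.05, 0.0]],
]

tokens = rasterize_tokens(feature_grid, codebook)
print(tokens)  # [0, 1, 2, 0]

# Autoregressive training pairs: each token is predicted from its prefix,
# so position t's target is the token at position t + 1.
next_token_pairs = list(zip(tokens[:-1], tokens[1:]))
print(next_token_pairs)  # [(0, 1), (1, 2), (2, 0)]
```

In a real system the grid would come from a learned encoder and the codebook would be trained jointly with it. The toy values here only show how a 2-D grid of continuous vectors becomes a 1-D sequence of discrete tokens.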
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Class-conditional Image Generation | ImageNet 256x256 | IS | 227.4 | 815 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | FID | 3.04 | 427 |
| Image Generation | ImageNet 256x256 | IS | 175.1 | 359 |
| Class-conditional Image Generation | ImageNet 256x256 (train) | IS | 227.4 | 345 |
| Image Generation | ImageNet 256x256 (val) | FID | 4.17 | 340 |
| Class-conditional Image Generation | ImageNet 256x256 (test) | FID | 3.04 | 208 |
| Image Generation | ImageNet 256x256 (train) | FID | 4.17 | 164 |
| Class-conditional Image Generation | ImageNet | FID | 4.17 | 158 |
| Image Reconstruction | ImageNet 256x256 | rFID | 1.28 | 150 |
| Image Generation | ImageNet-1K 256x256 (val) | IS | 227.4 | 113 |