Autoregressive Image Generation using Residual Quantization
About
For autoregressive (AR) modeling of high-resolution images, vector quantization (VQ) represents an image as a sequence of discrete codes. A short sequence is important because it reduces the computational cost an AR model incurs when modeling long-range interactions between codes. However, we postulate that previous VQ methods cannot both shorten the code sequence and generate high-fidelity images, given the rate-distortion trade-off. In this study, we propose a two-stage framework, consisting of Residual-Quantized VAE (RQ-VAE) and RQ-Transformer, to effectively generate high-resolution images. Given a fixed codebook size, RQ-VAE can precisely approximate a feature map of an image and represent the image as a stacked map of discrete codes. Then, RQ-Transformer learns to predict the quantized feature vector at the next position by predicting the next stack of codes. Thanks to the precise approximation of RQ-VAE, we can represent a 256$\times$256 image as an 8$\times$8 feature map, which lets RQ-Transformer efficiently reduce computational costs. Consequently, our framework outperforms existing AR models on various benchmarks of unconditional and conditional image generation. Our approach also samples high-quality images significantly faster than previous AR models.
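The idea of representing each feature vector as a stack of codes can be illustrated with a minimal NumPy sketch of greedy residual quantization. This is not the authors' implementation: the function name `residual_quantize`, the single shared codebook, the random data, the 16-channel feature size, and the 32-entry codebook are all assumptions chosen for illustration. At each depth the nearest codebook entry to the current residual is selected, added to the running approximation, and subtracted from the residual.

```python
import numpy as np


def residual_quantize(z, codebook, depth):
    """Greedy residual quantization (illustrative sketch, not the paper's code).

    z:        (N, C) feature vectors
    codebook: (K, C) code embeddings, shared across all depths (an assumption)
    depth:    number of codes stacked per feature vector
    Returns (codes, approx): codes has shape (N, depth), approx has shape (N, C).
    """
    residual = z.astype(np.float64).copy()
    approx = np.zeros_like(residual)
    codes = np.empty((z.shape[0], depth), dtype=np.int64)
    for d in range(depth):
        # Squared Euclidean distance from every residual to every code: (N, K).
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        codes[:, d] = idx
        approx += codebook[idx]    # refine the approximation
        residual -= codebook[idx]  # quantize what is left at the next depth
    return codes, approx


rng = np.random.default_rng(0)
# An 8x8 feature map flattened to 64 positions; 16 channels is a made-up size.
features = rng.normal(size=(8 * 8, 16))
codebook = rng.normal(size=(32, 16))
# Demo convenience only: a zero entry guarantees that adding another depth
# can never increase the error in this toy setting.
codebook[0] = 0.0

errors = []
for depth in (1, 2, 3, 4):
    codes, approx = residual_quantize(features, codebook, depth)
    errors.append(float(((features - approx) ** 2).mean()))
    print(f"depth={depth}  codes per image={codes.size}  mse={errors[-1]:.4f}")
```

In this toy setting the number of spatial positions stays at 64 while the reconstruction error does not grow as depth increases, which is the trade-off the abstract describes: a short sequence of positions, with precision recovered by stacking codes at each position.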
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS) | 323.7 | 441 |
| Image Generation | ImageNet 256x256 (val) | FID | 7.55 | 307 |
| Class-conditional Image Generation | ImageNet 256x256 (train) | IS | 323.7 | 305 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | FID | 3.8 | 293 |
| Image Generation | ImageNet 256x256 | FID | 3.8 | 243 |
| Class-conditional Image Generation | ImageNet 256x256 (train val) | FID | 7.55 | 178 |
| Class-conditional Image Generation | ImageNet 256x256 (test) | FID | 3.8 | 167 |
| Class-conditional Image Generation | ImageNet | FID | 7.55 | 132 |
| Unconditional Image Generation | LSUN Bedrooms unconditional | FID | 3.04 | 96 |
| Image Reconstruction | ImageNet 256x256 | rFID | 1.83 | 93 |