Regularized Vector Quantization for Tokenized Image Synthesis
About
Quantizing images into discrete representations has been a fundamental problem in unified generative modeling. Predominant approaches learn the discrete representation either deterministically, by selecting the best-matching token, or stochastically, by sampling from a predicted distribution. However, deterministic quantization suffers from severe codebook collapse and misalignment with the inference stage, while stochastic quantization suffers from low codebook utilization and a perturbed reconstruction objective. This paper presents a regularized vector quantization framework that effectively mitigates the above issues by applying regularization from two perspectives. The first is a prior distribution regularization, which measures the discrepancy between a prior token distribution and the predicted token distribution to avoid codebook collapse and low codebook utilization. The second is a stochastic mask regularization, which introduces stochasticity during quantization to strike a good balance between inference-stage misalignment and an unperturbed reconstruction objective. In addition, we design a probabilistic contrastive loss that serves as a calibrated metric to further mitigate the perturbed reconstruction objective. Extensive experiments show that the proposed quantization framework consistently outperforms prevailing vector quantization methods across different generative models, including auto-regressive models and diffusion models.
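The two regularizers described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the prior, the discrepancy measure, or the sampling scheme, so the sketch assumes a uniform prior, a KL divergence computed against the batch-averaged predicted distribution, and Gumbel-max sampling at randomly masked positions. The names `prior_kl`, `masked_quantize`, and `mask_ratio` are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the codebook axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prior_kl(logits, eps=1e-10):
    """Assumed form of the prior regularizer: KL(uniform || mean prediction).

    logits: array of shape (N, K), one row of token logits per position.
    The value is near zero when all K tokens are used equally on average
    and grows when predictions concentrate on a few tokens.
    """
    avg = softmax(logits).mean(axis=0)       # (K,) average token usage
    k = avg.shape[0]
    prior = np.full(k, 1.0 / k)              # assumption: uniform prior
    return float(np.sum(prior * (np.log(prior) - np.log(avg + eps))))

def masked_quantize(logits, codebook, mask_ratio, rng):
    """Assumed form of the stochastic mask: sample tokens at a random
    subset of positions (Gumbel-max) and take the argmax elsewhere."""
    n = logits.shape[0]
    deterministic = logits.argmax(axis=1)
    sampled = (logits + rng.gumbel(size=logits.shape)).argmax(axis=1)
    mask = rng.random(n) < mask_ratio        # True = stochastic position
    idx = np.where(mask, sampled, deterministic)
    return codebook[idx], idx, mask

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 8))            # 16 positions, 8 codebook entries
codebook = rng.normal(size=(8, 4))           # 8 entries of dimension 4
quantized, idx, mask = masked_quantize(logits, codebook, 0.5, rng)
penalty = prior_kl(logits)
```

In this sketch, `mask_ratio = 0` recovers purely deterministic quantization and `mask_ratio = 1` recovers purely stochastic quantization, so intermediate values trade off between the two regimes the abstract contrasts.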
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Reconstruction | CelebA-HQ (test) | FID (Reconstruction) 10.09 | 50 |
| Semantic Image Synthesis | ADE20K (val) | FID 34.47 | 47 |
| Text-to-Image Synthesis | CUB-200-2011 (test) | -- | 20 |
| Semantic Synthesis | CelebA-HQ | FID 15.34 | 10 |
| Text-to-Image Synthesis | MS-COCO 2017 (test) | FID 19.91 | 7 |
| Image Reconstruction | ADE20K semantic labels (val) | FID (Reconstruction) 23.69 | 4 |
| Image Reconstruction | CUB-200 (test) | FID (Reconstruction) 10.84 | 4 |
| Image Reconstruction | MS-COCO 2017 (test) | FID 13.76 | 4 |
| Semantic Image Synthesis | CelebA-HQ (test) | FID (G) 15.34 | 4 |