Regularized Vector Quantization for Tokenized Image Synthesis

About

Quantizing images into discrete representations has been a fundamental problem in unified generative modeling. Predominant approaches learn the discrete representation either in a deterministic manner by selecting the best-matching token or in a stochastic manner by sampling from a predicted distribution. However, deterministic quantization suffers from severe codebook collapse and misalignment with the inference stage, while stochastic quantization suffers from low codebook utilization and a perturbed reconstruction objective. This paper presents a regularized vector quantization framework that mitigates these issues effectively by applying regularization from two perspectives. The first is a prior distribution regularization, which measures the discrepancy between a prior token distribution and the predicted token distribution to avoid codebook collapse and low codebook utilization. The second is a stochastic mask regularization, which introduces stochasticity during quantization to strike a good balance between inference-stage misalignment and an unperturbed reconstruction objective. In addition, we design a probabilistic contrastive loss that serves as a calibrated metric to further mitigate the perturbed reconstruction objective. Extensive experiments show that the proposed quantization framework consistently outperforms prevailing vector quantization methods across different generative models, including auto-regressive models and diffusion models.

Jiahui Zhang, Fangneng Zhan, Christian Theobalt, Shijian Lu • 2023
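
The two regularizers named in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration, not the authors' implementation: the prior is taken to be uniform over the codebook, the discrepancy is taken to be a KL divergence from the averaged predicted distribution to that prior, and the stochastic mask is modeled as a per-position coin flip that chooses between sampling a token and taking the best-matching one. The function names `prior_regularization` and `masked_quantize` and the parameter `mask_ratio` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prior_regularization(token_logits):
    """KL(mean predicted token distribution || uniform prior).

    Assumption: uniform prior and this KL direction; the abstract only
    says a discrepancy between prior and predicted distributions is used.
    The value is 0 when all tokens are used equally and log(K) when
    every position puts all its mass on the same token.
    """
    probs = [softmax(row) for row in token_logits]
    k = len(probs[0])
    mean = [sum(p[j] for p in probs) / len(probs) for j in range(k)]
    return sum(q * math.log(q * k) for q in mean if q > 0.0)


def masked_quantize(token_logits, mask_ratio, rng):
    """Choose one token index per position.

    Assumption: each position is quantized stochastically (sampled from
    its predicted distribution) with probability mask_ratio, and
    deterministically (argmax) otherwise.
    """
    indices = []
    for row in token_logits:
        probs = softmax(row)
        if rng.random() < mask_ratio:
            idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        else:
            idx = max(range(len(probs)), key=probs.__getitem__)
        indices.append(idx)
    return indices


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [[0.1, 2.0, 0.3], [5.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
    print("regularizer:", prior_regularization(logits))
    print("tokens:", masked_quantize(logits, mask_ratio=0.5, rng=rng))
```

Under these assumptions, `mask_ratio=0.0` reduces to purely deterministic quantization and `mask_ratio=1.0` to purely stochastic quantization, which is one way to read the trade-off between inference-stage misalignment and a perturbed reconstruction objective that the abstract describes.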

Related benchmarks

Task	Dataset	Result
Image Reconstruction	CelebA-HQ (test)	FID (Reconstruction) 10.09	50
Semantic Image Synthesis	ADE20K (val)	FID 34.47	47
Text-to-Image Synthesis	CUB-200-2011 (test)	--	20
Semantic Synthesis	CelebA-HQ	FID 15.34	10
Text-to-Image Synthesis	MS-COCO 2017 (test)	FID 19.91	7
Image Reconstruction	ADE20K semantic labels (val)	FID (Reconstruction) 23.69	4
Image Reconstruction	CUB-200 (test)	FID (Reconstruction) 10.84	4
Image Reconstruction	MS-COCO 2017 (test)	FID 13.76	4
Semantic Image Synthesis	CelebA-HQ (test)	FID (G) 15.34	4

