MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation

About

Although two-stage Vector Quantized (VQ) generative models can synthesize high-fidelity, high-resolution images, their quantization operator encodes similar patches within an image into the same index, which produces repeated artifacts in similar adjacent regions with existing decoder architectures. To address this issue, we propose incorporating spatially conditional normalization to modulate the quantized vectors, inserting spatially variant information into the embedded index maps and encouraging the decoder to generate more photorealistic images. Moreover, we use multichannel quantization to increase the recombination capability of the discrete codes without increasing the cost of the model and codebook. Additionally, to generate discrete tokens at the second stage, we adopt a Masked Generative Image Transformer (MaskGIT) to learn an underlying prior distribution in the compressed latent space, which is much faster than a conventional autoregressive model. Experiments on two benchmark datasets demonstrate that our proposed modulated VQGAN greatly improves reconstructed image quality and provides high-fidelity image generation.
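
As a rough illustration of the two mechanisms the abstract names, the sketch below first shows nearest-codebook quantization collapsing similar patches onto one index, then applies a normalization whose scale and shift are computed per position from the quantized vectors. It is a toy in pure Python, not the authors' implementation: the codebook, the patch values, the decoder activations, and the weights `w_gamma` and `w_beta` are all hypothetical, and a simple dot product stands in for whatever learned layers the real model uses.

```python
import math

def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry (squared L2)."""
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices

def modulated_norm(features, gamma, beta, eps=1e-5):
    """Normalize single-channel activations over positions, then scale and shift per position."""
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    std = math.sqrt(var + eps)
    return [g * (f - mean) / std + b for f, g, b in zip(features, gamma, beta)]

# Hypothetical 2-entry codebook and three 2-D patch embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0]]
patches = [[0.9, 1.1], [1.05, 0.95], [0.1, -0.1]]

# The first two patches are similar, so they share one index.
indices = quantize(patches, codebook)
print(indices)  # [1, 1, 0]

# Look up the quantized vectors that a decoder would receive.
z_q = [codebook[i] for i in indices]

# Hypothetical weights standing in for learned layers that map z_q to scale and shift.
w_gamma = [0.5, 0.5]
w_beta = [0.1, 0.1]
gamma = [1.0 + sum(w * z for w, z in zip(w_gamma, q)) for q in z_q]
beta = [sum(w * z for w, z in zip(w_beta, q)) for q in z_q]

# Hypothetical decoder activations at the same three positions.
features = [2.0, 4.0, 6.0]
out = modulated_norm(features, gamma, beta)
print(out)
```

In this toy, the scale and shift at each position depend on the quantized vector there, while the normalized activation carries position-specific signal, so two positions sharing an index can still produce different outputs.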

Chuanxia Zheng, Long Tung Vuong, Jianfei Cai, Dinh Phung • 2022

Related benchmarks

Task	Dataset	Result
Class-conditional Image Generation	ImageNet 256x256	Inception Score (IS) 130.1	967
Class-conditional Image Generation	ImageNet 256x256 (val)	Inception Score (IS) 138.3	493
Image Reconstruction	ImageNet (val)	rFID 1.12	143
Image Reconstruction	ImageNet1K (val)	FID 1.12	124
Image Generation	ImageNet-1k (val)	FID 7.13	106
Conditional Image Generation	ImageNet-1K 256x256 (val)	gFID 7.13	86
Image Reconstruction	FFHQ (val)	PSNR 26.72	66
Image Reconstruction	ImageNet 50k 1k (val)	rFID 1.12	25
Unconditional Image Generation	FFHQ 256x256 (test)	FID 8.52	25
Image Generation	ImageNet-1K 1.0 (val)	FID 8.78	17

Showing 10 of 15 rows
