LG-VQ: Language-Guided Codebook Learning

About

Vector quantization (VQ) is a key technique in high-resolution, high-fidelity image synthesis. It learns a codebook that encodes an image as a sequence of discrete codes, and images are then generated autoregressively over those codes. Although existing methods perform well, most learn a single-modal codebook (e.g., image only), which leads to suboptimal performance when the codebook is applied to multi-modal downstream tasks (e.g., text-to-image, image captioning) because of the modality gap. In this paper, we propose a novel language-guided codebook learning framework, called LG-VQ, which aims to learn a codebook that is aligned with text in order to improve performance on multi-modal downstream tasks. Specifically, we first introduce pre-trained text semantics as prior knowledge, then design two novel alignment modules (i.e., a Semantic Alignment Module and a Relationship Alignment Module) to transfer this prior knowledge into the codes and achieve codebook-text alignment. Notably, LG-VQ is model-agnostic and can be easily integrated into existing VQ models. Experimental results show that our method achieves superior performance on reconstruction and various multi-modal downstream tasks.

Guotao Liang, Baoquan Zhang, Yaowei Wang, Xutao Li, Yunming Ye, Huaibin Wang, Chuyao Luo, Kola Ye, Linfeng Luo • 2024
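
The abstract describes VQ as encoding an image with a sequence of discrete codes drawn from a learned codebook. As a rough illustration of that lookup step only, the sketch below maps feature vectors to their nearest codebook entries with NumPy. It is a generic nearest-neighbor quantizer, not the LG-VQ method: the Semantic Alignment and Relationship Alignment modules are not shown, and the function name `quantize`, the toy codebook, and the feature values are all assumptions made for this example.

```python
import numpy as np


def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (hypothetical helper)."""
    # Squared Euclidean distance between every feature (N, D) and every code (K, D).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)   # shape (N, K)
    indices = np.argmin(dists, axis=1)    # discrete codes, shape (N,)
    quantized = codebook[indices]         # looked-up code vectors, shape (N, D)
    return indices, quantized


# Toy codebook with K = 3 entries of dimension D = 2 (illustrative values only).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])

# Four made-up encoder outputs standing in for image patch features.
features = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.4, 0.4]])

indices, quantized = quantize(features, codebook)
print(indices)
```

In a trained VQ model the codebook would be learned jointly with an encoder and decoder, and the resulting index sequence is what an autoregressive generator would model; this sketch covers only the index assignment.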

Related benchmarks

Task	Dataset	Result
Video Reconstruction	WebVid 10M	PSNR 30.23	45
Visual Grounding	RefCOCO	--	15
Image Reconstruction	CUB-200	FID 3.08	13
Text-to-Image Synthesis	CelebA-HQ	FID 12.33	13
Frame Reconstruction	COCO (val)	PSNR 31.32	12
Semantic Synthesis	CelebA-HQ	FID 11.03	10
Image Reconstruction	CelebA-HQ	FID 4.9	9
Image Captioning	CUB-200	BLEU-4 1.69	8
Unconditional Image Generation	CelebA-HQ	FID 9.1	8
Image Reconstruction	MS-COCO	FID 9.69	7

Showing 10 of 13 rows
