LG-VQ: Language-Guided Codebook Learning
About
Vector quantization (VQ) is a key technique in high-resolution, high-fidelity image synthesis: it learns a codebook that encodes an image as a sequence of discrete codes, from which an image is then generated autoregressively. Although existing methods perform well, most learn a single-modal codebook (e.g., image only), which leads to suboptimal performance when the codebook is applied to multi-modal downstream tasks (e.g., text-to-image synthesis, image captioning) because of the modality gap. In this paper, we propose a novel language-guided codebook learning framework, called LG-VQ, which learns a codebook that is aligned with text to improve performance on multi-modal downstream tasks. Specifically, we first introduce pre-trained text semantics as prior knowledge, then design two novel alignment modules (i.e., the Semantic Alignment Module and the Relationship Alignment Module) that transfer this prior knowledge into the codes to achieve codebook-text alignment. Notably, LG-VQ is model-agnostic and can be easily integrated into existing VQ models. Experimental results show that our method achieves superior performance on reconstruction and various multi-modal downstream tasks.
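To make the codebook idea in the abstract concrete, here is a minimal sketch of the generic VQ lookup step: each feature vector is replaced by the index of its nearest codebook entry. This is not the LG-VQ implementation and does not include the alignment modules. The function names (`squared_distance`, `quantize`), the toy codebook, and the toy features are all hypothetical, and squared Euclidean distance is assumed as the nearest-neighbor metric.

```python
# Minimal sketch of the generic vector-quantization lookup step.
# Illustrative only: not the LG-VQ implementation; all names are hypothetical.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Returns (indices, quantized), where quantized[i] is codebook[indices[i]].
    """
    indices = []
    for f in features:
        distances = [squared_distance(f, c) for c in codebook]
        # Index of the closest codebook entry (first one wins on ties).
        indices.append(distances.index(min(distances)))
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Toy codebook with three 2-D entries (assumed values; learned in practice).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
# Toy encoder outputs, one vector per image patch (assumed values).
features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices, quantized = quantize(features, codebook)
print(indices)  # the discrete code sequence for this toy "image"
```

The resulting index sequence is what an autoregressive model would be trained to predict. In a real VQ model the features come from a learned encoder and the codebook is trained jointly with it.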
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Reconstruction | WebVid 10M | PSNR | 30.23 | 34 |
| Image Reconstruction | CUB-200 | FID | 3.08 | 13 |
| Text-to-Image Synthesis | CelebA-HQ | FID | 12.33 | 13 |
| Frame Reconstruction | COCO (val) | PSNR | 31.32 | 12 |
| Semantic Synthesis | CelebA-HQ | FID | 11.03 | 10 |
| Image Reconstruction | CelebA-HQ | FID | 4.9 | 9 |
| Image Captioning | CUB-200 | BLEU-4 | 1.69 | 8 |
| Unconditional Image Generation | CelebA-HQ | FID | 9.1 | 8 |
| Image Reconstruction | MS-COCO | FID | 9.69 | 7 |
| Visual Question Answering | COCO-QA | Accuracy | 40.97 | 7 |