
Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

About

Vector Quantization (VQ) is essential for discretizing continuous representations in unsupervised learning but suffers from representation collapse, causing low codebook utilization and limiting scalability. Existing solutions often rely on complex optimizations or reduce latent dimensionality, which compromises model capacity and fails to fully solve the problem. We identify the root cause as disjoint codebook optimization, where only a few code vectors are updated via gradient descent. To fix this, we propose SimVQ, which reparameterizes code vectors through a learnable linear transformation layer over a latent basis, optimizing the entire linear space rather than the nearest individual code vectors. Although the multiplication of two linear matrices is equivalent to applying a single linear layer, this simple approach effectively prevents collapse. Extensive experiments on image and audio tasks demonstrate that SimVQ improves codebook usage, is easy to implement, and generalizes well across modalities and architectures. The code is available at https://github.com/youngsheen/SimVQ.
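The reparameterization described above can be sketched in a few lines. This is a minimal NumPy illustration of the forward quantization step, not the authors' implementation: the basis initialization, shapes, and names are illustrative assumptions, and the training machinery (straight-through gradients flowing into W) is omitted. The key point it shows is that the codebook is computed as basis @ W, so updating the single matrix W moves every code vector at once.

```python
import numpy as np

rng = np.random.default_rng(0)

d, K = 8, 16  # latent dimension, codebook size (illustrative values)

# Frozen latent basis: fixed after random initialization, never updated.
basis = rng.standard_normal((K, d))

# Learnable linear layer W: in the SimVQ formulation this is the parameter
# the optimizer updates, so the entire linear span of the basis is optimized
# rather than only the nearest individual code vectors.
W = np.eye(d)

def quantize(z):
    """Nearest-neighbor lookup against the reparameterized codebook basis @ W."""
    codebook = basis @ W  # (K, d): recomputed from W on every forward pass
    # Squared Euclidean distance from each input vector to each code vector.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

z = rng.standard_normal((4, d))   # a batch of 4 encoder outputs
z_q, idx = quantize(z)            # quantized vectors and their code indices
```

In a plain VQ layer the gradient from the reconstruction loss reaches only the selected rows of the codebook; here it reaches W, and through W every row of basis @ W, which is the mechanism the abstract credits with preventing collapse.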

Yongxin Zhu, Bocheng Li, Yifei Xin, Zhihua Xia, Linli Xu • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Image Reconstruction | ImageNet | PSNR 25.3304 | 43 |
| Image Reconstruction | COCO (test) | CVU 0.9429 | 24 |
| Audio Reconstruction | Common Voice | CVU 0.0041 | 21 |
| Audio Reconstruction | LibriSpeech (test-clean, test-other) | CVU 0.004 | 21 |
