Quantize-then-Rectify: Efficient VQ-VAE Training

About

Visual tokenizers are pivotal in multimodal large models, acting as bridges between continuous inputs and discrete tokens. Nevertheless, training high-compression-rate VQ-VAEs remains computationally demanding, often necessitating thousands of GPU hours. This work demonstrates that a pre-trained VAE can be efficiently transformed into a VQ-VAE by controlling quantization noise within the VAE's tolerance threshold. We present Quantize-then-Rectify (ReVQ), a framework leveraging pre-trained VAEs to enable rapid VQ-VAE training with minimal computational overhead. By integrating channel multi-group quantization to enlarge codebook capacity and a post rectifier to mitigate quantization errors, ReVQ compresses ImageNet images into at most 512 tokens while sustaining competitive reconstruction quality (rFID = 1.06). Significantly, ReVQ reduces training costs by over two orders of magnitude relative to state-of-the-art approaches: ReVQ finishes full training on a single NVIDIA 4090 in approximately 22 hours, whereas comparable methods require 4.5 days on 32 A100 GPUs. Experimental results show that ReVQ achieves superior efficiency-reconstruction trade-offs.
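To make the idea of multi-group quantization concrete, here is a minimal NumPy sketch. It is not taken from the authors' code: it assumes one plausible reading of the abstract, in which the channel dimension of a latent vector is split into equal groups and each group is snapped to the nearest entry of its own codebook. The function name group_quantize, the array shapes, and the random codebooks are all hypothetical, and the post rectifier (a learned module that corrects the quantization error) is not shown.

```python
import numpy as np

def group_quantize(z, codebooks):
    """Quantize latents group by group along the channel axis (illustrative only).

    z:         (N, C) array of continuous latent vectors.
    codebooks: (G, K, C // G) array, one codebook of K entries per channel group.
    Returns (z_q, idx): quantized latents (N, C) and code indices (N, G).
    """
    n, c = z.shape
    g, k, d = codebooks.shape
    assert c == g * d, "channels must split evenly into groups"
    z_groups = z.reshape(n, g, d)                      # (N, G, d)
    # Squared distance from each group slice to every code in that group's codebook.
    diff = z_groups[:, :, None, :] - codebooks[None]   # (N, G, K, d)
    dist = (diff ** 2).sum(axis=-1)                    # (N, G, K)
    idx = dist.argmin(axis=-1)                         # (N, G)
    # Gather the selected code for every (sample, group) pair.
    z_q = codebooks[np.arange(g)[None, :], idx]        # (N, G, d)
    return z_q.reshape(n, c), idx

# Hypothetical sizes: 16 latent vectors, 8 channels, 4 groups, 32 codes per group.
rng = np.random.default_rng(0)
z = rng.normal(size=(16, 8))
codebooks = rng.normal(size=(4, 32, 2))
z_q, idx = group_quantize(z, codebooks)
quantization_noise = z_q - z   # the error a rectifier would then try to reduce
```

With G groups of K codes each, only G * K code vectors are stored, yet the groups can be combined in K ** G distinct ways, which is the sense in which grouping enlarges the effective codebook capacity.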

Borui Zhang, Qihang Rao, Wenzhao Zheng, Jie Zhou, Jiwen Lu • 2025

Related benchmarks

Task	Dataset	Result	Rank
Image Reconstruction	ImageNet (val)	rFID 2.05	143
Image Reconstruction	ImageNet 50K 256x256 (val)	rFID 2.05	16
