SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer

About

Efficient image tokenization with high compression ratios remains a critical challenge for training generative models. We present SoftVQ-VAE, a continuous image tokenizer that uses soft categorical posteriors to aggregate multiple codewords into each latent token, substantially increasing the representation capacity of the latent space. Applied to Transformer-based architectures, our approach compresses 256x256 and 512x512 images into as few as 32 or 64 1-dimensional tokens. SoftVQ-VAE delivers consistent, high-quality reconstruction and, more importantly, achieves state-of-the-art and significantly faster image generation across different denoising-based generative models. It improves inference throughput by up to 18x for generating 256x256 images and up to 55x for 512x512 images, while achieving competitive FID scores of 1.78 and 2.21, respectively, with SiT-XL. It also improves the training efficiency of the generative models, reducing the number of training iterations by 2.3x while maintaining comparable performance. With its fully differentiable design and semantically rich latent space, our experiments demonstrate that SoftVQ-VAE achieves efficient tokenization without compromising generation quality, paving the way for more efficient generative models. Code and models are released.

Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, Emad Barsoum • 2024
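
The abstract's central idea, a soft categorical posterior that aggregates several codewords into one continuous latent token, can be illustrated with a small sketch. This is not the authors' implementation: the function name `soft_vq`, the temperature `tau`, the use of negative squared Euclidean distance as logits, and all shapes below are assumptions made only for illustration.

```python
import numpy as np


def soft_vq(z, codebook, tau=1.0):
    """Hypothetical soft quantizer (illustrative assumption, not the paper's code).

    z:        (n_tokens, d) encoder outputs, one row per latent token
    codebook: (K, d) learnable codewords
    tau:      softmax temperature (assumed hyperparameter)
    """
    # Squared Euclidean distance from every latent to every codeword: (n_tokens, K)
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)

    # Soft categorical posterior over codewords (numerically stable softmax)
    logits = -d2 / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    # Each token is a posterior-weighted mixture of codewords, so it stays
    # continuous rather than snapping to a single nearest codeword
    tokens = probs @ codebook
    return tokens, probs


rng = np.random.default_rng(0)
z = rng.normal(size=(32, 16))          # e.g. 32 latent tokens of width 16 (assumed)
codebook = rng.normal(size=(512, 16))  # 512 codewords (assumed size)
tokens, probs = soft_vq(z, codebook)
```

Because every step is an ordinary differentiable operation, no straight-through estimator is needed in this sketch; as `tau` shrinks, the posterior concentrates on the nearest codeword and the result approaches hard vector quantization.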

Related benchmarks

Task	Dataset	Result
Class-conditional Image Generation	ImageNet 256x256	Inception Score (IS) 293.6	967
Class-conditional Image Generation	ImageNet 256x256 (train)	IS 279	367
Image Reconstruction	ImageNet 256x256	rFID 0.61	202
Class-conditional Image Generation	ImageNet 512x512	FID 2.21	126
Class-conditional Image Generation	ImageNet 512x512 (val)	--	102
Conditional Image Generation	ImageNet 512x512 (val)	gFID 2.21	92
Image Reconstruction	ImageNet 256x256 (val)	--	53
Conditional Image Generation	ImageNet 256x256 (val)	Inception Score 279	45
Conditional Image Generation	ImageNet 256x256 (train val)	Tok. rFID 0.61	24
Conditional Image Generation	ImageNet 256x256 1.0 (train val)	FID 1.55	23

