VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

About

The performance of Latent Diffusion Models (LDMs) depends critically on the quality of their visual tokenizers. While recent works have explored incorporating Vision Foundation Models (VFMs) into tokenizer training via distillation, we empirically find that this approach inevitably weakens the robustness of the representations learned from the original VFM. In this paper, we bypass distillation with a more direct approach that leverages a frozen VFM for the LDM tokenizer, named VFM Variational Autoencoder (VFM-VAE). To fully exploit the frozen VFM, we design a new decoder that reconstructs realistic images from the semantically rich VFM representation. With the proposed VFM-VAE, we conduct a systematic study of how the representations from different tokenizers impact the representation learning process throughout diffusion training, enabling synergistic benefits of dual-side alignment on both tokenizers and diffusion models. Our efforts in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.22 in merely 80 epochs (a 10$\times$ speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62. These results offer solid evidence for the substantial potential of VFMs to serve as visual tokenizers that accelerate LDM training.

Tianci Bi, Xiaoyi Zhang, Yan Lu, Nanning Zheng • 2025

Related benchmarks

Task	Dataset	Result
Image Generation	ImageNet 256x256	--	606
Image Classification	ImageNet	Top-1 Accuracy 43.2	384
Text-to-Image Generation	MJHQ-30K	Overall FID 17	239
Image Reconstruction	ImageNet (val)	rFID 0.52	158
Image Reconstruction	ImageNet-1k 256x256 (val)	rFID 0.52	144
Image Generation	ImageNet 256x256 (test)	FID 3.41	125
Class-conditional Image Generation	ImageNet (val)	IS 300.2	116
Image Generation	ImageNet	FID 3.8	106
Text-to-Image Generation	DPG-Bench	Average Score 59.1	77
Image Reconstruction	ImageNet 50k (val)	rFID 0.52	47

Showing 10 of 12 rows
