
VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

About

The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizers. While recent works have explored incorporating Vision Foundation Models (VFMs) into tokenizer training via distillation, we empirically find that this approach inevitably weakens the robustness of the representations learned from the original VFM. In this paper, we bypass distillation by proposing a more direct approach that leverages a frozen VFM for the LDM tokenizer, named VFM Variational Autoencoder (VFM-VAE). To fully exploit the frozen VFM, we design a new decoder to reconstruct realistic images from the semantically rich VFM representation. With the proposed VFM-VAE, we conduct a systematic study of how the representations from different tokenizers impact the representation learning process throughout diffusion training, enabling synergistic benefits of dual-side alignment on both tokenizers and diffusion models. Our efforts in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.22 in merely 80 epochs (a 10$\times$ speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62. These results offer solid evidence for the substantial potential of VFMs to serve as visual tokenizers that accelerate LDM training.

Tianci Bi, Xiaoyi Zhang, Yan Lu, Nanning Zheng • 2025

Related benchmarks

Task | Dataset | Result | Rank
Image Generation | ImageNet 256x256 | -- | 517
Image Classification | ImageNet | Top-1 Accuracy 43.2 | 343
Text-to-Image Generation | MJHQ-30K | Overall FID 17 | 239
Image Reconstruction | ImageNet (val) | rFID 0.52 | 143
Class-conditional Image Generation | ImageNet (val) | IS 300.2 | 116
Image Reconstruction | ImageNet-1k 256 x 256 (val) | rFID 0.52 | 112
Image Generation | ImageNet | FID 3.8 | 101
Image Generation | ImageNet 256x256 (test) | FID 3.41 | 83
Text-to-Image Generation | DPG-Bench | Average Score 59.1 | 77
Image Reconstruction | ImageNet 50k (val) | rFID 0.52 | 47
Showing 10 of 12 rows
