Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

About

In this work, we present a novel direction to build an image tokenizer directly on top of a frozen vision foundation model, which is a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation -- achieving a gFID of 1.36 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code is available at https://github.com/CVMI-Lab/VFMTok.
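The two components named in the abstract can be illustrated with a toy sketch. This is not the VFMTok implementation: the abstract does not specify the loss or the quantizer, so the sketch assumes a cosine-distance form for the semantic reconstruction objective and swaps in plain nearest-neighbour vector quantization for the region-adaptive scheme. The function names `semantic_alignment_loss` and `quantize` are hypothetical, and features are represented as plain Python lists of floats.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_alignment_loss(predicted, target):
    """Mean (1 - cosine similarity) over paired feature vectors.

    ASSUMPTION: a cosine-distance loss stands in for the paper's
    semantic reconstruction objective, whose exact form is not
    given in the abstract.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    distances = [1.0 - cosine_similarity(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)


def quantize(features, codebook):
    """Map each feature to the index of its nearest codebook entry.

    ASSUMPTION: plain vector quantization by squared Euclidean
    distance, NOT the region-adaptive quantization of the paper.
    """
    indices = []
    for feature in features:
        dists = [sum((a - b) ** 2 for a, b in zip(feature, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


if __name__ == "__main__":
    codebook = [[1.0, 0.0], [0.0, 1.0]]
    frozen_features = [[0.9, 0.1], [0.0, 1.2]]  # stand-in for frozen encoder outputs
    token_ids = quantize(frozen_features, codebook)
    decoded = [codebook[i] for i in token_ids]  # stand-in for tokenizer outputs
    print(token_ids, semantic_alignment_loss(decoded, frozen_features))
```

In this toy setting the loss is zero when the decoded features point in the same direction as the frozen model's features and grows as they diverge, which is the sense in which such an objective would preserve semantic fidelity.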

Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, Xiaojuan Qi • 2025

Related benchmarks

Task | Dataset | Result | Rank
Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS): 280.4 | 967
Image Classification | ImageNet-1k (val) | Top-1 Accuracy: 69.4 | 920
Image Generation | ImageNet 256x256 | IS: 278.8 | 517
Class-conditional Image Generation | ImageNet 256x256 (train) | IS: 276 | 367
Image Reconstruction | ImageNet 256x256 | rFID: 0.89 | 202
Image Reconstruction | ImageNet-1k 256 x 256 (val) | rFID: 0.89 | 112
Class-conditional Image Generation | ImageNet 256x256 2012 (val) | FID: 2.16 | 63
Class-conditional Image Generation | ImageNet class-conditional 256x256 | Inception Score (IS): 278.8 | 61
Image Reconstruction | ImageNet 256x256 2012 (val) | rFID: 0.89 | 43
Image Reconstruction | ImageNet-1K 256x256 | rFID: 0.89 | 31
Showing 10 of 14 rows
