Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

About

Since the advent of popular visual generation frameworks like VQGAN and latent diffusion models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. We propose FlowMo, a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates without using convolutions, adversarial losses, spatially-aligned two-dimensional latent codes, or distilling from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. In addition, we conduct extensive analyses and explore the training of generative models atop the FlowMo tokenizer. Our code and models will be available at http://kylesargent.github.io/flowmo .

Kyle Sargent, Kyle Hsu, Justin Johnson, Li Fei-Fei, Jiajun Wu• 2025

Related benchmarks

Task	Dataset	Result
Image Generation	ImageNet 256x256	IS274	517
Image Reconstruction	ImageNet 256x256	rFID0.95	202
Image Reconstruction	ImageNet (val)	rFID0.56	143
Class-conditional generation	ImageNet 256 x 256 1k (val)	--	104
Image Reconstruction	ImageNet-1K 1.0 (val)	rFID0.95	35
Image Reconstruction	ImageNet-1K 256x256	rFID0.95	31
Image Reconstruction	ImageNet 256x256 2012 (test val)	rFID0.95	25
Image Reconstruction	ImageNet 50K 256x256 (val)	rFID0.95	16
Image Tokenization	nuScenes	PSNR27.91	8
Image Generation	CIFAR-10	rFID4.91	6

Showing 10 of 10 rows

Other info

Follow for update

@wizwand_team Discord