There is No VAE: End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-training

About

Pixel-space generative models are often more difficult to train and generally underperform their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our framework achieves state-of-the-art (SOTA) performance on ImageNet. Specifically, our diffusion model reaches an FID of 1.58 on ImageNet-256 and 2.35 on ImageNet-512 with 75 function evaluations (NFE), surpassing prior pixel-space methods and VAE-based counterparts by a large margin in both generation quality and training efficiency. In a direct comparison, our model significantly outperforms DiT while using only around 30% of its training compute. Furthermore, our consistency model achieves an FID of 8.82 on ImageNet-256, significantly outperforming its latent-space counterparts. This marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models. Our code is available at: https://github.com/AMAP-ML/EPG
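As a rough illustration of what "points along the same deterministic sampling trajectory" could mean in practice, the sketch below builds two noisy views of one clean image from a single shared noise draw, in the style of consistency-training pairings. The noising path x_t = x + t * z, the function name noisy_pair, and the toy values are assumptions made for illustration only; the abstract does not specify them, and the linked repository should be treated as the reference for the actual method.

```python
import random

def noisy_pair(clean, t_small, t_large, rng):
    """Return two noisy views of `clean` that share one noise draw.

    Assumption: a variance-exploding path x_t = x + t * z, chosen only for
    illustration; the abstract does not state the noising schedule.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]  # one shared draw of z
    x_small = [x + t_small * z for x, z in zip(clean, noise)]
    x_large = [x + t_large * z for x, z in zip(clean, noise)]
    return x_small, x_large, noise

rng = random.Random(0)
clean_image = [0.2, -0.5, 0.9, 0.0]  # toy "image" flattened to 4 pixels
x_small, x_large, noise = noisy_pair(clean_image, t_small=0.5, t_large=2.0, rng=rng)
```

Because both views reuse the same z, they differ only in noise level, so an encoder could in principle be trained to map them, together with the clean image, to nearby representations. How the alignment loss is actually defined is not stated in the abstract.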

Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, Xiangxiang Chu • 2025

Related benchmarks

Task	Dataset	Result
Class-conditional Image Generation	ImageNet 256x256	--	967
Image Generation	ImageNet 256x256	--	517
Class-conditional Image Generation	ImageNet 256x256 (val)	Inception Score (IS): 283.2	493
Image Generation	ImageNet 256x256 (train)	FID: 1.58	211
Class-conditional Image Generation	ImageNet class-conditional 256x256 (test val)	FID: 8.82	81
Class-conditional Image Generation	ImageNet 512	FID: 2.35	13
