DiP: Taming Diffusion Models in Pixel Space

About

Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches for efficient global structure construction, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP achieves up to 10$\times$ faster inference than previous methods while increasing the total number of parameters by only 0.3%, and reaches an FID of 1.79 on ImageNet 256$\times$256.

Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, Ying Tai • 2025
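
The two-stage split described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the patch size (16), the hidden width (64), the single linear layers standing in for the DiT backbone and the Patch Detailer Head, and the choice to feed the head both the per-patch feature and the noisy patch pixels are all assumptions made for illustration. The sketch only shows the data flow: large patches give a short token sequence for the global stage, and a per-patch head maps each token back to full-resolution pixels.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches -> (N, p*p*C)."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    x = img.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (h/p, w/p, p, p, c)
    return x.reshape((h // p) * (w // p), p * p * c)

def unpatchify(tokens, p, h, w, c):
    """Inverse of patchify: (N, p*p*C) -> (H, W, C)."""
    x = tokens.reshape(h // p, w // p, p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (h/p, p, w/p, p, c)
    return x.reshape(h, w, c)

def num_tokens(h, w, p):
    """Sequence length seen by the backbone for patch size p."""
    return (h // p) * (w // p)

rng = np.random.default_rng(0)
H = W = 256
C = 3
P = 16   # assumed "large" patch size (hypothetical, not from the source)
D = 64   # assumed hidden width (hypothetical)

noisy = rng.standard_normal((H, W, C))
tokens = patchify(noisy, P)                   # one token per large patch

# Global stage stand-in: one linear layer per token. A real DiT backbone
# would apply attention across all tokens here.
w_in = rng.standard_normal((P * P * C, D)) * 0.02
features = np.tanh(tokens @ w_in)             # (N, D) contextual features

# Local stage stand-in: a per-patch head that sees the patch's feature and
# (by assumption) its noisy pixels, and predicts all pixels of that patch.
w_head = rng.standard_normal((D + P * P * C, P * P * C)) * 0.02
head_in = np.concatenate([features, tokens], axis=1)
pred = unpatchify(head_in @ w_head, P, H, W, C)

print(tokens.shape, features.shape, pred.shape)
print(num_tokens(H, W, P), "tokens at patch size", P)
print(num_tokens(H, W, 2), "tokens at patch size 2")
```

Under these assumed sizes, the backbone processes 256 tokens instead of the 16,384 it would see at patch size 2, which is where the efficiency of operating on large patches comes from; the per-patch head is then responsible for the detail inside each patch.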

Related benchmarks

Task	Dataset	Result
Class-conditional Image Generation	ImageNet 256x256	Inception Score (IS): 282.9	1021
Image Generation	ImageNet 256x256	IS: 281.9	606
Class-conditional Image Generation	ImageNet 256x256 (val)	Inception Score (IS): 282.9	535
Image Generation	ImageNet 512x512 (val)	FID-50K: 2.31	219
Class-conditional Image Generation	ImageNet 256x256 (train val)	--	203
Class-conditional generation	ImageNet 512x512 (test)	FID: 2.31	48
