
Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion

About

Latent diffusion models have become the popular choice for scaling up diffusion models for high-resolution image synthesis. Compared to pixel-space models that are trained end-to-end, latent models are perceived to be more efficient and to produce higher image quality at high resolution. Here we challenge these notions, and show that pixel-space models can be competitive with latent models in both quality and efficiency, achieving 1.5 FID on ImageNet512 and new SOTA results on ImageNet128, ImageNet256 and Kinetics600. We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions. 1: Use the sigmoid loss-weighting (Kingma & Gao, 2023) with our prescribed hyper-parameters. 2: Use our simplified memory-efficient architecture with fewer skip-connections. 3: Scale the model to favor processing the image at a high resolution with fewer parameters, rather than using more parameters at a lower resolution. Combining these with guidance intervals, we obtain a family of pixel-space diffusion models we call Simpler Diffusion (SiD2).
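
To make step 1 of the recipe concrete, here is a minimal sketch of a sigmoid loss weighting in the style of Kingma & Gao (2023). It assumes the weight takes the form sigmoid(bias - logsnr), where logsnr is the log signal-to-noise ratio of the sampled noise level, and that it multiplies a mean-squared error on the predicted noise. The function names and the default bias of 0.0 are illustrative placeholders, not the hyper-parameters prescribed by the paper, and any extra factors that come from the noise schedule are left out.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_loss_weight(logsnr: float, bias: float = 0.0) -> float:
    """Weight for one noise level: sigmoid(bias - logsnr).

    `bias` is a placeholder value (assumption), not the paper's setting.
    Noisier inputs (low logsnr) get weights near 1, clean inputs near 0.
    """
    return sigmoid(bias - logsnr)


def weighted_eps_loss(eps_pred, eps_true, logsnr: float, bias: float = 0.0) -> float:
    """Sigmoid-weighted mean-squared error between predicted and true noise."""
    mse = sum((p - t) ** 2 for p, t in zip(eps_pred, eps_true)) / len(eps_true)
    return sigmoid_loss_weight(logsnr, bias) * mse


if __name__ == "__main__":
    # Print the weight at a few noise levels, from very noisy to nearly clean.
    for logsnr in (-10.0, -2.0, 0.0, 2.0, 10.0):
        print(logsnr, sigmoid_loss_weight(logsnr))
```

Shifting the bias moves the point at which the weight crosses 0.5, which controls how much of the training signal comes from noisier versus cleaner inputs.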

Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, Tim Salimans • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Class-conditional Image Generation | ImageNet 256x256 | -- | 441 |
| Image Generation | ImageNet 256x256 (val) | FID 1.38 | 307 |
| Class-conditional Image Generation | ImageNet 256x256 (train) | -- | 305 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | FID 1.38 | 293 |
| Image Generation | ImageNet 256x256 | FID 1.38 | 243 |
| Image Generation | ImageNet 512x512 (val) | FID-50K 1.48 | 184 |
| Class-conditional Image Generation | ImageNet 256x256 (train val) | FID 1.38 | 178 |
| Class-conditional Image Generation | ImageNet 512x512 (val) | -- | 69 |
| Class-conditional Image Generation | ImageNet 512x512 (train) | FID 1.5 | 52 |
| Image Generation | ImageNet 256x256 (test val) | -- | 35 |
Showing 10 of 17 rows
