Cascaded Diffusion Models for High Fidelity Image Generation
About
We show that cascaded diffusion models are capable of generating high fidelity images on the class-conditional ImageNet generation benchmark, without any assistance from auxiliary image classifiers to boost sample quality. A cascaded diffusion model comprises a pipeline of multiple diffusion models that generate images of increasing resolution, beginning with a standard diffusion model at the lowest resolution, followed by one or more super-resolution diffusion models that successively upsample the image and add higher resolution details. We find that the sample quality of a cascading pipeline relies crucially on conditioning augmentation, our proposed method of data augmentation of the lower resolution conditioning inputs to the super-resolution models. Our experiments show that conditioning augmentation prevents compounding error during sampling in a cascaded model, helping us to train cascading pipelines achieving FID scores of 1.48 at 64x64, 3.52 at 128x128 and 4.88 at 256x256 resolutions, outperforming BigGAN-deep, and classification accuracy scores of 63.02% (top-1) and 84.06% (top-5) at 256x256, outperforming VQ-VAE-2.
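To make conditioning augmentation concrete, here is a minimal sketch of how a training pair for one super-resolution stage might be built. It is an illustration under assumptions, not the authors' implementation: the abstract does not say what form the augmentation takes, so additive Gaussian noise with a randomly drawn strength is assumed here. The function names (`downsample`, `augment_conditioning`, `make_training_pair`) and the `max_noise_std` parameter are hypothetical, and images are plain nested lists of grayscale values to keep the example dependency-free.

```python
import random


def downsample(image, factor):
    """Average-pool a 2D grayscale image (list of rows) by an integer factor."""
    h, w = len(image), len(image[0])
    return [
        [
            sum(
                image[y * factor + dy][x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ) / factor ** 2
            for x in range(w // factor)
        ]
        for y in range(h // factor)
    ]


def augment_conditioning(low_res, noise_std, rng):
    """Add zero-mean Gaussian noise to every pixel of the conditioning image.

    Assumption: Gaussian noise stands in for whatever augmentation is used.
    """
    return [[p + rng.gauss(0.0, noise_std) for p in row] for row in low_res]


def make_training_pair(high_res, factor, max_noise_std, rng):
    """Return (augmented low-res conditioning input, noise level, high-res target)."""
    low_res = downsample(high_res, factor)
    # Hypothetical choice: draw a fresh augmentation strength for each example.
    noise_std = rng.uniform(0.0, max_noise_std)
    return augment_conditioning(low_res, noise_std, rng), noise_std, high_res


rng = random.Random(0)
image = [[float(x + y) for x in range(4)] for y in range(4)]
conditioning, noise_std, target = make_training_pair(image, 2, 0.5, rng)
```

The idea is that a super-resolution model trained only on clean downsampled images may never see the imperfections present in samples from the previous stage. Perturbing the conditioning input during training is meant to narrow that gap, which is how the abstract describes compounding error being prevented.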
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS) | 158.7 | 441 |
| Image Generation | ImageNet 256x256 (val) | FID | 4.88 | 307 |
| Class-conditional Image Generation | ImageNet 256x256 (train) | IS | 158.7 | 305 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | FID | 4.63 | 293 |
| Image Generation | ImageNet 256x256 | FID | 4.88 | 243 |
| Class-conditional Image Generation | ImageNet 256x256 (train val) | FID | 4.88 | 178 |
| Class-conditional Image Generation | ImageNet 256x256 (test) | FID | 4.88 | 167 |
| Class-conditional Image Generation | ImageNet 64x64 | FID | 1.48 | 126 |
| Image Generation | ImageNet 256x256 (train) | FID | 4.88 | 91 |
| Image Generation | ImageNet 64x64 (train val) | FID | 1.48 | 83 |