Diffusion Probabilistic Model Made Slim
About
Despite the visually pleasing results achieved recently, the massive computational cost has been a long-standing flaw of diffusion probabilistic models (DPMs), which in turn greatly limits their applications on resource-limited platforms. Prior methods for efficient DPMs, however, have largely focused on accelerating sampling while overlooking their huge complexity and model sizes. In this paper, we make a dedicated attempt to lighten DPMs while striving to preserve their favourable performance. We start by training a small-sized latent diffusion model (LDM) from scratch, but observe a significant fidelity drop in the synthetic images. Through a thorough assessment, we find that DPMs are intrinsically biased against high-frequency generation and learn to recover different frequency components at different time-steps. These properties make compact networks unable to represent frequency dynamics with accurate high-frequency estimation. To this end, we introduce a customized design for slim DPMs, which we term Spectral Diffusion (SD), for light-weight image synthesis. SD incorporates wavelet gating in its architecture to enable frequency-dynamic feature extraction at every reverse step, and conducts spectrum-aware distillation to promote high-frequency recovery by inversely weighting the objective based on spectrum magnitudes. Experimental results demonstrate that SD achieves an 8-18x reduction in computational complexity compared to latent diffusion models on a series of conditional and unconditional image generation tasks, while retaining competitive image fidelity.
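To make the two ingredients concrete, the following is a minimal, illustrative sketch (not the authors' code): a one-level Haar wavelet transform separates a signal into low- and high-frequency bands, and a distillation loss then weights each band's error inversely by the teacher band's mean magnitude, so the small high-frequency components are not drowned out by the dominant low frequencies. The function names `haar_1d` and `spectrum_aware_loss` are hypothetical.

```python
# Illustrative sketch of spectrum-aware loss weighting; names and the 1-D
# Haar decomposition are simplifying assumptions, not the paper's implementation.

def haar_1d(x):
    """One-level Haar wavelet transform: returns (low, high) coefficient bands."""
    low = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    high = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    return low, high

def spectrum_aware_loss(teacher, student, eps=1e-8):
    """Band-wise squared error, each band inversely weighted by the teacher
    band's mean magnitude, so weak high-frequency detail gets amplified."""
    loss = 0.0
    for t_band, s_band in zip(haar_1d(teacher), haar_1d(student)):
        mag = sum(abs(c) for c in t_band) / max(len(t_band), 1) + eps
        loss += sum((t - s) ** 2 for t, s in zip(t_band, s_band)) / mag
    return loss
```

In this toy form, a mismatch in the (typically small-magnitude) high-frequency band incurs a larger penalty than an equal-sized mismatch in the low-frequency band, which mirrors the paper's goal of promoting high-frequency recovery in a compact student network.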
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Class-conditional Image Generation | ImageNet 256x256 (train) | -- | 305 |
| Image Generation | LSUN Bedroom 256x256 (test) | FID 5.2 | 73 |
| Image Generation | LSUN Church 256x256 (test) | FID 8.4 | 55 |
| Unconditional Image Generation | FFHQ 256x256 (test) | FID 10.5 | 25 |
| Unconditional Image Synthesis | CelebA-HQ 256x256 (test) | FID 9.3 | 22 |
| Class-conditional Image Generation | ImageNet (train) | FID 10.6 | 7 |
| Text-to-Image Generation | MS-COCO 50k descriptions randomly selected (train) | FID 18.43 | 5 |