Diffusion Autoencoders: Toward a Meaningful and Decodable Representation
About
Diffusion probabilistic models (DPMs) have achieved image-generation quality that rivals that of GANs. Unlike GANs, however, DPMs use a set of latent variables that lack semantic meaning and cannot serve as a useful representation for other tasks. This paper explores the possibility of using DPMs for representation learning and seeks to extract a meaningful and decodable representation of an input image via autoencoding. Our key idea is to use a learnable encoder to discover the high-level semantics, and a DPM as the decoder to model the remaining stochastic variations. Our method can encode any image into a two-part latent code: the first part is semantically meaningful and linear, while the second part captures stochastic details, allowing near-exact reconstruction. This capability enables challenging applications that currently foil GAN-based methods, such as attribute manipulation on real images. We also show that this two-level encoding improves denoising efficiency and naturally facilitates various downstream tasks, including few-shot conditional sampling. Please visit our project page: https://Diff-AE.github.io/
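The two-part latent code described above can be sketched in a few lines. The following toy NumPy example is only an illustration of the idea, not the paper's model: the semantic encoder is a hypothetical fixed linear projection standing in for the learned CNN encoder, and `denoiser` is a made-up linear stand-in for the conditional DPM's noise predictor. What it shows is the structure: `z_sem` is the compact semantic half, and the stochastic half `x_T` is obtained by running the conditional generative process forward deterministically (DDIM-style), so that running it in reverse gives near-exact reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the real model works on image tensors).
D_IMG, D_SEM = 16, 4

# Hypothetical semantic encoder: a fixed linear map standing in for the
# paper's learned encoder that produces the semantic code z_sem.
W_enc = rng.standard_normal((D_SEM, D_IMG)) / np.sqrt(D_IMG)

def encode_semantic(x):
    """First half of the latent code: compact, semantic z_sem."""
    return W_enc @ x

def denoiser(x_t, z_sem, t):
    """Toy stand-in for the conditional DPM's noise predictor.

    A real diffusion autoencoder uses a U-Net conditioned on z_sem;
    here a small linear rule keeps the example self-contained.
    """
    return 0.01 * x_t + 0.05 * (W_enc.T @ z_sem)

def encode_stochastic(x0, z_sem, steps=10):
    """Second half of the latent code: deterministically map x0 to x_T
    by running the (toy) diffusion process forward, DDIM-inversion style."""
    x = x0.copy()
    for t in range(steps):
        x = x + denoiser(x, z_sem, t)   # deterministic forward step
    return x

def decode(x_T, z_sem, steps=10):
    """Reverse the deterministic process to reconstruct the image.

    Like DDIM, the reverse step evaluates the denoiser at the current
    iterate, so reconstruction is near-exact rather than exact.
    """
    x = x_T.copy()
    for t in reversed(range(steps)):
        x = x - denoiser(x, z_sem, t)   # deterministic reverse step
    return x
```

Usage follows the abstract's pipeline: `z_sem = encode_semantic(x0)`, then `x_T = encode_stochastic(x0, z_sem)` gives the full two-part code `(z_sem, x_T)`, and `decode(x_T, z_sem)` recovers `x0` up to a small residual. Editing only `z_sem` while keeping `x_T` fixed is what enables attribute manipulation on real images in the actual model.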
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Generation | CelebA 64x64 (test) | FID | 22.7 | 203 |
| Unconditional Image Generation | CelebA unconditional 64x64 | FID | 4.97 | 95 |
| Unconditional Image Generation | FFHQ 256x256 | FID | 5.81 | 64 |
| Image Reconstruction | CelebA-HQ (test) | -- | -- | 50 |
| Image Generation | CelebA (test) | FID | 22.7 | 49 |
| Image Generation | FFHQ 256x256 (test) | FID | 5.81 | 30 |
| Image Reconstruction | FFHQ No glasses | LPIPS | 0.014 | 18 |
| Image Reconstruction | FFHQ Glasses | LPIPS | 0.014 | 18 |
| Disentanglement | CelebA-HQ (test) | Disentanglement | 64.39 | 13 |
| Image Classification | CelebA-HQ (test) | F1 Score | 68.7 | 13 |