Back to Basics: Let Denoising Generative Models Denoise
About
Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
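To make the distinction between prediction targets concrete, here is a minimal sketch in plain Python. It assumes a linear noising schedule `z_t = t*x + (1-t)*eps`, with `t = 1` meaning clean data; the abstract does not state this schedule, so treat it as an illustrative assumption. All function names are hypothetical and are not taken from the authors' code. The sketch also computes how many raw pixel values one patch token carries at the patch sizes mentioned above, assuming 3-channel RGB images.

```python
import random


def patch_dim(patch_size, channels=3):
    """Number of raw pixel values in one square patch (one token)."""
    return patch_size * patch_size * channels


def noisy_sample(x, eps, t):
    """Assumed schedule: z_t = t*x + (1-t)*eps, with t=1 meaning clean data."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]


def eps_from_x(z, x_pred, t):
    """Noise implied by a clean-data prediction under the assumed schedule."""
    return [(zi - t * xi) / (1.0 - t) for zi, xi in zip(z, x_pred)]


def v_from_x(z, x_pred, t):
    """Velocity (x - eps) implied by a clean-data prediction."""
    return [(xi - zi) / (1.0 - t) for zi, xi in zip(z, x_pred)]


rng = random.Random(0)
d = patch_dim(16)                           # one 16x16 RGB patch
x = [rng.uniform(-1.0, 1.0) for _ in range(d)]   # stand-in for clean pixels
eps = [rng.gauss(0.0, 1.0) for _ in range(d)]    # Gaussian noise
t = 0.3
z = noisy_sample(x, eps, t)

# With a perfect clean-data prediction (x_pred == x), the implied noise and
# velocity match the true ones up to floating-point error.
eps_rec = eps_from_x(z, x, t)
v_rec = v_from_x(z, x, t)
```

Under this assumed schedule the three targets are algebraically interchangeable once `z_t` and `t` are known. The abstract's argument is about which quantity the network is asked to output directly: the clean image `x`, or a noised quantity such as `eps` or `x - eps`.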
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Generation | ImageNet 256x256 (val) | FID 1.86 | 307 |
| Class-conditional Image Generation | ImageNet 256x256 (train) | IS 292.6 | 305 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | FID 1.86 | 293 |
| Image Generation | ImageNet 256x256 | FID 1.82 | 243 |
| Class-conditional generation | ImageNet 256 x 256 1k (val) | FID 1.82 | 67 |
| Class-conditional Image Generation | ImageNet-1K 256x256 1.0 (train) | gFID 1.86 | 35 |
| Image Generation | ImageNet 256x256 (test val) | -- | 35 |
| Image Generation | ImageNet 512x512 | FID 1.78 | 34 |
| Image Generation | ImageNet 256x256 (train val) | FID 3.66 | 34 |
| Unconditional Image Generation | ImageNet-1K 256x256 (val) | -- | 14 |