Three Creates All: You Only Sample 3 Steps
About
Diffusion models deliver high-fidelity generation but remain slow at inference time because sampling requires many sequential network evaluations. We find that standard timestep conditioning becomes a key bottleneck for few-step sampling. Motivated by layer-dependent denoising dynamics, we propose Multi-layer Time Embedding Optimization (MTEO), which freezes the pretrained diffusion backbone and distills a small set of step-wise, layer-wise time embeddings from reference trajectories. MTEO is plug-and-play with existing ODE solvers, adds no inference-time overhead, and trains only a tiny fraction of the parameters. Extensive experiments across diverse datasets and backbones show that MTEO achieves state-of-the-art performance in few-step sampling and substantially narrows the gap between distillation-based and lightweight methods. Code will be available.
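The idea in the abstract can be illustrated with a small, self-contained toy. This is only a sketch under stated assumptions, not the authors' implementation: the scalar two-layer `backbone`, its `FROZEN_WEIGHTS`, the `sin` embedding, the squared-error trajectory loss, and the finite-difference optimizer (standing in for backpropagation) are all hypothetical choices made for illustration. What it does mirror is the structure described above: the backbone weights stay fixed, a many-step solver produces reference trajectories, and the only trainable quantity is a table with one embedding per sampling step and per layer.

```python
import math

# Toy stand-in for a pretrained backbone: two layers with fixed weights.
FROZEN_WEIGHTS = (0.8, -0.5)  # hypothetical values; never updated below


def backbone(x, layer_embs):
    """Predict a velocity for scalar state x, given one time embedding per layer."""
    h = x
    for w, e in zip(FROZEN_WEIGHTS, layer_embs):
        h = math.tanh(w * h + e)
    return h


def shared_embedding(t):
    """Standard conditioning: every layer receives the same embedding of t."""
    return [math.sin(t)] * len(FROZEN_WEIGHTS)


def time_grid(n):
    """n uniform steps from t=1 (noise) down to t=0 (data)."""
    return [1.0 - i / n for i in range(n + 1)]


def euler_sample(x0, step_embs, times):
    """Euler solve along `times`; step i uses the embeddings in step_embs[i]."""
    xs = [x0]
    for i, embs in enumerate(step_embs):
        dt = times[i + 1] - times[i]
        xs.append(xs[-1] + dt * backbone(xs[-1], embs))
    return xs


def distill_loss(step_embs, starts, targets, times):
    """Mean squared gap between few-step states and the reference states."""
    total = 0.0
    for x0, ref in zip(starts, targets):
        xs = euler_sample(x0, step_embs, times)
        total += sum((a - b) ** 2 for a, b in zip(xs[1:], ref))
    return total / len(starts)


def optimize_embeddings(init, starts, targets, times, iters=200, lr=0.5, eps=1e-5):
    """Gradient descent on the embedding table only (finite-difference gradients)."""
    embs = [list(row) for row in init]
    best = distill_loss(embs, starts, targets, times)
    for _ in range(iters):
        grads = []
        for row in embs:
            grad_row = []
            for j in range(len(row)):
                keep = row[j]
                row[j] = keep + eps
                up = distill_loss(embs, starts, targets, times)
                row[j] = keep - eps
                down = distill_loss(embs, starts, targets, times)
                row[j] = keep  # restore the entry exactly
                grad_row.append((up - down) / (2 * eps))
            grads.append(grad_row)
        trial = [[e - lr * g for e, g in zip(er, gr)] for er, gr in zip(embs, grads)]
        trial_loss = distill_loss(trial, starts, targets, times)
        if trial_loss < best:
            embs, best = trial, trial_loss  # accept the step
        else:
            lr *= 0.5  # reject the step and shrink the step size
    return embs, best


NUM_STEPS, REF_STEPS = 3, 300
starts = [-1.0, 0.3, 1.2]  # hypothetical initial noise samples

# Reference trajectories: many-step Euler with the standard shared embedding.
fine = time_grid(REF_STEPS)
fine_embs = [shared_embedding(t) for t in fine[:-1]]
stride = REF_STEPS // NUM_STEPS
targets = [euler_sample(x0, fine_embs, fine)[stride::stride] for x0 in starts]

# Few-step sampler: start from the standard embeddings, then tune per step and layer.
coarse = time_grid(NUM_STEPS)
init = [shared_embedding(t) for t in coarse[:-1]]
initial_loss = distill_loss(init, starts, targets, coarse)
learned, final_loss = optimize_embeddings(init, starts, targets, coarse)

print(f"3-step loss before: {initial_loss:.3e}, after: {final_loss:.3e}")
```

In this toy, sampling with the tuned table is just `euler_sample(x0, learned, coarse)`: the solver and the backbone are untouched, and the learned entries are looked up in place of the computed embedding, which is the sense in which such a scheme adds no work at inference time. The table holds `NUM_STEPS * len(FROZEN_WEIGHTS)` values here; in a real model each entry would be an embedding vector rather than a scalar.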
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Generation | ImageNet 256x256 | IS 211 | 359 |
| Image Generation | CIFAR-10 | FID 2.5 | 203 |
| Text-to-Image Generation | MS-COCO (val) | FID 12.92 | 202 |
| Image Generation | LSUN bedroom | FID 4.44 | 105 |
| Image Generation | ImageNet 64 | FID 3.81 | 100 |
| Image Generation | FFHQ | FID 2.99 | 70 |
| Image Generation | CIFAR-10 | FID 2.5 | 16 |