Three Creates All: You Only Sample 3 Steps

About

Diffusion models deliver high-fidelity generation but remain slow at inference time because sampling requires many sequential network evaluations. We find that standard timestep conditioning becomes a key bottleneck for few-step sampling. Motivated by layer-dependent denoising dynamics, we propose Multi-layer Time Embedding Optimization (MTEO), which freezes the pretrained diffusion backbone and distills a small set of step-wise, layer-wise time embeddings from reference trajectories. MTEO is plug-and-play with existing ODE solvers, adds no inference-time overhead, and trains only a tiny fraction of the parameters. Extensive experiments across diverse datasets and backbones show state-of-the-art performance in few-step sampling and substantially narrow the gap between distillation-based and lightweight methods. Code will be made available.
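The idea in the abstract (freeze the backbone, then learn a separate time embedding for every sampling step and every layer by matching a many-step reference trajectory) can be illustrated with a toy sketch. This is not the authors' implementation: the two-layer scalar "backbone", the `sin(t)` embedding, the squared-error trajectory loss, the finite-difference gradients, and every name below are assumptions chosen only to keep the example self-contained in plain Python.

```python
import math

# Hypothetical frozen "backbone": two scalar layers whose weights never change.
WEIGHTS = (0.8, -0.5)
NUM_STEPS = 3          # student sampling budget
TEACHER_STEPS = 300    # many-step reference solver
STARTS = [-1.0, -0.3, 0.4, 1.2]  # toy initial noise samples

def backbone(x, layer_embs):
    """Toy velocity prediction; layer_embs[l] conditions layer l."""
    h = x
    for w, e in zip(WEIGHTS, layer_embs):
        h = math.tanh(w * h + e)
    return h

def shared_emb(t):
    """Standard conditioning (assumed form): one embedding shared by all layers."""
    return [math.sin(t)] * len(WEIGHTS)

def euler_trajectory(x, embs_per_step):
    """Euler solve from t=1 to t=0; returns the state after every step."""
    dt = -1.0 / len(embs_per_step)
    states = []
    for embs in embs_per_step:
        x = x + dt * backbone(x, embs)
        states.append(x)
    return states

# Reference trajectories, sampled at the student's step boundaries (t = 2/3, 1/3, 0).
teacher_embs = [shared_emb(1.0 - k / TEACHER_STEPS) for k in range(TEACHER_STEPS)]
stride = TEACHER_STEPS // NUM_STEPS
references = [euler_trajectory(x0, teacher_embs)[stride - 1::stride] for x0 in STARTS]

def loss(embs_per_step):
    """Mean squared distance between student and reference states."""
    total = 0.0
    for x0, ref in zip(STARTS, references):
        states = euler_trajectory(x0, embs_per_step)
        total += sum((s - r) ** 2 for s, r in zip(states, ref))
    return total / len(STARTS)

# Trainable parameters: one embedding per (step, layer), initialised from the
# shared embedding at the student's timesteps.
embs = [shared_emb(1.0 - k / NUM_STEPS) for k in range(NUM_STEPS)]
initial_loss = loss(embs)

LR, EPS = 0.2, 1e-4
for _ in range(300):
    # Central finite differences stand in for backpropagation in this sketch.
    grads = []
    for k in range(NUM_STEPS):
        row = []
        for l in range(len(WEIGHTS)):
            base = embs[k][l]
            embs[k][l] = base + EPS
            up = loss(embs)
            embs[k][l] = base - EPS
            down = loss(embs)
            embs[k][l] = base
            row.append((up - down) / (2 * EPS))
        grads.append(row)
    # Only the embeddings are updated; WEIGHTS stays untouched.
    for k in range(NUM_STEPS):
        for l in range(len(WEIGHTS)):
            embs[k][l] -= LR * grads[k][l]

trained_loss = loss(embs)
print(f"trajectory loss: {initial_loss:.2e} -> {trained_loss:.2e}")
```

In this sketch the learned table `embs` simply replaces the shared embedding at sampling time, so the 3-step solver performs the same number of network evaluations as before, which mirrors the "no inference-time overhead" claim under these toy assumptions.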

Yuren Cai, Guangyi Wang, Zongqing Li, Li Li, Zhihui Liu, Songzhi Su • 2026

Related benchmarks

Task	Dataset	Result
Image Generation	ImageNet 256x256	IS 211	359
Image Generation	CIFAR-10	FID 2.5	203
Text-to-Image Generation	MS-COCO (val)	FID 12.92	202
Image Generation	LSUN bedroom	FID 4.44	105
Image Generation	ImageNet 64	FID 3.81	100
Image Generation	FFHQ	FID 2.99	70
Image Generation	CIFAR-10	FID 2.5	16

