DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
About
Diffusion probabilistic models (DPMs) have achieved impressive success in high-resolution image synthesis, especially in recent large-scale text-to-image generation applications. An essential technique for improving the sample quality of DPMs is guided sampling, which usually needs a large guidance scale to obtain the best sample quality. The commonly used fast sampler for guided sampling is DDIM, a first-order diffusion ODE solver that generally needs 100 to 250 steps for high-quality samples. Although recent works propose dedicated high-order solvers and achieve a further speedup for sampling without guidance, their effectiveness for guided sampling has not been well tested. In this work, we demonstrate that previous high-order fast samplers suffer from instability issues, and that they even become slower than DDIM when the guidance scale grows large. To further speed up guided sampling, we propose DPM-Solver++, a high-order solver for the guided sampling of DPMs. DPM-Solver++ solves the diffusion ODE with the data prediction model and adopts thresholding methods to keep the solution consistent with the training data distribution. We further propose a multistep variant of DPM-Solver++ that addresses the instability issue by reducing the effective step size. Experiments show that DPM-Solver++ can generate high-quality samples within only 15 to 20 steps for guided sampling by pixel-space and latent-space DPMs.
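The three ingredients named in the abstract (a data prediction model, thresholding, and an ODE solver step) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the usual forward process `x_t = alpha_t * x0 + sigma_t * eps`, treats a sample as a non-empty flat list of floats, and uses a percentile-based dynamic threshold as one possible thresholding method. The function names and the `quantile` parameter are hypothetical, and only a first-order update is shown, not the multistep variant.

```python
import math


def eps_to_x0(x_t, eps, alpha_t, sigma_t):
    """Turn a noise prediction into a data prediction.

    Assumes x_t = alpha_t * x0 + sigma_t * eps (an assumption of this sketch).
    """
    return [(x - sigma_t * e) / alpha_t for x, e in zip(x_t, eps)]


def dynamic_threshold(x0, quantile=0.995):
    """Clip a predicted sample to [-s, s] and rescale it by s.

    s is a high quantile of |x0|, floored at 1 so that predictions already
    inside [-1, 1] are left unchanged. Assumes x0 is non-empty.
    """
    mags = sorted(abs(v) for v in x0)
    idx = min(len(mags) - 1, int(quantile * len(mags)))
    s = max(mags[idx], 1.0)
    return [max(-s, min(s, v)) / s for v in x0]


def first_order_step(x_s, x0_pred, alpha_s, sigma_s, alpha_t, sigma_t):
    """One first-order data-prediction update from time s to time t.

    Uses the half log-SNR lambda = log(alpha / sigma) and h = lambda_t - lambda_s.
    """
    h = math.log(alpha_t / sigma_t) - math.log(alpha_s / sigma_s)
    decay = sigma_t / sigma_s
    gain = -alpha_t * math.expm1(-h)
    return [decay * x + gain * d for x, d in zip(x_s, x0_pred)]
```

A useful sanity check on the update: if the data prediction is exact, the step lands exactly on `alpha_t * x0 + sigma_t * eps`, so all remaining error comes from the model's prediction. Thresholding the data prediction before the update is what keeps large guidance scales from pushing the sample outside the range of the training data.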
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Generation | ImageNet 256x256 | IS 320.8 | 359 |
| Text-to-Image Generation | Stable Diffusion V1.4 | RMSE Loss 0.078 | 280 |
| Unconditional Image Generation | CIFAR-10 | FID 3.88 | 240 |
| Image Generation | CIFAR-10 | FID 2.91 | 203 |
| Text-to-Image Generation | MS-COCO (val) | FID 15.85 | 202 |
| Image Generation | ImageNet 256 10k samples | FID 7.27 | 165 |
| Class-conditional Image Generation | ImageNet 128x128 | FID 4.3 | 155 |
| Image Generation | CIFAR-10 32x32 | FID 3.08 | 147 |
| Text-to-Image Generation | Stable Diffusion 1.4 | CLIP Cosine Similarity 0.98 | 140 |
| Unconditional Image Generation | CIFAR-10 32x32 (test) | FID 2.99 | 137 |