On Distillation of Guided Diffusion Models

About

Classifier-free guided diffusion models have recently been shown to be highly effective at high-resolution image generation, and they are widely used in large-scale diffusion frameworks including DALLE-2, Stable Diffusion, and Imagen. A downside of these models is that they are computationally expensive at inference time, since they require evaluating two diffusion models, a class-conditional model and an unconditional model, tens to hundreds of times. To address this limitation, we propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from: given a pre-trained classifier-free guided model, we first learn a single model to match the output of the combined conditional and unconditional models, and then progressively distill that model into a diffusion model that requires far fewer sampling steps. For standard diffusion models trained in pixel space, our approach generates images visually comparable to those of the original model using as few as 4 sampling steps on ImageNet 64x64 and CIFAR-10, achieving FID/IS scores comparable to those of the original model while being up to 256 times faster to sample from. For diffusion models trained in latent space (e.g., Stable Diffusion), our approach generates high-fidelity images using as few as 1 to 4 denoising steps, accelerating inference by at least 10-fold compared to existing methods on ImageNet 256x256 and LAION datasets. We further demonstrate the effectiveness of our approach on text-guided image editing and inpainting, where our distilled model generates high-quality results using as few as 2-4 denoising steps.
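The first distillation stage described in the abstract can be illustrated with a minimal sketch. It assumes the standard classifier-free guidance combination, (1 + w) * conditional - w * unconditional, with guidance weight w, and a plain mean-squared-error matching loss; the function names, the array stand-ins for model outputs, and the unweighted loss are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def guided_prediction(cond_pred, uncond_pred, w):
    """Combine the two teacher outputs with classifier-free guidance.

    Assumes the standard form (1 + w) * cond - w * uncond, where w is the
    guidance weight; w = 0 recovers the purely conditional prediction.
    """
    return (1.0 + w) * cond_pred - w * uncond_pred


def stage_one_loss(student_pred, cond_pred, uncond_pred, w):
    """Hypothetical stage-one objective: a single student network is trained
    to reproduce the guided combination of the two teacher outputs.

    Plain MSE is used here for simplicity; any per-timestep weighting is
    omitted in this sketch.
    """
    target = guided_prediction(cond_pred, uncond_pred, w)
    return float(np.mean((student_pred - target) ** 2))


# Toy arrays standing in for model outputs at one noise level.
cond = np.array([1.0, 2.0])
uncond = np.array([0.0, 1.0])
target = guided_prediction(cond, uncond, w=2.0)
loss_at_target = stage_one_loss(target, cond, uncond, w=2.0)
```

Once a student matches this target, each sampling step needs one network evaluation instead of two, and the second stage (progressive distillation) then reduces the number of steps.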

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, Tim Salimans • 2022

Related benchmarks

Task	Dataset	Metric	Value	(unlabeled)
Image Generation	CIFAR-10	FID	5.98	212
Class-conditional Image Generation	ImageNet 64x64	FID	7.54	170
Conditional Image Generation	CIFAR10 (test)	FID	7.34	92
Text-to-Image Generation	MS COCO zero-shot	FID	37.3	64
Text-to-Image Generation	MS-COCO 5K 2017 (val)	FID	26.9	34
Class-conditional Image Generation	ImageNet 64x64 (train test)	FID	2.05	30
Image Generation	ImageNet 64x64 (train)	FID	7.54	21
Text-to-Image Generation	LAION-Aesthetic 6.5+ (test)	FID	14.12	20
Text-to-Image Generation	MSCOCO 2017 (5k)	FID (5k)	26	9
Text-to-Image Generation	LAION-Aesthetic 6+	FID (1 Step)	108.2	5

Showing 10 of 11 rows
