Spectral Guidance for Flexible and Efficient Control of Diffusion Models

About

We introduce Spectral Guidance, a framework for controlling diffusion models by leveraging the intrinsic geometry of the generative process. As data is progressively corrupted by noise, only a small number of features remain informative for control. We characterize them as the singular functions of a conditional expectation operator and show that they can be learned via a self-supervised objective. Once recovered, this basis enables the projection of arbitrary guidance signals, such as labels, CLIP embeddings, or masks, directly onto the sampling trajectory. This approach allows for stable, high-fidelity control without retraining or denoiser backpropagation during sampling. Empirically, we improve conditional accuracy on CIFAR-10 by 37 percentage points over the strongest training-free baseline while offering $4\times$ faster sampling. Moreover, the same representations that support label and CLIP guidance also enable spatial control, such as mask-based guidance, without auxiliary models. Finally, our framework reveals a phase transition in the generative process, pinpointing the optimal time window for effective guidance.
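To make the operator view in the abstract concrete, here is a minimal toy sketch, not the authors' implementation. It assumes a one-dimensional clean variable on a small grid, a variance-preserving Gaussian noising step discretized onto a grid of noisy values, and an explicit SVD in place of the self-supervised objective the paper describes. All names (`noising_channel`, `spectral_basis`, `project_signal`) are hypothetical. The sketch builds the normalized conditional expectation operator, takes its singular functions, and expands the posterior mean of a binary label in that basis.

```python
import numpy as np

def noising_channel(x0, y, t):
    """Row-stochastic P(y_j | x0_i) for y = sqrt(1-t)*x0 + sqrt(t)*noise (toy assumption)."""
    mean = np.sqrt(1.0 - t) * x0[:, None]
    w = np.exp(-(y[None, :] - mean) ** 2 / (2.0 * t))
    return w / w.sum(axis=1, keepdims=True)

def spectral_basis(px, channel):
    """SVD of the normalized joint B = P(x0, y) / sqrt(p(x0) p(y))."""
    joint = px[:, None] * channel
    py = joint.sum(axis=0)
    B = joint / np.sqrt(px[:, None] * py[None, :])
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    phi = U / np.sqrt(px[:, None])     # singular functions of the clean variable
    psi = Vt.T / np.sqrt(py[:, None])  # singular functions of the noisy variable
    return s, phi, psi, py, joint

def project_signal(g, px, s, phi, psi, rank):
    """Rank-limited expansion of E[g(x0) | y] = sum_k s_k <g, phi_k> psi_k(y)."""
    coeffs = (px * g) @ phi[:, :rank]  # inner products under p(x0)
    return psi[:, :rank] @ (s[:rank] * coeffs)

x0 = np.linspace(-1.0, 1.0, 16)        # clean values (illustrative grid)
y = np.linspace(-2.5, 2.5, 200)        # discretized noisy values
px = np.full(x0.size, 1.0 / x0.size)   # uniform prior over clean values
g = (x0 > 0).astype(float)             # a binary "label" used as guidance signal

results = {}
for name, t in (("low", 0.2), ("high", 0.9)):
    s, phi, psi, py, joint = spectral_basis(px, noising_channel(x0, y, t))
    exact = joint.T @ g / py           # E[g(x0) | y] computed directly
    full = project_signal(g, px, s, phi, psi, rank=s.size)
    rank3 = project_signal(g, px, s, phi, psi, rank=3)
    results[name] = {"s": s, "exact": exact, "full": full,
                     "rank3_err": np.max(np.abs(exact - rank3))}
    print(name, "noise: leading singular values", np.round(s[:4], 3),
          "rank-3 max error", float(results[name]["rank3_err"]))
```

The leading singular value is always 1 and corresponds to the constant function. The remaining values shrink as the noise level grows, which is the toy counterpart of the claim that only a few features stay informative for control. How the paper learns this basis at scale and applies it along the sampling trajectory is not shown here.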

Gabriel Moreira, Manuel Marques, João Paulo Costeira, Chenyan Xiong • 2026

Related benchmarks

Task	Dataset	Result
Class-conditional Image Generation	ImageNet	FID 183	189
Conditional Image Generation	CIFAR-10	FID 70.7	88
Conditional Image Generation	CelebA-HQ Gender+Hair	Accuracy 88.3	15
Conditional Image Generation	CelebA-HQ Gender+Age	Accuracy 91.5	15
Mask-conditioned image generation	CelebA-HQ	IoU 80	8
Text-conditioned image generation (CLIP guidance)	CelebA-HQ	VQAScore 64	8

