Stochastic Self-Guidance for Training-Free Enhancement of Diffusion Models
About
Classifier-free Guidance (CFG) is a widely used technique in modern diffusion models for enhancing sample quality and prompt adherence. However, through an empirical analysis on Gaussian mixture modeling with a closed-form solution, we observe a discrepancy between the suboptimal results produced by CFG and the ground truth. The model's excessive reliance on these suboptimal predictions often leads to semantic incoherence and low-quality outputs. To address this issue, we first empirically demonstrate that the model's suboptimal predictions can be effectively refined using sub-networks of the model itself. Building on this insight, we propose S$^2$-Guidance, a novel method that leverages stochastic block-dropping during the forward process to construct stochastic sub-networks, effectively guiding the model away from potential low-quality predictions and toward high-quality outputs. Extensive qualitative and quantitative experiments on text-to-image and text-to-video generation tasks demonstrate that S$^2$-Guidance delivers superior performance, consistently surpassing CFG and other advanced guidance strategies. Our code will be released.
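The idea of combining CFG with a stochastic sub-network can be illustrated with a small toy sketch. Only the CFG combination rule below is the standard, well-known formula. Everything else is an assumption made for illustration: the abstract does not give the S$^2$-Guidance update rule, so `toy_denoiser`, `sample_keep_mask`, `self_guided_prediction`, the `self_scale` and `drop_ratio` parameters, and the specific way the sub-network prediction is subtracted are hypothetical stand-ins, not the authors' implementation.

```python
import math
import random


def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard CFG: extrapolate from the unconditional toward the conditional prediction."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def toy_denoiser(x, cond, block_gains, keep_mask=None):
    """Hypothetical residual 'network': each block adds tanh(gain * h) to h.

    A block whose keep_mask entry is False is skipped, so the residual
    connection passes h through unchanged (block dropping).
    """
    h = [xi + ci for xi, ci in zip(x, cond)]
    for i, gain in enumerate(block_gains):
        if keep_mask is not None and not keep_mask[i]:
            continue
        h = [hi + math.tanh(gain * hi) for hi in h]
    return h


def sample_keep_mask(num_blocks, drop_ratio, rng):
    """Pick round(drop_ratio * num_blocks) blocks uniformly at random to drop."""
    num_drop = int(round(drop_ratio * num_blocks))
    dropped = set(rng.sample(range(num_blocks), num_drop))
    return [i not in dropped for i in range(num_blocks)]


def self_guided_prediction(x, cond, block_gains, cfg_scale, self_scale, drop_ratio, rng):
    """ASSUMED combination rule (not taken from the paper).

    Start from the CFG prediction, then push away from the prediction of a
    randomly thinned sub-network of the same model.
    """
    null_cond = [0.0] * len(cond)
    eps_uncond = toy_denoiser(x, null_cond, block_gains)
    eps_cond = toy_denoiser(x, cond, block_gains)
    eps_cfg = cfg_combine(eps_uncond, eps_cond, cfg_scale)

    # A fresh mask per call gives a different stochastic sub-network each time.
    mask = sample_keep_mask(len(block_gains), drop_ratio, rng)
    eps_sub = toy_denoiser(x, cond, block_gains, keep_mask=mask)

    return [g + self_scale * (c - s) for g, c, s in zip(eps_cfg, eps_cond, eps_sub)]


if __name__ == "__main__":
    rng = random.Random(0)
    gains = [0.3, -0.2, 0.5, 0.1, -0.4, 0.25]
    x = [0.5, -1.0, 0.2]
    cond = [0.1, 0.3, -0.2]
    print(self_guided_prediction(x, cond, gains, 3.0, 0.5, 0.5, rng))
```

In this sketch, setting `self_scale` to 0 or `drop_ratio` to 0 recovers plain CFG, which makes the extra term easy to ablate. In a real diffusion model the sub-network would come from skipping transformer or U-Net blocks during the denoiser's forward pass rather than from this toy function.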
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS) | 259.1 | 815 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | FID | 2.4 | 427 |
| Text-to-Video Generation | VBench | Quality Score | 84.89 | 155 |
| Class-conditional Image Generation | ImageNet 512x512 (val) | FID (Val) | 6.2 | 97 |
| Text-to-Image Generation | COCO 2014 (val) | Precision | 69.5 | 34 |
| Text-to-Image Generation | HPS v2.1 | Score (Anime) | 31.48 | 30 |
| Text-to-Image Generation | MS COCO 1K | HPSv2.1 | 29.614 | 18 |
| Text-to-Image Generation | LAION 5B 1K | HPSv2.1 | 28.491 | 18 |