Diffusion Self-Distillation for Zero-Shot Customized Image Generation
About
Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Vision-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preserving generation tasks, without requiring test-time optimization.
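The data-generation step described above (generate a grid, split it into panels, keep only the panel pairs a judge accepts) can be illustrated with a toy sketch. This is not the authors' implementation: `split_grid`, `curate_pairs`, `PairedExample`, and the `same_identity` callable are hypothetical names introduced here, images are modelled as plain nested lists, and the judge is a stand-in for whatever Vision-Language Model check the paper actually uses.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, List

# Toy image type: a grayscale image as a list of pixel rows (assumption for illustration).
Image = List[List[int]]


@dataclass
class PairedExample:
    """One (reference image, target image, prompt) training triple."""
    reference: Image
    target: Image
    prompt: str


def split_grid(grid: Image, rows: int, cols: int) -> List[Image]:
    """Cut a grid image into rows * cols equally sized panels, in row-major order."""
    height, width = len(grid), len(grid[0])
    if height % rows or width % cols:
        raise ValueError("grid size must be divisible by the panel layout")
    panel_h, panel_w = height // rows, width // cols
    return [
        [row[c * panel_w:(c + 1) * panel_w]
         for row in grid[r * panel_h:(r + 1) * panel_h]]
        for r in range(rows)
        for c in range(cols)
    ]


def curate_pairs(
    panels: List[Image],
    prompt: str,
    same_identity: Callable[[Image, Image], bool],
) -> List[PairedExample]:
    """Keep every ordered panel pair that the judge accepts.

    `same_identity` is a placeholder for a VLM-based consistency check.
    """
    return [
        PairedExample(reference=a, target=b, prompt=prompt)
        for a, b in permutations(panels, 2)
        if same_identity(a, b)
    ]


if __name__ == "__main__":
    # A 2x4 "grid" holding two 2x2 panels side by side.
    grid = [[1, 1, 9, 9],
            [1, 1, 9, 9]]
    panels = split_grid(grid, rows=1, cols=2)
    # Placeholder judge that accepts everything; a real pipeline would query a VLM.
    pairs = curate_pairs(panels, "a toy prompt", lambda a, b: True)
    print(len(panels), len(pairs))
```

In this sketch the accepted pairs would then serve as (reference, target) supervision for fine-tuning; the fine-tuning step itself is out of scope here.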
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Cinematic Story Generation | ViStoryBench | CSD (Cross) | 0.417 | 24 |
| Personalized Text-to-Image Generation | DreamBench++ Single-subject | CP | 0.513 | 18 |
| Image Personalization | User Study Personalization Tasks | Concept Preservation (CP) | 64.4 | 17 |
| Personalized Text-to-Image Generation | DreamBench++ (test) | CP Score | 3.661 | 8 |
| Multi-object compositing | Multi-object compositing (test) | CLIP-I | 0.65 | 8 |
| Personalized Image Generation | DreamBench++ GPT-4o score evaluation (test) | CP (Animal) | 64.7 | 8 |
| Continuous Story Generation | AnimeBoard-GT | CSD Cross | 0.501 | 7 |
| 3D-conditioned Image Generation | User Study | Faithfulness | 4.145 | 6 |
| Identity-preserving Image Generation | 3D Assets (test) | GPT-eval Texture | 4.842 | 6 |