Null-text Inversion for Editing Real Images using Guided Diffusion Models
About
Recent text-guided diffusion models provide powerful image generation capabilities. Much current effort is devoted to enabling the modification of these images using text alone, as a means of offering intuitive and versatile editing. To edit a real image with these state-of-the-art tools, one must first invert the image, together with a meaningful text prompt, into the pretrained model's domain. In this paper, we introduce an accurate inversion technique and thus facilitate intuitive text-based modification of the image. Our proposed inversion consists of two novel key components: (i) Pivotal inversion for diffusion models. While current methods aim at mapping random noise samples to a single input image, we use a single pivotal noise vector for each timestep and optimize around it. We demonstrate that a direct inversion is inadequate on its own but does provide a good anchor for our optimization. (ii) Null-text optimization, where we modify only the unconditional textual embedding used for classifier-free guidance, rather than the input text embedding. This keeps both the model weights and the conditional embedding intact and hence enables prompt-based editing while avoiding cumbersome tuning of the model's weights. Our Null-text inversion, based on the publicly available Stable Diffusion model, is extensively evaluated on a variety of images and prompt edits, showing high-fidelity editing of real images.
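The two components can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a hypothetical 4-dimensional latent, an elementwise linear noise predictor `predict_noise` in place of the Stable Diffusion UNet, a linear alpha-bar schedule, a guidance scale of 7.5, and plain gradient descent with an analytic gradient that is valid only for this toy model. `ddim_invert` builds the pivotal trajectory (assumed here to be DDIM inversion at guidance scale 1), and `null_text_optimize` walks back from the last pivot, tuning one null embedding per timestep so that each guided DDIM step lands on the next pivot, warm-starting each timestep from the previous one's embedding.

```python
import math
import random

# Toy stand-ins (hypothetical): a 4-dimensional "latent" and an elementwise
# linear noise predictor eps(z, emb) = A*z + B*emb instead of the real UNet.
DIM, T = 4, 10
rng = random.Random(0)
A = [rng.uniform(0.1, 0.5) for _ in range(DIM)]
B = [rng.uniform(0.5, 1.0) for _ in range(DIM)]
# Assumed cumulative schedule alpha_bar[t]: 1.0 at t=0 (clean), 0.1 at t=T.
ALPHA_BAR = [1.0 - 0.9 * t / T for t in range(T + 1)]


def predict_noise(z, emb):
    return [a * zi + b * ei for a, b, zi, ei in zip(A, B, z, emb)]


def cfg_noise(z, cond, null, w):
    """Classifier-free guidance: eps(null) + w * (eps(cond) - eps(null))."""
    e_c, e_u = predict_noise(z, cond), predict_noise(z, null)
    return [u + w * (c - u) for c, u in zip(e_c, e_u)]


def ddim_coeffs(t, s):
    """Coefficients of the deterministic DDIM update z_s = cx * z_t + ce * eps."""
    a_t, a_s = ALPHA_BAR[t], ALPHA_BAR[s]
    cx = math.sqrt(a_s / a_t)
    ce = math.sqrt(a_s) * (math.sqrt(1 / a_s - 1) - math.sqrt(1 / a_t - 1))
    return cx, ce


def ddim_step(z, eps, t, s):
    cx, ce = ddim_coeffs(t, s)
    return [cx * zi + ce * ei for zi, ei in zip(z, eps)]


def ddim_invert(z0, cond):
    """Pivot trajectory z*_0..z*_T from DDIM inversion with guidance scale 1."""
    traj = [list(z0)]
    for t in range(T):
        traj.append(ddim_step(traj[-1], predict_noise(traj[-1], cond), t, t + 1))
    return traj


def null_text_optimize(pivots, cond, null_init, w=7.5, iters=300, lr=0.1):
    """Tune one null embedding per timestep so guided sampling tracks the pivots."""
    z, null, nulls = list(pivots[T]), list(null_init), {}
    for t in range(T, 0, -1):
        _, ce = ddim_coeffs(t, t - 1)
        for _ in range(iters):
            z_prev = ddim_step(z, cfg_noise(z, cond, null, w), t, t - 1)
            # Toy model only: d z_prev[i] / d null[i] = ce * (1 - w) * B[i].
            grad = [2 * (zp - zs) * ce * (1 - w) * b
                    for zp, zs, b in zip(z_prev, pivots[t - 1], B)]
            null = [n - lr * g for n, g in zip(null, grad)]
        nulls[t] = list(null)  # also the warm start for timestep t - 1
        z = ddim_step(z, cfg_noise(z, cond, null, w), t, t - 1)
    return nulls, z


def sample(z_T, cond, nulls, w=7.5):
    """Guided DDIM sampling from z_T with a per-timestep null embedding."""
    z = list(z_T)
    for t in range(T, 0, -1):
        z = ddim_step(z, cfg_noise(z, cond, nulls[t], w), t, t - 1)
    return z


def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


z0 = [rng.gauss(0, 1) for _ in range(DIM)]    # the "real image" latent
cond = [rng.gauss(0, 1) for _ in range(DIM)]  # the source-prompt embedding
empty = [0.0] * DIM                           # the initial null-text embedding

pivots = ddim_invert(z0, cond)
naive = sample(pivots[T], cond, {t: empty for t in range(1, T + 1)})
nulls, recon = null_text_optimize(pivots, cond, empty)
print(f"naive CFG reconstruction error:     {l2(naive, z0):.4f}")
print(f"null-text optimized reconstruction: {l2(recon, z0):.2e}")
```

With the fixed empty embedding, guided sampling from the last pivot drifts away from the input latent; with the optimized per-timestep embeddings it reconstructs the input closely, while the model and the conditional embedding `cond` are never changed. For editing, the same `nulls` would be reused while sampling with a different prompt embedding, e.g. `sample(pivots[T], edited_cond, nulls)`, where `edited_cond` is a hypothetical edited-prompt embedding.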
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Editing | PIE-Bench | PSNR | 27.03 | 116 |
| Subject-driven image generation | DreamBench | DINO Score | 56.9 | 62 |
| Instructive image editing | EMU Edit (test) | CLIP Image Similarity | 0.761 | 46 |
| Image Editing | PIE-Bench (test) | -- | -- | 46 |
| Image Editing | User Study 100 images (test) | User Selection Rate | 65.1 | 32 |
| Image Editing | AnyEdit (test) | CLIP Score (Input) | 0.773 | 28 |
| Dehazing | RESIDE | FID | 39.94 | 25 |
| Image Editing | PIE-Bench 1.0 (test) | PSNR | 30.21 | 22 |
| Image-to-Image Translation (Appearance Divergence) | LAION Mini | Structure Similarity | 95 | 20 |
| Image-to-Image Translation (Appearance Consistency) | LAION Mini | Structure Similarity | 0.947 | 20 |