MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
About
Despite the success in large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent generation and editing results. For example, generation approaches usually fail to synthesize multiple images of the same objects/characters but with different views or poses. Meanwhile, existing editing methods either fail to achieve effective complex non-rigid editing while maintaining the overall textures and identity, or require time-consuming fine-tuning to capture the image-specific appearance. In this paper, we develop MasaCtrl, a tuning-free method to achieve consistent image generation and complex non-rigid image editing simultaneously. Specifically, MasaCtrl converts existing self-attention in diffusion models into mutual self-attention, so that it can query correlated local contents and textures from source images for consistency. To further alleviate the query confusion between foreground and background, we propose a mask-guided mutual self-attention strategy, where the mask can be easily extracted from the cross-attention maps. Extensive experiments show that the proposed MasaCtrl can produce impressive results in both consistent image generation and complex non-rigid real image editing.
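The mutual self-attention idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the single-head and unbatched layout, the way foreground and background outputs are blended, and the `threshold` default are all hypothetical choices made for this example. The key point it shows is that the target image's queries attend to keys and values taken from the source image, optionally restricted by a foreground mask.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v, key_mask=None):
    """Single-head scaled dot-product attention.

    q: (n_q, d), k: (n_k, d), v: (n_k, d_v).
    key_mask: optional boolean (n_k,), True where a key may be attended to.
    Assumption: key_mask, if given, contains at least one True entry.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if key_mask is not None:
        scores = np.where(key_mask[None, :], scores, -np.inf)
    return softmax(scores, axis=-1) @ v


def mutual_self_attention(q_target, k_source, v_source):
    # Ordinary self-attention would use the target's own K and V.
    # Here the target queries read content from the source image instead.
    return attention(q_target, k_source, v_source)


def mask_from_cross_attention(cross_attn_map, threshold=0.1):
    # Hypothetical helper: normalize an averaged cross-attention map for the
    # object token to [0, 1] and threshold it to get a foreground mask.
    a = cross_attn_map - cross_attn_map.min()
    a = a / (a.max() + 1e-8)
    return a > threshold


def mask_guided_mutual_self_attention(q_target, k_source, v_source,
                                      source_fg_mask, target_fg_mask):
    # Foreground queries see only source foreground tokens and background
    # queries see only source background tokens (assumed blending scheme).
    out_fg = attention(q_target, k_source, v_source, key_mask=source_fg_mask)
    out_bg = attention(q_target, k_source, v_source, key_mask=~source_fg_mask)
    return np.where(target_fg_mask[:, None], out_fg, out_bg)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_tokens, dim = 16, 8  # e.g. a flattened 4x4 feature map
    q_t = rng.normal(size=(n_tokens, dim))
    k_s = rng.normal(size=(n_tokens, dim))
    v_s = rng.normal(size=(n_tokens, dim))
    src_mask = mask_from_cross_attention(rng.random(n_tokens), threshold=0.5)
    tgt_mask = mask_from_cross_attention(rng.random(n_tokens), threshold=0.5)
    out = mask_guided_mutual_self_attention(q_t, k_s, v_s, src_mask, tgt_mask)
    print(out.shape)
```

In a real diffusion model the queries, keys, and values would come from the self-attention layers of the denoising U-Net at a given step and layer. Which steps and layers to replace is a design choice that this sketch does not cover.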
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Editing | PIE-Bench | PSNR 22.78 | 116 |
| Image Editing | PIE-Bench (test) | PSNR 22.19 | 46 |
| Semantic Editing | LSUN church | CLIP-Score 0.219 | 28 |
| Image-to-Image Translation (Appearance Divergence) | LAION Mini | Structure Similarity 94.1 | 20 |
| Image-to-Image Translation (Appearance Consistency) | LAION Mini | Structure Similarity 0.937 | 20 |
| Image Semantic Editing | PIE-Bench (test) | PSNR 22.2 | 18 |
| Image Editing | PIE-Bench | Distance 10324.46 | 17 |
| Text-Guided Image Editing | General Image Editing | Speedup 1.12 | 12 |
| Image Editing | SNR-Bench 1.0 (test) | Reward Model Structural Score 3.04 | 12 |
| Image Editing | ImageNet real-edit | CS Score 31.4 | 11 |