Style Aligned Image Generation via Shared Attention
About
Large-scale Text-to-Image (T2I) models have rapidly gained prominence across creative fields, generating visually compelling outputs from textual prompts. However, controlling these models to ensure consistent style remains challenging, with existing methods necessitating fine-tuning and manual intervention to disentangle content and style. In this paper, we introduce StyleAligned, a novel technique designed to establish style alignment among a series of generated images. By employing minimal "attention sharing" during the diffusion process, our method maintains style consistency across images within T2I models. This approach allows for the creation of style-consistent images using a reference style through a straightforward inversion operation. Our method's evaluation across diverse styles and text prompts demonstrates high-quality synthesis and fidelity, underscoring its efficacy in achieving consistent style across various inputs.
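The core idea of attention sharing can be illustrated with a minimal sketch: during a self-attention step, each generated image's queries attend not only over its own keys and values but also over those of a reference image, so stylistic statistics propagate across the batch. The snippet below is an illustrative NumPy toy, not the paper's implementation; the function name and the single-head, unnormalized setup are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref):
    """Single-head attention where the target image's queries attend
    over its own keys/values concatenated with the reference image's
    (a toy version of 'attention sharing' across a batch)."""
    k = np.concatenate([k_tgt, k_ref], axis=0)   # (n_tgt + n_ref, d)
    v = np.concatenate([v_tgt, v_ref], axis=0)   # (n_tgt + n_ref, d)
    d = q_tgt.shape[-1]
    scores = q_tgt @ k.T / np.sqrt(d)            # (n_tgt, n_tgt + n_ref)
    return softmax(scores, axis=-1) @ v          # (n_tgt, d)

# Toy usage: 4 target tokens, 6 reference tokens, feature dim 8.
rng = np.random.default_rng(0)
q_tgt, k_tgt, v_tgt = (rng.standard_normal((4, 8)) for _ in range(3))
k_ref, v_ref = (rng.standard_normal((6, 8)) for _ in range(2))
out = shared_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref)
```

Because the reference keys and values sit in the same softmax as the target's own, the output at every spatial location is a convex mixture that includes reference features, which is what nudges the generated images toward a shared style.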
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Style Transfer | User Study | Overall Quality Score | 74.4 | 30 |
| Style aligned image generation | 100 text prompts (test) | Text Alignment (CLIP Score) | 28.9 | 11 |
| Object Replacement and Style Blending | Object Replacement and Style Blending (800 pairs) (test) | BOSM | 0.4125 | 11 |
| Object Replacement and Object Blending | Unsplash 4,000 samples (test) | BOM | 0.2371 | 10 |
| Style Transfer | CIFAR-100 and InstaStyle (test) | Content Score | 28.1 | 9 |
| Text-to-Image Generation | In-the-wild image color condition | FID | 73.1 | 7 |
| Preference-conditioned image generation | PREFBENCH | FID | 167.5 | 7 |
| Preference-conditioned image generation | Pick-a-Pic processed | FID | 200.3 | 7 |
| Text-to-Image Generation | Sampled color condition (Manual) | FID | 177 | 7 |
| Style Transfer | Single image on A100 GPU (test) | Inference Time (s) | 18 | 7 |