SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
About
We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models
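The "larger cross-attention context" from a second text encoder can be pictured with a small shape-only sketch. It assumes the two encoders' per-token outputs are joined along the channel axis, and the token count and embedding widths below are illustrative assumptions; neither is stated in the abstract. The `encode` helper is a hypothetical stand-in, not a real text encoder.

```python
# Shape-only sketch: how a second text encoder widens the cross-attention context.
# All sizes are illustrative assumptions, not values taken from the abstract.
TOKENS = 77            # assumed sequence length shared by both encoders
DIM_ENCODER_A = 768    # assumed width of the first text encoder's token embeddings
DIM_ENCODER_B = 1280   # assumed width of the second text encoder's token embeddings


def encode(num_tokens, dim, fill):
    """Hypothetical stand-in for a text encoder: num_tokens vectors of width dim."""
    return [[fill] * dim for _ in range(num_tokens)]


def concat_context(ctx_a, ctx_b):
    """Join two token sequences along the channel axis, token by token."""
    if len(ctx_a) != len(ctx_b):
        raise ValueError("both encoders must produce the same number of tokens")
    return [a + b for a, b in zip(ctx_a, ctx_b)]


context = concat_context(
    encode(TOKENS, DIM_ENCODER_A, 0.0),
    encode(TOKENS, DIM_ENCODER_B, 1.0),
)
context_shape = (len(context), len(context[0]))
print(context_shape)
```

Under these assumptions each token's context vector is as wide as both encoders' outputs combined, which is the sense in which the context the UNet attends over grows.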
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score | 56 | 467 |
| Text-to-Image Generation | GenEval | GenEval Score | 55 | 277 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 74.65 | 173 |
| Text-to-Image Generation | GenEval (test) | Two Obj. Acc | 74 | 169 |
| Text-to-Image Generation | DPG | Overall Score | 74.65 | 131 |
| Text-to-Image Generation | MS-COCO 2014 (val) | -- | -- | 128 |
| Text-to-Image Generation | T2I-CompBench | Shape Fidelity | 54.08 | 94 |
| Image Reconstruction | ImageNet 256x256 | rFID | 0.68 | 93 |
| Text-to-Image Generation | DPG-Bench | DPG Score | 74.7 | 89 |
| Text-to-Image Generation | GenEval | Two Objects | 74 | 87 |