
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

About

We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: the increase in model parameters is mainly due to more attention blocks and a larger cross-attention context, as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach • 2023
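
The abstract mentions training on multiple aspect ratios but gives no procedure. One common way to do this is to group images into fixed resolution "buckets" with roughly equal pixel counts and assign each image to the bucket with the closest aspect ratio. The sketch below illustrates that idea only. The bucket list and the `nearest_bucket` helper are hypothetical and are not taken from this page or from the released code.

```python
import math

# Hypothetical (width, height) buckets, each within about 10% of 1024 * 1024 pixels.
# Illustrative values only; they are an assumption, not a list from the source.
BUCKETS = [
    (640, 1536), (768, 1344), (832, 1216), (896, 1152),
    (1024, 1024),
    (1152, 896), (1216, 832), (1344, 768), (1536, 640),
]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Return the bucket whose aspect ratio is closest to width / height.

    Distances are measured in log space so that a ratio of 2:1 and a ratio
    of 1:2 are treated as equally far from square.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - target))


if __name__ == "__main__":
    for size in [(1920, 1080), (1080, 1920), (500, 500)]:
        print(size, "->", nearest_bucket(*size))
```

Comparing ratios in log space is a design choice made for this sketch: it keeps landscape and portrait images symmetric around the square bucket. An actual training pipeline would also need to resize and crop each image to its assigned bucket, which is omitted here.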

Related benchmarks

Task                      Dataset             Result                Rank
Text-to-Image Generation  GenEval             Overall Score 56      467
Text-to-Image Generation  GenEval             GenEval Score 55      277
Text-to-Image Generation  DPG-Bench           Overall Score 74.65   173
Text-to-Image Generation  GenEval (test)      Two Obj. Acc 74       169
Text-to-Image Generation  DPG                 Overall Score 74.65   131
Text-to-Image Generation  MS-COCO 2014 (val)  --                    128
Text-to-Image Generation  T2I-CompBench       Shape Fidelity 54.08  94
Image Reconstruction      ImageNet 256x256    rFID 0.68             93
Text-to-Image Generation  DPG-Bench           DPG Score 74.7        89
Text-to-Image Generation  GenEval             Two Objects 74        87
Showing 10 of 179 rows
