SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

About

We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: the increase in model parameters is mainly due to more attention blocks and a larger cross-attention context, as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach • 2023

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K	mIoU 18.6	1028
Text-to-Image Generation	GenEval	Overall Score 56	704
Semantic segmentation	Cityscapes	mIoU 16.1	668
Text-to-Image Generation	GenEval	Overall Score 56	517
Text-to-Image Generation	DPG-Bench	Overall Score 74.7	451
Text-to-Image Generation	GenEval	GenEval Score 62	442
Text-to-Image Generation	GenEval	Overall Score 55.05	277
Text-to-Image Generation	DPG	Overall Score 74.65	256
Text-to-Image Generation	GenEval (test)	Two Obj. Acc 74	250
Text-to-Image Generation	MJHQ-30K	Overall FID 8.76	239

Showing 10 of 378 rows

...
