Transparent Image Layer Diffusion using Latent Transparency
About
We present LayerDiffuse, an approach that enables large-scale pretrained latent diffusion models to generate transparent images. The method supports generating a single transparent image or multiple transparent layers. It learns a "latent transparency" that encodes alpha-channel transparency into the latent manifold of a pretrained latent diffusion model. It preserves the production-ready quality of the large diffusion model by regulating the added transparency as a latent offset with minimal changes to the original latent distribution of the pretrained model. In this way, any latent diffusion model can be converted into a transparent image generator by finetuning it on the adjusted latent space. We train the model with 1M transparent image layer pairs collected using a human-in-the-loop collection scheme. We show that latent transparency can be applied to different open-source image generators, or adapted to various conditional control systems to achieve applications such as foreground/background-conditioned layer generation, joint layer generation, and structural control of layer contents. A user study finds that in most cases (97%) users prefer our natively generated transparent content over previous ad-hoc solutions such as generating and then matting. Users also report that the quality of our generated transparent images is comparable to real commercial transparent assets such as those on Adobe Stock.
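The core idea above, encoding transparency as a small, regulated offset added to the pretrained latent, can be sketched in a few lines. This is a toy NumPy illustration, not the paper's implementation: `encode_latent_transparency` stands in for the learned encoder, and the shapes, scale, and downsampling are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained latent for an image (SD-style 4x8x8 latent).
base_latent = rng.normal(size=(4, 8, 8))

def encode_latent_transparency(alpha, scale=0.05):
    """Toy stand-in for the learned "latent transparency" encoder.

    Maps a 64x64 alpha mask to a small offset at latent resolution;
    in LayerDiffuse this mapping is a trained network.
    """
    # Average-pool the 64x64 mask down to the 8x8 latent resolution.
    a = alpha.reshape(8, 8, 8, 8).mean(axis=(1, 3))
    # Broadcast across latent channels and keep the magnitude small.
    return scale * np.broadcast_to(a, (4, 8, 8))

alpha = rng.uniform(size=(64, 64))          # transparency mask in [0, 1]
offset = encode_latent_transparency(alpha)  # "latent transparency" offset
adjusted = base_latent + offset             # latent for a transparent image

# Because the offset is regulated to be small, the adjusted latent stays
# close to the pretrained model's latent distribution.
rel_change = np.linalg.norm(offset) / np.linalg.norm(base_latent)
print(round(rel_change, 3))
```

Keeping `rel_change` small is what lets the pretrained decoder and diffusion backbone continue to work after a light finetune on the adjusted latents.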
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| RGBA image reconstruction | AIM-500 (test) | PSNR | 32.0879 | 4 |
| Text-to-PSD | 50 text prompts User Study (test) | Layering Reasonableness | 3.33 | 3 |
| Text-to-PSD generation | Text-to-PSD | FID | 89.35 | 3 |
| Background Generation | ∞Bench | CLIP-FID (Compositional) | 43.2 | 2 |
| Foreground Generation | ∞Bench | CLIP-FID (Compositional) | 42 | 2 |
| Text-to-All Generation | ∞Bench | CLIP-FID (FG) | 45.2 | 2 |