Transparent Image Layer Diffusion using Latent Transparency
About
We present LayerDiffuse, an approach that enables large-scale pretrained latent diffusion models to generate transparent images. The method supports generating a single transparent image or multiple transparent layers. It learns a "latent transparency" that encodes alpha-channel transparency into the latent manifold of a pretrained latent diffusion model. It preserves the production-ready quality of the large diffusion model by regulating the added transparency as a latent offset with minimal changes to the original latent distribution of the pretrained model. In this way, any latent diffusion model can be converted into a transparent image generator by finetuning it on the adjusted latent space. We train the model with 1M transparent image layer pairs collected using a human-in-the-loop collection scheme. We show that latent transparency can be applied to different open-source image generators, or adapted to various conditional control systems to achieve applications such as foreground/background-conditioned layer generation, joint layer generation, and structural control of layer contents. A user study finds that in most cases (97%) users prefer our natively generated transparent content over previous ad-hoc solutions such as generating and then matting. Users also report that the quality of our generated transparent images is comparable to real commercial transparent assets such as those on Adobe Stock.
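The core idea above, encoding transparency as a small, regulated offset added to the pretrained latent, can be sketched in a few lines. This is a toy NumPy illustration, not the paper's implementation: `encode_latent_transparency` stands in for the learned encoder, and the shapes, scale, and downsampling are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained latent for an image (SD-style 4x8x8 latent).
base_latent = rng.normal(size=(4, 8, 8))

def encode_latent_transparency(alpha, scale=0.05):
    """Toy stand-in for the learned "latent transparency" encoder.

    Maps a 64x64 alpha mask to a small offset at latent resolution;
    in LayerDiffuse this mapping is a trained network.
    """
    # Average-pool the 64x64 mask down to the 8x8 latent resolution.
    a = alpha.reshape(8, 8, 8, 8).mean(axis=(1, 3))
    # Broadcast across latent channels and keep the magnitude small.
    return scale * np.broadcast_to(a, (4, 8, 8))

alpha = rng.uniform(size=(64, 64))          # transparency mask in [0, 1]
offset = encode_latent_transparency(alpha)  # "latent transparency" offset
adjusted = base_latent + offset             # latent for a transparent image

# Because the offset is regulated to be small, the adjusted latent stays
# close to the pretrained model's latent distribution.
rel_change = np.linalg.norm(offset) / np.linalg.norm(base_latent)
print(round(rel_change, 3))
```

Keeping `rel_change` small is what lets the pretrained decoder and diffusion backbone continue to work after a light finetune on the adjusted latents.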
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| RGBA image reconstruction | AIM-500 (test) | PSNR | 32.0879 | 4 |
| Text-to-PSD | 50 text prompts User Study (test) | Layering Reasonableness | 3.33 | 3 |
| Text-to-PSD generation | Text-to-PSD | FID | 89.35 | 3 |
| Background Generation | ∞Bench | CLIP-FID (Compositional) | 43.2 | 2 |
| Foreground Generation | ∞Bench | CLIP-FID (Compositional) | 42 | 2 |
| Text-to-All Generation | ∞Bench | CLIP-FID (FG) | 45.2 | 2 |