
DiffBlender: Composable and Versatile Multimodal Text-to-Image Diffusion Models

About

In this study, we aim to enhance the capabilities of diffusion-based text-to-image (T2I) generation models by integrating diverse modalities beyond textual descriptions within a unified framework. To this end, we categorize widely used conditional inputs into three modality types: structure, layout, and attribute. We propose DiffBlender, a multimodal T2I diffusion model capable of processing all three modalities within a single architecture without modifying the parameters of the pre-trained diffusion model; only a small subset of components is updated. Our approach sets new benchmarks in multimodal generation through extensive quantitative and qualitative comparisons with existing conditional generation methods. We demonstrate that DiffBlender effectively integrates multiple sources of information and supports diverse applications in detailed image synthesis. The code and demo are available at https://github.com/sungnyun/diffblender.
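The key design point above is that the pre-trained diffusion backbone stays frozen while small per-modality components are trained. Below is a minimal PyTorch sketch of that recipe, not the authors' implementation: the ConditionAdapter module, its dimensions, and the fusion-by-summation step are illustrative assumptions standing in for DiffBlender's actual conditioning components.

```python
# Sketch of the frozen-backbone training recipe described in the abstract.
# All module names, sizes, and the fusion rule are illustrative placeholders.
import torch
import torch.nn as nn

class ConditionAdapter(nn.Module):
    """Tiny trainable module mapping one conditional input
    (structure, layout, or attribute) into the backbone's feature space."""
    def __init__(self, in_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        return self.net(cond)

# Stand-in for a pre-trained T2I diffusion backbone (in practice, a UNet
# loaded from a checkpoint). Its parameters are never updated.
backbone = nn.Linear(256, 256)
for p in backbone.parameters():
    p.requires_grad_(False)

# One small adapter per modality type; only these receive gradients.
adapters = nn.ModuleDict({
    "structure": ConditionAdapter(in_dim=64),
    "layout":    ConditionAdapter(in_dim=32),
    "attribute": ConditionAdapter(in_dim=16),
})
optimizer = torch.optim.AdamW(adapters.parameters(), lr=1e-4)

# Toy training step: fuse the adapter outputs (here, by summation) and
# pass them through the frozen backbone.
conds = {
    "structure": torch.randn(4, 64),
    "layout":    torch.randn(4, 32),
    "attribute": torch.randn(4, 16),
}
fused = sum(adapters[name](c) for name, c in conds.items())
pred = backbone(fused)
loss = pred.pow(2).mean()  # placeholder for the diffusion denoising loss
loss.backward()
optimizer.step()

# Sanity check: no gradient ever reaches the frozen backbone.
assert all(p.grad is None for p in backbone.parameters())
```

Freezing the backbone preserves the original T2I generation capability and keeps training inexpensive, since gradients flow only through the small adapters.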

Sungnyun Kim, Junsoo Lee, Kibeom Hong, Daesik Kim, Namhyuk Ahn • 2023

Related benchmarks

Task                  Dataset           FID     Rank
Global Colourisation  Danbooru 2023     232     5
Global Colourisation  Place365 Outdoor  144.4   4
Global Colourisation  PascalVOC 2012    82.15   4
Global Colourisation  AFHQ Cat          86.82   4
Global Colourisation  AFHQ Dog          145.5   4
Global Colourisation  Place365 Indoor   133.1   4
