
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

About

Recent advancements in text-to-image diffusion models have yielded impressive results in generating realistic and diverse images. However, these models still struggle with complex prompts, such as those that involve numeracy and spatial reasoning. This work proposes to enhance prompt understanding capabilities in diffusion models. Our method leverages a pretrained large language model (LLM) for grounded generation in a novel two-stage process. In the first stage, the LLM generates a scene layout that comprises captioned bounding boxes from a given prompt describing the desired image. In the second stage, a novel controller guides an off-the-shelf diffusion model for layout-grounded image generation. Both stages utilize existing pretrained models without additional model parameter optimization. Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images according to prompts that require various capabilities, doubling the generation accuracy across four tasks on average. Furthermore, our method enables instruction-based multi-round scene specification and can handle prompts in languages not supported by the underlying diffusion model. We anticipate that our method will unleash users' creativity by accurately following more complex prompts. Our code, demo, and benchmark are available at: https://llm-grounded-diffusion.github.io
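The first stage described above, in which an LLM turns a prompt into captioned bounding boxes, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the authors' code: the 512-pixel canvas, the `(caption, [x, y, width, height])` text format, and the names `CaptionedBox`, `generate_layout`, and `parse_layout`. The LLM call is replaced by a canned response so the example runs on its own.

```python
import ast
from dataclasses import dataclass
from typing import List

CANVAS = 512  # assumed square canvas size in pixels (hypothetical)


@dataclass(frozen=True)
class CaptionedBox:
    """One layout element: a caption plus a box in pixel coordinates."""
    caption: str
    x: int
    y: int
    width: int
    height: int


def generate_layout(prompt: str) -> str:
    """Hypothetical stand-in for the stage-one LLM call.

    A real system would send `prompt` to an LLM; here a canned response
    in the assumed text format is returned instead.
    """
    return (
        "[('a gray cat', [60, 240, 150, 140]), "
        "('a gray cat', [300, 240, 150, 140]), "
        "('a wooden table', [0, 380, 600, 132])]"
    )


def parse_layout(llm_text: str) -> List[CaptionedBox]:
    """Parse the assumed format and clip every box to the canvas."""
    raw = ast.literal_eval(llm_text)  # safe parsing of Python literals only
    boxes = []
    for caption, (x, y, w, h) in raw:
        # Clip both corners to the canvas, then drop empty boxes.
        x0, y0 = max(0, min(x, CANVAS)), max(0, min(y, CANVAS))
        x1, y1 = max(0, min(x + w, CANVAS)), max(0, min(y + h, CANVAS))
        if x1 > x0 and y1 > y0:
            boxes.append(CaptionedBox(caption, x0, y0, x1 - x0, y1 - y0))
    return boxes


layout = parse_layout(generate_layout("two gray cats on a wooden table"))
for box in layout:
    print(box)
```

An explicit layout of this kind makes numeracy checkable before any image is generated: counting the boxes that share a caption gives the number of objects the second stage will be asked to render.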

Long Lian, Boyi Li, Adam Yala, Trevor Darrell • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Image Generation | T2I-CompBench | Shape Fidelity | 51.04 | 185 |
| Video Generation | VBench | Quality Score | 81.86 | 126 |
| Text-to-Video Generation | T2V-CompBench | Consistency Attribute Score | 0.861 | 63 |
| Video Generation | VBench | Background Consistency | 97.19 | 16 |
| Controllable Image Generation (Counting) | COUNTLOOP-S Single Category | Counting MAE | 16.62 | 15 |
| Controllable Image Generation (Counting) | COUNTLOOP-M Multi Categories | Counting MAE | 6.34 | 15 |
| Controllable Image Generation (Counting) | COCO-Count Single Category | Counting MAE | 3.09 | 15 |
| Controllable Image Generation (Counting) | T2I-CompBench Single Category | Counting MAE | 5.56 | 15 |
| Text-to-Image Alignment | RareBench | Property (Single Object) | 23.8 | 11 |
| Text-to-Image Alignment | RareBench v1 (test) | Property Alignment | 23.8 | 11 |
