Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

About

Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a brand new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate that our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: https://github.com/YangLing0818/RPG-DiffusionMaster

Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, Bin Cui • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-to-Image Generation | GenEval | Overall Score: 50 | 391 |
| Text-to-Image Generation | T2I-CompBench | Shape Fidelity: 49.03 | 185 |
| Text-to-Image Generation | T2I-CompBench | Color Fidelity: 0.6406 | 46 |
| Controllable Image Generation (Counting) | COUNTLOOP-M Multi Categories | Counting MAE: 4.34 | 15 |
| Controllable Image Generation (Counting) | T2I-CompBench Single Category | Counting MAE: 1.47 | 15 |
| Controllable Image Generation (Counting) | COCO-Count Single Category | Counting MAE: 1.28 | 15 |
| Controllable Image Generation (Counting) | COUNTLOOP-S Single Category | Counting MAE: 31.85 | 15 |
| Object-Background Compositional Text-to-Image Generation | Object-Background Compositional T2I Evaluation Dataset | CLIP_I: 0.338 | 13 |
| Text-to-Image Generation | User Study 12 Prompts (test) | Win Rate (Full Description): 38.76 | 13 |
| Text-to-Image Alignment | RareBench | Property (Single Object): 33.8 | 11 |
Showing 10 of 14 rows
