
DivCon: Divide and Conquer for Complex Numerical and Spatial Reasoning in Text-to-Image Generation

About

Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements in recent years. To further improve T2I models' capability in numerical and spatial reasoning, layout is employed as an intermediary to bridge large language models and layout-based diffusion models. However, these methods often rely on closed-source, large-scale LLMs for layout prediction, limiting accessibility and scalability. They also struggle to generate images from prompts with multiple objects and complicated spatial relationships. To tackle these challenges, we introduce a divide-and-conquer approach that decouples the generation task into multiple subtasks. First, the layout prediction stage is divided into numerical & spatial reasoning and bounding box visual planning, enabling even lightweight LLMs to achieve layout accuracy comparable to large-scale models. Second, the layout-to-image generation stage is divided into two steps that synthesize objects from easy to difficult. Experiments are conducted on the HRS and NSR-1K benchmarks, where our method outperforms previous approaches by notable margins. In addition, visual results and a user study demonstrate that our approach significantly improves perceptual quality, especially when generating multiple objects from complex textual prompts.
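The two-stage decomposition the abstract describes can be illustrated with a toy sketch. Everything below is hypothetical: the real method uses an LLM for the reasoning step and a layout-conditioned diffusion model for synthesis, while this sketch stubs both with simple deterministic placement and an easy/difficult split by instance count. All names (`ObjectSpec`, `plan_boxes`, `split_by_difficulty`) are illustrative, not the authors' API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

@dataclass
class ObjectSpec:
    """Output of the (stubbed) reasoning step: object name, instance
    count from numerical reasoning, and a coarse spatial relation."""
    name: str
    count: int
    region: str  # "left" or "right" half of the canvas (illustrative)

def plan_boxes(specs: List[ObjectSpec], canvas: int = 512) -> Dict[str, List[Box]]:
    """Stage 1b sketch: bounding-box visual planning. Each object's
    instances are stacked vertically inside the half-canvas that its
    reasoned spatial relation assigns to it."""
    layout: Dict[str, List[Box]] = {}
    for spec in specs:
        x0 = 0 if spec.region == "left" else canvas // 2
        h = canvas // spec.count  # split the column evenly among instances
        layout[spec.name] = [
            (x0, i * h, x0 + canvas // 2, (i + 1) * h)
            for i in range(spec.count)
        ]
    return layout

def split_by_difficulty(specs: List[ObjectSpec]):
    """Stage 2 sketch: partition objects so 'easy' (single-instance)
    ones are synthesized first and 'difficult' (multi-instance) ones
    are added in a second pass."""
    easy = [s for s in specs if s.count == 1]
    hard = [s for s in specs if s.count > 1]
    return easy, hard

# Example: "three cats to the left of a dog"
specs = [ObjectSpec("dog", 1, "right"), ObjectSpec("cat", 3, "left")]
layout = plan_boxes(specs)
easy, hard = split_by_difficulty(specs)
```

The point of the decomposition is that the fragile part (counting and relative placement) is isolated into a small, checkable planning step, so the image model only has to fill boxes rather than reason about the prompt.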

Yuhao Jia, Wenhan Tan · 2024

Related benchmarks

Task                      Dataset           Metric     Result   Rank
Layout prediction         HRS numerical     Precision  93.28    11
Layout prediction         NSR-1K spatial    Accuracy   94.97    11
Layout prediction         HRS-Spatial       Accuracy   81.46    11
Layout prediction         NSR-1K numerical  Precision  97.45    11
Numerical Reasoning       HRS               Precision  78.65    8
Numerical Reasoning       NSR-1K            Precision  85.41    8
Spatial Reasoning         HRS               Accuracy   53.96    8
Spatial Reasoning         NSR-1K            Accuracy   71.65    8
Text-to-Image Generation  NSR-1K            FID        21.01    3
