
DivCon: Divide and Conquer for Complex Numerical and Spatial Reasoning in Text-to-Image Generation

About

Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements in recent years. To further improve T2I models' capability in numerical and spatial reasoning, layout is employed as an intermediary to bridge large language models and layout-based diffusion models. However, these methods often rely on closed-source, large-scale LLMs for layout prediction, limiting accessibility and scalability. They also struggle to generate images from prompts with multiple objects and complicated spatial relationships. To tackle these challenges, we introduce a divide-and-conquer approach that decouples the generation task into multiple subtasks. First, the layout prediction stage is divided into numerical & spatial reasoning and bounding box visual planning, enabling even lightweight LLMs to achieve layout accuracy comparable to large-scale models. Second, the layout-to-image generation stage is divided into two steps that synthesize objects from easy to difficult. Experiments are conducted on the HRS and NSR-1K benchmarks, where our method outperforms previous approaches by notable margins. In addition, visual results and a user study demonstrate that our approach significantly improves perceptual quality, especially when generating multiple objects from complex textual prompts.
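The two-stage decomposition the abstract describes can be illustrated with a toy sketch. Everything below is hypothetical: the real method uses an LLM for the reasoning step and a layout-conditioned diffusion model for synthesis, while this sketch stubs both with simple deterministic placement and an easy/difficult split by instance count. All names (`ObjectSpec`, `plan_boxes`, `split_by_difficulty`) are illustrative, not the authors' API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

@dataclass
class ObjectSpec:
    """Output of the (stubbed) reasoning step: object name, instance
    count from numerical reasoning, and a coarse spatial relation."""
    name: str
    count: int
    region: str  # "left" or "right" half of the canvas (illustrative)

def plan_boxes(specs: List[ObjectSpec], canvas: int = 512) -> Dict[str, List[Box]]:
    """Stage 1b sketch: bounding-box visual planning. Each object's
    instances are stacked vertically inside the half-canvas that its
    reasoned spatial relation assigns to it."""
    layout: Dict[str, List[Box]] = {}
    for spec in specs:
        x0 = 0 if spec.region == "left" else canvas // 2
        h = canvas // spec.count  # split the column evenly among instances
        layout[spec.name] = [
            (x0, i * h, x0 + canvas // 2, (i + 1) * h)
            for i in range(spec.count)
        ]
    return layout

def split_by_difficulty(specs: List[ObjectSpec]):
    """Stage 2 sketch: partition objects so 'easy' (single-instance)
    ones are synthesized first and 'difficult' (multi-instance) ones
    are added in a second pass."""
    easy = [s for s in specs if s.count == 1]
    hard = [s for s in specs if s.count > 1]
    return easy, hard

# Example: "three cats to the left of a dog"
specs = [ObjectSpec("dog", 1, "right"), ObjectSpec("cat", 3, "left")]
layout = plan_boxes(specs)
easy, hard = split_by_difficulty(specs)
```

The point of the decomposition is that the fragile part (counting and relative placement) is isolated into a small, checkable planning step, so the image model only has to fill boxes rather than reason about the prompt.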

Yuhao Jia, Wenhan Tan · 2024

Related benchmarks

Task                      Dataset           Metric     Result   Rank
Layout prediction         HRS numerical     Precision  93.28    11
Layout prediction         NSR-1K spatial    Accuracy   94.97    11
Layout prediction         HRS-Spatial       Accuracy   81.46    11
Layout prediction         NSR-1K numerical  Precision  97.45    11
Numerical Reasoning       HRS               Precision  78.65    8
Numerical Reasoning       NSR-1K            Precision  85.41    8
Spatial Reasoning         HRS               Accuracy   53.96    8
Spatial Reasoning         NSR-1K            Accuracy   71.65    8
Text-to-Image Generation  NSR-1K            FID        21.01    3
