T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation

About

Despite impressive advances in text-to-image models, they often struggle to compose complex scenes containing multiple objects with varied attributes and relationships. To address this challenge, we present T2I-CompBench++, an enhanced benchmark for compositional text-to-image generation. T2I-CompBench++ comprises 8,000 compositional text prompts categorized into four primary groups: attribute binding, object relationships, generative numeracy, and complex compositions. These are further divided into eight sub-categories, including newly introduced ones such as 3D-spatial relationships and numeracy. In addition to the benchmark, we propose enhanced evaluation metrics designed to assess these diverse compositional challenges. These include a detection-based metric tailored for evaluating 3D-spatial relationships and numeracy, and an analysis of Multimodal Large Language Models (MLLMs), namely GPT-4V and ShareGPT4V, as evaluation metrics. Our experiments benchmark 11 text-to-image models on T2I-CompBench++, including state-of-the-art models such as FLUX.1, SD3, DALLE-3, PixArt-α, and SD-XL. We also conduct comprehensive evaluations to validate the effectiveness of our metrics and explore the potential and limitations of MLLMs.
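As a rough illustration of what a detection-based numeracy check could look like, the sketch below compares the object counts a prompt asks for against the labels returned by an object detector. It is not the metric defined in the paper: the function name `numeracy_score`, the exact-match scoring rule, and the example labels are all assumptions made for illustration, and the detector output is a hand-written stand-in list.

```python
from collections import Counter


def numeracy_score(expected, detected_labels):
    """Return the fraction of expected object classes whose detected
    count matches the requested count exactly.

    expected:        dict mapping class name -> count requested by the prompt
    detected_labels: list of class names, one entry per detected object
    (Hypothetical scoring rule, not the paper's definition.)
    """
    if not expected:
        raise ValueError("expected must name at least one object class")
    counts = Counter(detected_labels)
    # Counter returns 0 for classes the detector never reported.
    matches = sum(1 for name, n in expected.items() if counts[name] == n)
    return matches / len(expected)


# Hypothetical prompt: "three apples and two dogs"
expected = {"apple": 3, "dog": 2}
# Stand-in for detector output on one generated image
detected = ["apple", "apple", "apple", "dog"]

score = numeracy_score(expected, detected)
print(score)
```

Here the apple count matches and the dog count does not, so the image gets partial credit. A real metric would also need a detection confidence threshold and a policy for objects the prompt never mentioned, both of which this sketch leaves out.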

Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, Xihui Liu • 2023

Related benchmarks

Task	Dataset	Result
Text-to-Image Generation	T2I-CompBench++	Non-Spatial 0.7708	65
Text-to-Image Generation	T2I-CompBench	Evaluation Time (min) 141	12
Text-to-Image Generation	T2I-CompBench Color ++	M-LPIPS 0.621	3
Text-to-Image Generation	User Study SD 2.1	Preference Rate 19.17	3
Text-to-Image Generation	User Study SD 3	Preference Rate 21.67	3
Visual Generative Model Evaluation	Visual Generative Evaluation Frameworks	Required Samples 1.80e+4	3
