
CROC: Evaluating and Training T2I Metrics with Pseudo- and Human-Labeled Contrastive Robustness Checks

About

The assessment of evaluation metrics (meta-evaluation) is crucial for determining the suitability of existing metrics in text-to-image (T2I) generation tasks. Human-based meta-evaluation is costly and time-intensive, and automated alternatives are scarce. We address this gap and propose CROC: a scalable framework for automated Contrastive Robustness Checks that systematically probes and quantifies metric robustness by synthesizing contrastive test cases across a comprehensive taxonomy of image properties. With CROC, we generate a pseudo-labeled dataset (CROC$^{syn}$) of over 1 million contrastive prompt-image pairs to enable a fine-grained comparison of evaluation metrics. We also use this dataset to train CROCScore, a new metric that achieves state-of-the-art performance among open-source methods, demonstrating an additional key application of our framework. To complement this dataset, we introduce a human-supervised benchmark (CROC$^{hum}$) targeting especially challenging categories. Our results highlight robustness issues in existing metrics: for example, many fail on prompts involving negation, and all tested open-source metrics fail on at least 24% of cases involving correct identification of body parts.
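The core idea of a contrastive robustness check can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a metric maps a (prompt, image) pair to a score where higher means a better match, and that each test case pairs a prompt with one matching and one contrasting image. The names `contrastive_failure_rate` and `toy_overlap_metric` are hypothetical, and images are stood in for by plain tag strings so the example runs without any model.

```python
from typing import Callable, Sequence, Tuple

# Assumption: a metric scores how well an image matches a prompt (higher = better).
Metric = Callable[[str, str], float]
# Assumption: a test case is (prompt, matching_image, contrasting_image).
Case = Tuple[str, str, str]


def contrastive_failure_rate(metric: Metric, cases: Sequence[Case]) -> float:
    """Fraction of cases where the matching image does not outscore the contrast."""
    if not cases:
        raise ValueError("at least one test case is required")
    failures = sum(
        1
        for prompt, match, contrast in cases
        if metric(prompt, match) <= metric(prompt, contrast)
    )
    return failures / len(cases)


def toy_overlap_metric(prompt: str, image_tags: str) -> float:
    """Toy stand-in metric: counts prompt words that appear among the image tags."""
    tags = set(image_tags.split())
    return float(sum(1 for word in prompt.split() if word in tags))


# Images are represented by tag strings purely for illustration.
cases = [
    ("a red car", "red car", "blue car"),        # attribute change
    ("a dog without a hat", "dog", "dog hat"),   # negation
]
rate = contrastive_failure_rate(toy_overlap_metric, cases)
print(f"failure rate: {rate:.2f}")
```

In this toy setup the word-overlap metric handles the attribute change but prefers the contrasting image for the negated prompt, which mirrors the kind of negation failure the abstract describes.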

Christoph Leiter, Yuki M. Asano, Margret Keuper, Steffen Eger • 2025

Related benchmarks

Task | Dataset | Result | Rank
T2I Metric Evaluation | CROCsyn Forward Text-Based | -- | 18
T2I Metric Evaluation | CROCsyn Inverse Text-Based | -- | 18
T2I Metric Evaluation | CROCsyn Forward Image-Based | -- | 18
T2I Metric Evaluation | CROCsyn Inverse Image-Based | -- | 18
Text-to-image generation evaluation | GenAI-Bench | Kendall Tau B (Basic): 0.446 | 5
Text-to-Image Metric Meta-Evaluation | TIFA (Original) | Kendall Correlation: 0.55 | 5
Text-to-Image Metric Meta-Evaluation | TIFA DSG | Kendall Correlation: 0.538 | 5
Vision-Language Understanding | Winoground | Text Accuracy: 61.5 | 5
