What You See is What You Read? Improving Text-Image Alignment Evaluation
About
Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
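To illustrate the first of the two methods, below is a minimal sketch of a question-generation-plus-VQA alignment scorer. This is not the paper's implementation: the checkpoints (`google/flan-t5-base` for question generation, `dandelin/vilt-b32-finetuned-vqa` for answering) and the reduction to yes/no questions are assumptions chosen so the snippet runs with off-the-shelf Hugging Face models.

```python
# Minimal sketch of a QG + VQA text-image alignment scorer.
# Model choices are illustrative assumptions, not the paper's exact checkpoints.
from PIL import Image
from transformers import pipeline

# Generate verification questions from the candidate text.
qg = pipeline("text2text-generation", model="google/flan-t5-base")
# Answer those questions against the image.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")


def alignment_score(image: Image.Image, text: str, n_questions: int = 3) -> float:
    """Score alignment as the mean VQA probability of an affirmative answer
    to yes/no questions derived from the text (a simplification of the
    QA-pair matching described in the paper)."""
    prompt = (
        "Write a yes/no question whose answer is yes if and only if the "
        f"following caption is true: {text}"
    )
    outputs = qg(prompt, num_return_sequences=n_questions, do_sample=True)
    questions = [o["generated_text"] for o in outputs]

    scores = []
    for question in questions:
        answers = vqa(image=image, question=question, top_k=10)
        # Probability mass the VQA model assigns to "yes" for this question.
        p_yes = sum(a["score"] for a in answers if a["answer"].lower() == "yes")
        scores.append(p_yes)
    return sum(scores) / len(scores)


image = Image.open("example.jpg")  # hypothetical input image
print(alignment_score(image, "a brown dog catching a red frisbee"))
```

The same score can serve the re-ranking application mentioned above: generate several candidate images for a prompt, sort them by `alignment_score`, and keep the highest-scoring one.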
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Compositional Vision-Language Reasoning | Winoground | Text Score | 47 | 47 |
| Video Question Answering | NExT-QA (ATP-Hard) | Overall Accuracy | 39 | 27 |
| Image-Text Matching | Winoground | -- | -- | 26 |
| Classification | Pets | AURC | 0.213 | 23 |
| Image-Text Matching | VL-Checklist | AURC | 0.234 | 23 |
| Image-Text Matching | FOIL | AURC | 0.223 | 23 |
| Classification | UCF101 | AURC | 0.171 | 23 |
| Classification | Flowers | AURC | 0.214 | 23 |
| Image-Text Matching | What’sUp | AURC | 23.6 | 23 |
| Element-Level Text-to-Image Alignment Evaluation | EvalMuse-40K | SRCC | 67.9 | 17 |