What You See is What You Read? Improving Text-Image Alignment Evaluation
About
Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
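To illustrate the first of the two methods, below is a minimal sketch of a question-generation-plus-VQA alignment scorer. This is not the paper's implementation: the checkpoints (`google/flan-t5-base` for question generation, `dandelin/vilt-b32-finetuned-vqa` for answering) and the reduction to yes/no questions are assumptions chosen so the snippet runs with off-the-shelf Hugging Face models.

```python
# Minimal sketch of a QG + VQA text-image alignment scorer.
# Model choices are illustrative assumptions, not the paper's exact checkpoints.
from PIL import Image
from transformers import pipeline

# Generate verification questions from the candidate text.
qg = pipeline("text2text-generation", model="google/flan-t5-base")
# Answer those questions against the image.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")


def alignment_score(image: Image.Image, text: str, n_questions: int = 3) -> float:
    """Score alignment as the mean VQA probability of an affirmative answer
    to yes/no questions derived from the text (a simplification of the
    QA-pair matching described in the paper)."""
    prompt = (
        "Write a yes/no question whose answer is yes if and only if the "
        f"following caption is true: {text}"
    )
    outputs = qg(prompt, num_return_sequences=n_questions, do_sample=True)
    questions = [o["generated_text"] for o in outputs]

    scores = []
    for question in questions:
        answers = vqa(image=image, question=question, top_k=10)
        # Probability mass the VQA model assigns to "yes" for this question.
        p_yes = sum(a["score"] for a in answers if a["answer"].lower() == "yes")
        scores.append(p_yes)
    return sum(scores) / len(scores)


image = Image.open("example.jpg")  # hypothetical input image
print(alignment_score(image, "a brown dog catching a red frisbee"))
```

The same score can serve the re-ranking application mentioned above: generate several candidate images for a prompt, sort them by `alignment_score`, and keep the highest-scoring one.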
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Compositional Vision-Language Reasoning | Winoground | Text Score | 47 | 47 |
| Video Question Answering | NExT-QA (ATP-Hard) | Overall Accuracy | 39 | 27 |
| Image-Text Matching | Winoground | -- | -- | 26 |
| Classification | Pets | AURC | 0.213 | 23 |
| Image-Text Matching | VL-Checklist | AURC | 0.234 | 23 |
| Image-Text Matching | FOIL | AURC | 0.223 | 23 |
| Classification | UCF101 | AURC | 0.171 | 23 |
| Classification | Flowers | AURC | 0.214 | 23 |
| Image-Text Matching | What’sUp | AURC | 23.6 | 23 |
| Element-Level Text-to-Image Alignment Evaluation | EvalMuse-40K | SRCC | 67.9 | 17 |