TIGEr: Text-to-Image Grounding for Image Caption Evaluation

About

This paper presents a new metric called TIGEr for the automatic evaluation of image captioning systems. Popular metrics, such as BLEU and CIDEr, are based solely on text matching between reference captions and machine-generated captions, potentially leading to biased evaluations because references may not fully cover the image content and natural language is inherently ambiguous. Building upon a machine-learned text-image grounding model, TIGEr allows to evaluate caption quality not only based on how well a caption represents image content, but also on how well machine-generated captions match human-generated captions. Our empirical tests show that TIGEr has a higher consistency with human judgments than alternative existing metrics. We also comprehensively assess the metric's effectiveness in caption evaluation by measuring the correlation between human judgments and metric scores.

Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, Jianfeng Gao• 2019

Related benchmarks

Task	Dataset	Result
Image Captioning Evaluation	Composite	Kendall-c Tau_c45.4	131
Image Captioning Evaluation	Flickr8k Expert	Kendall Tau-c (tau_c)49.3	82
Image Captioning Evaluation	Flickr8K Expert (test)	Kendall tau_c49.3	76
Image Captioning Evaluation	Pascal-50S (test)	HC56	66
Image Captioning Evaluation	Pascal-50S	Accuracy80.7	44
Image Captioning	Flickr8k-EX	Tau-c0.493	22
Caption-level correlation with human judgment	Composite (test)	Kendall's Tau0.454	21
Correlation with Human Judgments	Composite (test)	Kendall's Tau-c45.4	18
Image Captioning Evaluation	COMPOSITE (COM) (test)	Kendall's tau-c45.4	17
Image-to-Text Retrieval	NoCaps	R@163.8	17

Showing 10 of 14 rows

Other info

Follow for update

@wizwand_team Discord