CIDEr: Consensus-based Image Description Evaluation

About

Automatically describing an image with a sentence is a long-standing challenge in computer vision and natural language processing. Due to recent progress in object detection, attribute classification, action recognition, etc., there is renewed interest in this area. However, evaluating the quality of descriptions has proven to be challenging. We propose a novel paradigm for evaluating image descriptions that uses human consensus. This paradigm consists of three main parts: a new triplet-based method of collecting human annotations to measure consensus, a new automated metric (CIDEr) that captures consensus, and two new datasets: PASCAL-50S and ABSTRACT-50S that contain 50 sentences describing each image. Our simple metric captures human judgment of consensus better than existing metrics across sentences generated by various sources. We also evaluate five state-of-the-art image description approaches using this new protocol and provide a benchmark for future comparisons. A version of CIDEr named CIDEr-D is available as a part of MS COCO evaluation server to enable systematic evaluation and benchmarking.

Ramakrishna Vedantam, C. Lawrence Zitnick, Devi Parikh• 2014

Related benchmarks

Task	Dataset	Result
Multimodal Sentiment Analysis	CMU-MOSI (test)	F183.8	385
Image Captioning Evaluation	Composite	Kendall-c Tau_c37.7	131
Image Captioning Evaluation	Flickr8K-CF	Kendall-b Correlation (tau_b)43.6	115
Image Captioning Evaluation	Flickr8k Expert	Kendall Tau-c (tau_c)43.9	82
Image Captioning Evaluation	Flickr8K Expert (test)	Kendall tau_c43.9	76
Multimodal Sentiment Analysis	CMU-MOSI v1 (test)	Accuracy (2-Class)81.1	72
Image Captioning Evaluation	Pascal-50S (test)	HC66.5	66
Image Captioning Evaluation	Flickr8K-CF (test)	Kendall tau_b24.6	65
Multimodal Sentiment Analysis	CMU-MOSI 43 (test)	2-Class Accuracy81.1	56
Correlation with human judgment	Flickr8K-CF	Tau B24.6	48

Showing 10 of 46 rows

Other info

Follow for update

@wizwand_team Discord