
CLIPScore: A Reference-free Evaluation Metric for Image Captioning

About

Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.
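As a rough illustration of what the metric computes (a sketch, not the authors' code): the paper defines CLIPScore as a rescaled cosine similarity between the CLIP embedding of the image and that of the candidate caption, clipped at zero, with a rescaling weight w = 2.5. RefCLIPScore is the harmonic mean of CLIPScore and the highest caption-to-reference cosine similarity, also clipped at zero. The example below assumes the embeddings have already been produced by CLIP and are passed in as plain lists of floats; the function names and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def clip_score(image_emb, caption_emb, w=2.5):
    """Reference-free score: w * max(cos(image, caption), 0)."""
    return w * max(cosine(image_emb, caption_emb), 0.0)


def ref_clip_score(image_emb, caption_emb, reference_embs, w=2.5):
    """Harmonic mean of CLIPScore and the best caption-to-reference similarity."""
    s = clip_score(image_emb, caption_emb, w)
    r = max(max(cosine(caption_emb, ref) for ref in reference_embs), 0.0)
    if s == 0.0 or r == 0.0:
        return 0.0  # the harmonic mean is zero when either term is zero
    return 2.0 * s * r / (s + r)


# Toy 2-d vectors standing in for real CLIP embeddings (assumption).
image = [1.0, 0.0]
good_caption = [1.0, 0.0]
unrelated_caption = [0.0, 1.0]
references = [[1.0, 0.0], [0.0, 1.0]]

print(clip_score(image, good_caption))
print(clip_score(image, unrelated_caption))
print(ref_clip_score(image, good_caption, references))
```

In practice the embeddings would come from CLIP's image and text encoders; only the scoring arithmetic is shown here.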

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi • 2021

Related benchmarks

Task                                 | Dataset            | Result               | Rank
Object Hallucination Evaluation      | POPE               | Accuracy 74.57       | 2019
Visual Question Answering            | VizWiz             | Accuracy 43          | 1820
Text-based Visual Question Answering | TextVQA            | Accuracy 54.7        | 962
Image Classification                 | ImageNet 1k (test) | Top-1 Accuracy 27.3  | 880
Science Question Answering           | ScienceQA          | Accuracy 69.06       | 791
Multimodal Understanding             | SEED-Bench         | Accuracy 61.7        | 516
Visual Question Answering            | VQA v2             | Accuracy 73.4        | 333
Scientific Question Answering        | ScienceQA image    | Accuracy 65          | 259
Multi-modal Evaluation               | MME                | MME Score 1.57e+3    | 160
Image Captioning Evaluation          | Composite          | Kendall-c Tau_c 57.3 | 131
Showing 10 of 161 rows