CLIPScore: A Reference-free Evaluation Metric for Image Captioning

About

Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.
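The abstract does not spell out the formula. The sketch below follows the definition given in the body of the paper: CLIPScore is a rescaled, clipped cosine similarity between the image and caption embeddings (with w = 2.5), and RefCLIPScore is the harmonic mean of CLIPScore and the caption's best clipped cosine similarity to any reference. The function names are hypothetical, and the inputs are assumed to be non-zero embedding vectors already produced by CLIP's image and text encoders; no model is loaded here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(image_emb, caption_emb, w=2.5):
    """Reference-free score: w * max(cos(image, caption), 0)."""
    return w * max(cosine(image_emb, caption_emb), 0.0)


def ref_clip_score(image_emb, caption_emb, reference_embs, w=2.5):
    """Harmonic mean of CLIPScore and the best caption-to-reference similarity."""
    clip_s = clip_score(image_emb, caption_emb, w)
    ref_sim = max(max(cosine(caption_emb, r) for r in reference_embs), 0.0)
    if clip_s == 0.0 or ref_sim == 0.0:
        return 0.0  # harmonic mean is zero when either term is zero
    return 2.0 * clip_s * ref_sim / (clip_s + ref_sim)


# Toy 2-D vectors standing in for real CLIP embeddings.
image = [1.0, 0.0]
caption = [1.0, 0.0]
references = [[1.0, 0.0], [0.0, 1.0]]

print(clip_score(image, caption))
print(ref_clip_score(image, caption, references))
```

Clipping at zero keeps a caption that points away from the image from receiving a negative score, and the harmonic mean only rewards a caption that agrees with both the image and at least one reference.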

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi • 2021

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 74.57	2019
Visual Question Answering	VizWiz	Accuracy 43	1820
Text-based Visual Question Answering	TextVQA	Accuracy 54.7	962
Image Classification	ImageNet 1k (test)	Top-1 Accuracy 27.3	880
Science Question Answering	ScienceQA	Accuracy 69.06	791
Multimodal Understanding	SEED-Bench	Accuracy 61.7	516
Visual Question Answering	VQA v2	Accuracy 73.4	333
Scientific Question Answering	ScienceQA image	Accuracy 65	259
Multi-modal Evaluation	MME	MME Score 1.57e+3	160
Image Captioning Evaluation	Composite	Kendall-c Tau_c 57.3	131

Showing 10 of 161 rows
