HICEScore: A Hierarchical Metric for Image Captioning Evaluation

About

Image captioning evaluation metrics fall into two categories: reference-based and reference-free. Reference-based approaches may struggle to evaluate descriptive captions with abundant visual details produced by advanced multimodal large language models, because they rely heavily on a limited set of human-annotated references. Previous reference-free metrics, by contrast, have proven effective by measuring CLIP cross-modal similarity. Nonetheless, CLIP-based metrics score only global image-text compatibility, so they often fail to detect local textual hallucinations and are insensitive to small visual objects. Moreover, their single-scale designs cannot provide an interpretable evaluation process, such as pinpointing where a caption is mistaken or identifying visual regions that have not been described. To address this, we propose a novel reference-free metric for image captioning evaluation, dubbed Hierarchical Image Captioning Evaluation Score (HICE-S). By detecting local visual regions and textual phrases, HICE-S builds an interpretable hierarchical scoring mechanism that overcomes the single-scale structure of existing reference-free metrics. Comprehensive experiments indicate that the proposed metric achieves state-of-the-art (SOTA) performance on several benchmarks, outperforming existing reference-free metrics such as CLIP-S and PAC-S and reference-based metrics such as METEOR and CIDEr. Moreover, several case studies reveal that the assessment process of HICE-S on detailed captions closely resembles interpretable human judgments. Our code is available at https://github.com/joeyz0z/HICE.
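To make the global-versus-local distinction concrete, here is a minimal sketch in Python. It is not the HICE-S formulation, which this page does not specify. The global score follows the published CLIP-S form, w * max(cos, 0) with w = 2.5. The local part is a hypothetical best-match scheme over region and phrase embeddings, included only for illustration. The function names and the toy vectors are assumptions, and real embeddings would come from a vision-language model such as CLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def global_score(image_emb, caption_emb, w=2.5):
    """CLIP-S-style global score: one number for the whole image-caption pair."""
    return w * max(cosine(image_emb, caption_emb), 0.0)


def local_scores(region_embs, phrase_embs):
    """Hypothetical local matching (an assumption, not the paper's method).

    phrase_support[i]  : best similarity of phrase i to any region;
                         a low value flags a possibly hallucinated phrase.
    region_coverage[j] : best similarity of region j to any phrase;
                         a low value flags a possibly undescribed region.
    Both inputs are assumed to be non-empty lists of vectors.
    """
    phrase_support = [max(cosine(p, r) for r in region_embs) for p in phrase_embs]
    region_coverage = [max(cosine(r, p) for p in phrase_embs) for r in region_embs]
    return phrase_support, region_coverage


# Toy data: two image regions and two caption phrases in a 3-d embedding space.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
phrases = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
support, coverage = local_scores(regions, phrases)
```

In this toy case the second phrase matches no region and the second region matches no phrase, which is the kind of per-phrase and per-region signal that a single global similarity cannot expose.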

Zequn Zeng, Jianqiao Sun, Hao Zhang, Tiansheng Wen, Yudi Su, Yan Xie, Zhengjue Wang, Bo Chen • 2024

Related benchmarks

Task	Dataset	Metric	Result
Image Captioning Evaluation	Composite	Kendall-c Tau_c	58.7	161
Image Captioning Evaluation	Flickr8K-CF	Kendall-b Correlation (tau_b)	38.2	145
Image Captioning Evaluation	Flickr8k Expert	Kendall Tau-c (tau_c)	57.7	114
Correlation with human judgment	Flickr8K-CF	Tau B	38.2	48
Image Captioning Evaluation	Pascal-50S	Accuracy	86.1	44
Image Captioning Evaluation	FOIL	Accuracy (4-ref)	97	33
Hallucination Detection	FOIL	Accuracy (4 Refs)	97	32
Correlation with Human Judgments	Flickr8k Expert	Tau-b Correlation	57.2	19
Image-Text Alignment Evaluation	Flickr8k Expert 36 (test)	Tau-c	56.4	9
Image-Text Alignment Evaluation	Pascal-50S 14 (test)	HC	68.6	9
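
Several rows above report Kendall tau-b or tau-c, which measure rank agreement between a metric's scores and human ratings. The sketch below shows how the two variants are computed under their standard textbook definitions. It is a plain-Python illustration, not the evaluation script used for these benchmarks, and the sample ratings are made up.

```python
import math
from itertools import combinations


def kendall_tau(x, y, variant="b"):
    """Kendall rank correlation between two equal-length score lists.

    variant="b": corrects for ties in either list (tau-b).
    variant="c": Stuart's tau-c, suited to ratings with few distinct levels.
    Returns NaN when the correlation is undefined (e.g. one list is constant).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    concordant = discordant = tied_x = tied_y = 0
    for i, j in combinations(range(n), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    if variant == "b":
        n0 = n * (n - 1) // 2  # total number of pairs
        denom = math.sqrt((n0 - tied_x) * (n0 - tied_y))
    elif variant == "c":
        m = min(len(set(x)), len(set(y)))  # smaller number of distinct values
        denom = n * n * (m - 1) / (2 * m) if m else 0.0
    else:
        raise ValueError("variant must be 'b' or 'c'")
    if denom == 0:
        return float("nan")
    return (concordant - discordant) / denom


# Made-up example: metric scores for four captions and coarse human ratings.
metric_scores = [0.2, 0.4, 0.6, 0.9]
human_ratings = [1, 1, 2, 2]
tau_b = kendall_tau(metric_scores, human_ratings, variant="b")
tau_c = kendall_tau(metric_scores, human_ratings, variant="c")
```

With tied human ratings the two variants can differ, which is one reason benchmarks state which variant they report.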

