EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations

About

Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at https://github.com/hjkim811/EXPERT.

Hyunjong Kim, Sangyeop Kim, Jongheon Jeong, Yeongjae Cho, Sungzoon Cho • 2025

Related benchmarks

Task	Dataset	Metric	Result
Image Captioning Evaluation	Composite	Kendall-c Tau_c	65	161
Image Captioning Evaluation	Flickr8K-CF	Kendall-b Correlation (tau_b)	39.3	145
Image Captioning Evaluation	Flickr8k Expert	Kendall Tau-c (tau_c)	56.7	114
Image Captioning Evaluation	Nebula	Kendall tau_c	54.9	66
Compositional Reasoning	VALSE	Average Score	85.2	65
Vision-Language Compositional Reasoning	Winoground 1.0 (test)	Text Score	40	23
Hallucination Detection	SugarCrepe 1.0 (test)	Avg-M Score	89.7	18
Object Hallucination Detection	nocaps-FOIL (Overall)	AP	91.1	17
Object Hallucination Detection	nocaps FOIL In-Domain	AP	88.8	17
Object Hallucination Detection	nocaps-FOIL (Near-Domain)	AP	92.6	17

Showing 10 of 21 rows
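
The captioning benchmarks listed above report Kendall tau_b and tau_c, which measure how closely a metric's scores agree in rank order with human judgments. The following is a minimal pure-Python sketch of the standard textbook definitions, for illustration only. The function name kendall_tau and the sample scores are hypothetical and are not taken from the EXPERT codebase.

```python
from itertools import combinations
from math import sqrt


def kendall_tau(metric_scores, human_scores, variant="b"):
    """Kendall rank correlation between two equal-length score lists.

    variant="b": tau-b, which adjusts the denominator for tied pairs.
    variant="c": Stuart's tau-c, which normalizes by the smaller number
                 of distinct values found in the two lists.
    Returns NaN when the correlation is undefined (e.g. constant input).
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(metric_scores, human_scores), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1  # pair tied on the metric score
        if dy == 0:
            tied_y += 1  # pair tied on the human score
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1

    diff = concordant - discordant
    if variant == "b":
        total_pairs = n * (n - 1) // 2
        denom = sqrt((total_pairs - tied_x) * (total_pairs - tied_y))
    elif variant == "c":
        m = min(len(set(metric_scores)), len(set(human_scores)))
        denom = n * n * (m - 1) / (2 * m) if m > 0 else 0
    else:
        raise ValueError("variant must be 'b' or 'c'")
    return diff / denom if denom else float("nan")


# Hypothetical example: four captions scored by a metric and by humans.
metric = [0.10, 0.40, 0.35, 0.80]
human = [1, 2, 2, 4]
print(kendall_tau(metric, human, variant="b"))
print(kendall_tau(metric, human, variant="c"))
```

For real evaluations, a tested library routine is usually the safer choice; SciPy's scipy.stats.kendalltau accepts a variant argument for selecting tau-b or tau-c.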
