FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model

About

Most existing image captioning evaluation metrics focus on assigning a single numerical score to a caption by comparing it with reference captions. However, these methods do not provide an explanation for the assigned score. Moreover, reference captions are expensive to acquire. In this paper, we propose FLEUR, an explainable reference-free metric to introduce explainability into image captioning evaluation metrics. By leveraging a large multimodal model, FLEUR can evaluate the caption against the image without the need for reference captions, and provide the explanation for the assigned score. We introduce score smoothing to align as closely as possible with human judgment and to be robust to user-defined grading criteria. FLEUR achieves high correlations with human judgment across various image captioning evaluation benchmarks and reaches state-of-the-art results on Flickr8k-CF, COMPOSITE, and Pascal-50S within the domain of reference-free evaluation metrics. Our source code and results are publicly available at: https://github.com/Yebin46/FLEUR.
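The abstract mentions score smoothing but does not say how it works. One plausible reading is that the single decoded score is replaced by a probability-weighted average over the candidate scores the model could have produced. The sketch below illustrates only that generic idea. The function name `smoothed_score`, the candidate scores, and the probabilities are all hypothetical and are not taken from the FLEUR code.

```python
def smoothed_score(candidate_probs):
    """Probability-weighted average of candidate scores.

    candidate_probs maps each candidate score (for example 0.7, 0.8, 0.9)
    to the probability the model assigns to it. The probabilities are
    renormalised so they sum to 1 over the candidates supplied.
    This is an illustrative assumption, not FLEUR's actual procedure.
    """
    total = sum(candidate_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(score * p for score, p in candidate_probs.items()) / total


# Hypothetical distribution: most mass on 0.8, some on its neighbours.
probs = {0.7: 0.2, 0.8: 0.5, 0.9: 0.3}

raw = max(probs, key=probs.get)   # what greedy decoding alone would report
smooth = smoothed_score(probs)    # weighted average over all candidates

print(f"raw={raw}, smoothed={smooth:.2f}")
```

Under this reading, a smoothed score can take values between the discrete grades, which is one way a metric might track fine-grained human ratings more closely than a single decoded grade.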

Yebin Lee, Imseong Park, Myungjoo Kang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Captioning Evaluation | Composite | Kendall-c Tau_c | 65.7 | 131 |
| Image Captioning Evaluation | Flickr8K-CF | Kendall-b Correlation (tau_b) | 39 | 115 |
| Image Captioning Evaluation | Flickr8k Expert | Kendall Tau-c (tau_c) | 53 | 82 |
| Image Captioning Evaluation | Flickr8K Expert (test) | Kendall tau_c | 53 | 76 |
| Image Captioning Evaluation | Pascal-50S (test) | HC | 68 | 66 |
| Image Captioning Evaluation | Flickr8K-CF (test) | Kendall tau_b | 38.8 | 65 |
| Correlation with human judgment | Flickr8K-CF | Tau B | 38.8 | 48 |
| Image Captioning Evaluation | Nebula | Kendall tau_c | 52.7 | 47 |
| Hallucination Detection | BRACE Hallucination 1.0 (test) | AudioCaps Score | 98.48 | 46 |
| Compositional Reasoning | VALSE | Average Score | 87.7 | 44 |
Showing 10 of 68 rows
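
Most rows above report Kendall tau_b or tau_c, which measure how well a metric's ranking of captions agrees with human ratings (the listed values are presumably scaled by 100). The sketch below computes both variants from their standard definitions in plain Python. The function name `kendall_tau` and the sample scores are illustrative and are not drawn from any of the benchmarks.

```python
import math
from itertools import combinations


def kendall_tau(x, y):
    """Return (tau_b, tau_c) for two equal-length score sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")

    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2 and y1 == y2:
            continue                      # tied in both: enters no count
        if x1 == x2:
            tied_x_only += 1
        elif y1 == y2:
            tied_y_only += 1
        elif (x1 - x2) * (y1 - y2) > 0:
            concordant += 1
        else:
            discordant += 1

    diff = concordant - discordant
    # tau_b: normalise by the pairs that are untied in x and untied in y.
    untied_x = concordant + discordant + tied_y_only
    untied_y = concordant + discordant + tied_x_only
    # tau_c: normalise using the smaller number of distinct values.
    m = min(len(set(x)), len(set(y)))
    if untied_x == 0 or untied_y == 0 or m < 2:
        raise ValueError("tau is undefined when one sequence is constant")

    tau_b = diff / math.sqrt(untied_x * untied_y)
    tau_c = 2 * diff / (n * n * (m - 1) / m)
    return tau_b, tau_c


# Illustrative data: metric scores versus human ratings for five captions.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.7]
human_ratings = [4, 2, 3, 1, 4]
tau_b, tau_c = kendall_tau(metric_scores, human_ratings)
print(f"tau_b={tau_b:.3f}, tau_c={tau_c:.3f}")
```

The two variants differ only in how they handle ties: tau_b discounts tied pairs in the denominator, while tau_c adjusts for the number of distinct rating levels, which matters when human ratings sit on a short ordinal scale.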
