
LLM-Free Image Captioning Evaluation in Reference-Flexible Settings

About

We focus on the automatic evaluation of image captions in both reference-based and reference-free settings. Existing metrics based on large language models (LLMs) favor their own generations, so their neutrality is questionable. Most LLM-free metrics do not suffer from this issue, but they do not always achieve high performance. To address these issues, we propose Pearl, an LLM-free supervised metric for image captioning that is applicable to both reference-based and reference-free settings. We introduce a novel mechanism that learns representations of image–caption and caption–caption similarities. Furthermore, we construct a human-annotated dataset for image captioning metrics, which comprises approximately 333k human judgments collected from 2,360 annotators across over 75k images. Pearl outperformed existing LLM-free metrics on the Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and FOIL datasets in both reference-based and reference-free settings. Our project page is available at https://pearl.kinsta.page/.

Shinnosuke Hirano, Yuiga Wada, Kazuki Matsuda, Seitaro Otsuki, Komei Sugiura • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Image Captioning Evaluation | Composite | Kendall tau_c | 60.4 | 92 |
| Image Captioning Evaluation | Flickr8K-Expert | Kendall tau_c | 58.6 | 73 |
| Image Captioning Evaluation | Flickr8K-CF | Kendall tau_b | 38.6 | 62 |
| Hallucination Detection | FOIL | Accuracy (4 refs) | 97.2 | 32 |
| Image Captioning Evaluation | Nebula | Kendall tau_c | 55.4 | 22 |
| Image Captioning Evaluation | FOIL | Accuracy (1 ref) | 96.7 | 6 |
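The Kendall correlations above measure how well a metric's scores agree with human rankings of captions. As a minimal illustration of that protocol (not code from the paper, and with purely illustrative scores), the following pure-Python sketch computes Kendall's tau-b between hypothetical metric scores and human judgments:

```python
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two score lists.

    Counts concordant and discordant pairs with a correction for
    ties, as used to benchmark captioning metrics against human
    judgments (tau_b on Flickr8K-CF, tau_c on the other datasets).
    """
    concordant = discordant = ties_x = ties_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from tau-b
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = ((concordant + discordant + ties_x)
             * (concordant + discordant + ties_y)) ** 0.5
    return (concordant - discordant) / denom

# Toy example with hypothetical scores (not from the paper):
metric_scores = [0.91, 0.35, 0.78, 0.12, 0.55]
human_scores = [5, 2, 4, 1, 3]
tau = kendall_tau_b(metric_scores, human_scores)  # 1.0: perfect rank agreement
```

A higher tau means the metric orders caption pairs the same way human annotators do; tau-c additionally adjusts for rating scales with differing numbers of levels.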
