
DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning

About

Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.

Nakamasa Inoue, Kanoko Goto, Masanari Oi, Martyna Gruszka, Mahiro Ukai, Takumi Hirose, Yusuke Sekikawa • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Image Captioning Evaluation | Composite | Kendall tau-c | 66 | 92 |
| Image Captioning Evaluation | Flickr8k Expert | Kendall tau-c | 57.5 | 73 |
| Image Captioning Evaluation | Flickr8K-CF | Kendall tau-b | 40.5 | 62 |
| Image Captioning Evaluation | Pascal-50S | Mean Score | 87.8 | 39 |
| Image Captioning Evaluation | MCEval 1.0 (test) | Real Style Score | 87.8 | 12 |
| Hallucination Detection | FOIL (1-ref) | Accuracy | 98.2 | 6 |
| Hallucination Detection | FOIL (4-ref) | Accuracy | 98.2 | 6 |
