DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning
About
Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Captioning Evaluation | Composite | Kendall tau-c | 66 | 92 |
| Image Captioning Evaluation | Flickr8k Expert | Kendall tau-c | 57.5 | 73 |
| Image Captioning Evaluation | Flickr8K-CF | Kendall tau-b | 40.5 | 62 |
| Image Captioning Evaluation | Pascal-50S | Mean Score | 87.8 | 39 |
| Image Captioning Evaluation | MCEval 1.0 (test) | Real Style Score | 87.8 | 12 |
| Hallucination Detection | FOIL 1-ref | Accuracy | 98.2 | 6 |
| Hallucination Detection | FOIL (4-ref) | Accuracy | 98.2 | 6 |