
MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation

About

Evaluating image captions without references remains challenging because global embedding similarity often misses fine-grained mismatches such as hallucinated objects, missing attributes, or incorrect relations. We propose MSD-Score, a reference-free metric that models image patch and text token embeddings as von Mises-Fisher mixtures on the unit hypersphere. Instead of treating each modality as a single point, MSD-Score formulates image-text matching as a multi-scale distributional scoring problem. Semantic discrepancies are quantified via a weighted bi-directional KL divergence and combined with global similarity in a multi-scale framework for both single- and multi-candidate evaluations. Extensive experiments show that MSD-Score achieves state-of-the-art correlation with human judgments among reference-free metrics. Beyond accuracy, its probabilistic formulation yields transparent and decomposable diagnostics of local grounding errors, providing a deterministic complementary signal to holistic similarity metrics and judge-based evaluators.
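
The scoring idea described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes equal-weight mixture components, a single shared concentration `kappa`, a discretized KL computed over the pooled embeddings as anchor points, and a simple linear blend with mean-embedding cosine similarity. The function names and the parameters `kappa`, `alpha`, and `lam` are hypothetical, chosen only for illustration.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mixture_over_anchors(centers, anchors, kappa):
    """Equal-weight vMF mixture (shared kappa, assumed) evaluated at the
    anchors, then renormalized into a discrete distribution over them.
    The vMF normalizing constant cancels because kappa is shared."""
    dens = [
        sum(math.exp(kappa * (dot(c, a) - 1.0)) for c in centers) / len(centers)
        for a in anchors
    ]
    total = sum(dens)
    return [d / total for d in dens]

def kl(p, q, eps=1e-12):
    """Discrete KL divergence KL(p || q) with a small epsilon for stability."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def sketch_score(patches, tokens, kappa=10.0, alpha=0.5, lam=0.5):
    """Hypothetical blend of global similarity and a local distributional term."""
    patches = [normalize(p) for p in patches]
    tokens = [normalize(t) for t in tokens]
    anchors = patches + tokens  # assumption: pooled embeddings as anchors

    p = mixture_over_anchors(patches, anchors, kappa)
    q = mixture_over_anchors(tokens, anchors, kappa)

    # Weighted bi-directional KL: image-to-text and text-to-image.
    divergence = alpha * kl(p, q) + (1.0 - alpha) * kl(q, p)

    # Global similarity: cosine of the mean embeddings (assumes non-zero means).
    g_img = normalize([sum(col) / len(patches) for col in zip(*patches)])
    g_txt = normalize([sum(col) / len(tokens) for col in zip(*tokens)])
    global_sim = dot(g_img, g_txt)

    # Map the divergence into (0, 1] and blend it with the global term.
    return lam * global_sim + (1.0 - lam) * math.exp(-divergence)

matched = sketch_score([[1, 0], [0, 1]], [[1, 0], [0, 1]])
mismatched = sketch_score([[1, 0], [0, 1]], [[1, 0], [0, -1]])
print(matched, mismatched)
```

In this toy setup, a caption whose token embeddings coincide with the patch embeddings gets zero divergence and the maximum score, while a caption with one token pointing away from every patch scores lower. The per-anchor terms of the KL sum indicate which embeddings contribute most to the mismatch, which is the kind of decomposable signal the abstract refers to.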

Shichao Kan, Xuyang Zhang, Haojie Zhang, Zhe Zhu, Yigang Cen, Yixiong Liang, Lianlei Shan, Linna Zhang, Zhe Qu, Jiazhi Xia • 2026

Related benchmarks

Task                                    Dataset                    Result                  Rank
Image-Text Alignment Evaluation         Flickr8k Expert 36 (test)  Tau-c 58.3              9
Image-Text Alignment Evaluation         Composite 37 (test)        Kendall's Tau-c 64.2    9
Image-Text Alignment Evaluation         Pascal-50S 14 (test)       HC 69.4                 9
Image-Text Alignment Evaluation         Flickr8k CrowdFlower 36    Kendall's Tau_b 39.4    8
Factual mistake detection               DocENT (PoSh)              Accuracy 64.8           6
Counterfactual Hallucination Detection  ROCO v2                    Pairwise Accuracy 74.9  3
Counterfactual Hallucination Detection  RSICD                      Pairwise Accuracy 80.5  3
Omission Detection                      DocENT (PoSh)              Accuracy 62.1           3