MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation
About
Evaluating image captions without references remains challenging because global embedding similarity often misses fine-grained mismatches such as hallucinated objects, missing attributes, or incorrect relations. We propose MSD-Score, a reference-free metric that models image patch and text token embeddings as von Mises-Fisher mixtures on the unit hypersphere. Instead of treating each modality as a single point, MSD-Score formulates image-text matching as a multi-scale distributional scoring problem. Semantic discrepancies are quantified via a weighted bi-directional KL divergence and combined with global similarity in a multi-scale framework for both single- and multi-candidate evaluations. Extensive experiments show that MSD-Score achieves state-of-the-art correlation with human judgments among reference-free metrics. Beyond accuracy, its probabilistic formulation yields transparent and decomposable diagnostics of local grounding errors, providing a deterministic complementary signal to holistic similarity metrics and judge-based evaluators.
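The scoring idea described above can be illustrated with a minimal sketch. This is not the authors' implementation; everything below is an assumption made for illustration. It treats each patch or token embedding as one von Mises-Fisher component with uniform mixture weights and a shared concentration `kappa`. It approximates the mixture KL divergence with the Hershey-Olsen variational formula, and it combines the weighted bi-directional divergence with mean-pooled cosine similarity through hypothetical parameters `alpha` and `lam`. For two vMF components with equal concentration, the exact KL divergence is proportional to `kappa * (1 - cos)`; the constant factor that depends only on `kappa` and the dimension is omitted here.

```python
import numpy as np


def _normalize(x):
    # Project embeddings onto the unit hypersphere.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def _pairwise_vmf_kl(mu_p, mu_q, kappa):
    # KL between equal-concentration vMF components, up to a constant
    # factor depending only on kappa and the dimension (omitted; assumption).
    return kappa * (1.0 - mu_p @ mu_q.T)


def _log_sum_exp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def mixture_kl(mu_p, w_p, mu_q, w_q, kappa):
    # Variational approximation of KL(p || q) between two mixtures
    # (a stand-in chosen for this sketch, not taken from the source).
    self_term = _log_sum_exp(np.log(w_p)[None, :] - _pairwise_vmf_kl(mu_p, mu_p, kappa), axis=1)
    cross_term = _log_sum_exp(np.log(w_q)[None, :] - _pairwise_vmf_kl(mu_p, mu_q, kappa), axis=1)
    return float(w_p @ (self_term - cross_term))


def toy_score(patches, tokens, kappa=10.0, alpha=0.5, lam=0.1):
    # Hypothetical combination: global cosine similarity minus a weighted
    # bi-directional divergence between the two local mixtures.
    mu_i, mu_t = _normalize(patches), _normalize(tokens)
    w_i = np.full(len(mu_i), 1.0 / len(mu_i))
    w_t = np.full(len(mu_t), 1.0 / len(mu_t))
    divergence = (alpha * mixture_kl(mu_i, w_i, mu_t, w_t, kappa)
                  + (1.0 - alpha) * mixture_kl(mu_t, w_t, mu_i, w_i, kappa))
    global_sim = float(_normalize(mu_i.mean(axis=0)) @ _normalize(mu_t.mean(axis=0)))
    return global_sim - lam * divergence
```

In this toy setup, passing the same array as both `patches` and `tokens` gives zero divergence and a score of approximately 1, while local mismatches increase the divergence term and lower the score. Real patch and token embeddings would come from a vision-language encoder, which the sketch does not cover.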
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image-Text Alignment Evaluation | Flickr8k Expert 36 (test) | Tau-c | 58.3 | 9 |
| Image-Text Alignment Evaluation | Composite 37 (test) | Kendall's Tau-c | 64.2 | 9 |
| Image-Text Alignment Evaluation | Pascal-50S 14 (test) | HC | 69.4 | 9 |
| Image-Text Alignment Evaluation | Flickr8k CrowdFlower 36 | Kendall's Tau_b | 39.4 | 8 |
| Factual mistake detection | DocENT (PoSh) | Accuracy | 64.8 | 6 |
| Counterfactual Hallucination Detection | ROCO v2 | Pairwise Accuracy | 74.9 | 3 |
| Counterfactual Hallucination Detection | RSICD | Pairwise Accuracy | 80.5 | 3 |
| Omission Detection | DocENT (PoSh) | Accuracy | 62.1 | 3 |