BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model
About
Image captioning evaluation remains a significant challenge as vision-language models evolve toward more demanding capabilities, such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics either incur the high computational cost of using Large Language Models (LLMs) as judges, or suffer from the limitations of standard CLIP-based encoders: strict token limits, limited fine-grained sensitivity, and poor compositional generalization caused by treating captions as "bags of words." We propose a new learned metric that tackles these challenges, based on a lightweight cross-encoder initialized from a visual question-answering model checkpoint, which balances a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations that enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Captioning Evaluation | Flickr8K-CF | Kendall tau_b | 39.6 | 115 |
| Image Captioning Evaluation | Nebula | Kendall tau_c | 56.7 | 47 |
| Compositional Reasoning | VALSE | Average Score | 89.2 | 44 |
| Vision-Language Compositional Reasoning | Winoground 1.0 (test) | Text Score | 55.5 | 23 |
| Hallucination Detection | SugarCrepe 1.0 (test) | Avg-M Score | 88.7 | 18 |
| Object Hallucination Detection | nocaps-FOIL (Out-Domain) | AP | 87 | 17 |
| Visio-Linguistic Compositional Perception | LongCapVLCP 1.0 (test) | Micro Avg ACC | 90.5 | 17 |
| Object Hallucination Detection | nocaps-FOIL (Overall) | AP | 87.3 | 17 |
| Object Hallucination Detection | nocaps-FOIL (In-Domain) | AP | 87.9 | 17 |
| Object Hallucination Detection | nocaps-FOIL (Near-Domain) | AP | 87.5 | 17 |
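The first two rows report Kendall rank correlations between metric scores and human judgments. As a rough illustration of what those numbers measure, the sketch below computes the standard tau_b and tau_c statistics in plain Python. The function name `kendall_tau` and the toy inputs are hypothetical and are not taken from the paper or its evaluation code; the table values appear to be the coefficient scaled by 100, which is an assumption based on common practice for these benchmarks.

```python
from itertools import combinations
from math import sqrt


def kendall_tau(metric_scores, human_ratings, variant="b"):
    """Kendall rank correlation between two equal-length score lists.

    Hypothetical helper for illustration only. variant="b" corrects for
    ties in either list; variant="c" (Stuart's tau-c) normalizes by the
    smaller number of distinct values.
    """
    if len(metric_scores) != len(human_ratings):
        raise ValueError("score lists must have the same length")

    n = len(metric_scores)
    concordant = discordant = ties_x = ties_y = 0

    # Compare every pair of items once (O(n^2), fine for a small sketch).
    for (x1, y1), (x2, y2) in combinations(zip(metric_scores, human_ratings), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: ignored by both variants
        elif dx == 0:
            ties_x += 1  # tied only in the metric scores
        elif dy == 0:
            ties_y += 1  # tied only in the human ratings
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1

    if variant == "b":
        denom = sqrt(
            (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
        )
        return (concordant - discordant) / denom if denom else float("nan")

    if variant == "c":
        m = min(len(set(metric_scores)), len(set(human_ratings)))
        if m < 2:
            return float("nan")
        return 2 * (concordant - discordant) / (n * n * (m - 1) / m)

    raise ValueError("variant must be 'b' or 'c'")


# Toy example: four captions, metric scores vs. 1-4 human ratings.
scores = [0.1, 0.4, 0.35, 0.8]
ratings = [1, 2, 2, 4]
tau_b = kendall_tau(scores, ratings, variant="b")
tau_c = kendall_tau(scores, ratings, variant="c")
```

In practice an existing implementation such as `scipy.stats.kendalltau` would normally be preferred; the pairwise loop here is only meant to make the tie handling visible.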