BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

About

Image captioning evaluation remains a significant challenge as vision-language models evolve toward more demanding capabilities, such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics either incur the extensive computational cost of using Large Language Models (LLMs) as judges, or suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization from treating captions as "bags-of-words." We propose a new learned metric that tackles these challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.

Gonçalo Gomes, Bruno Martins, Chrysoula Zerva • 2026

Related benchmarks

Task | Dataset | Result | Rank
Image Captioning Evaluation | Flickr8K-CF | Kendall-b Correlation (tau_b): 39.6 | 115
Image Captioning Evaluation | Nebula | Kendall tau_c: 56.7 | 47
Compositional Reasoning | VALSE | Average Score: 89.2 | 44
Vision-Language Compositional Reasoning | Winoground 1.0 (test) | Text Score: 55.5 | 23
Hallucination Detection | SugarCrepe 1.0 (test) | Avg-M Score: 88.7 | 18
Object Hallucination Detection | nocaps FOIL (Out-Domain) | AP: 87 | 17
Visio-Linguistic Compositional Perception | LongCapVLCP 1.0 (test) | Micro Avg ACC: 90.5 | 17
Object Hallucination Detection | nocaps-FOIL (Overall) | AP: 87.3 | 17
Object Hallucination Detection | nocaps FOIL In-Domain | AP: 87.9 | 17
Object Hallucination Detection | nocaps-FOIL (Near-Domain) | AP: 87.5 | 17
Showing 10 of 15 rows
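
Several of the results listed here are rank correlations between metric scores and human judgments. As a rough illustration only, the following is a minimal sketch of Kendall's tau-b computed in pure Python. The function name `kendall_tau_b` and its arguments are hypothetical and do not come from the paper, and the listed values appear to be on a 0-100 scale, which is an assumption about how the correlations are reported.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(metric_scores, human_ratings):
    """Kendall's tau-b between two equally long sequences of numbers.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(metric_scores) != len(human_ratings):
        raise ValueError("both sequences must have the same length")

    concordant = discordant = 0
    ties_x_only = ties_y_only = 0

    # Compare every unordered pair of (score, rating) items.
    pairs = zip(metric_scores, human_ratings)
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        dx = x1 - x2
        dy = y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: ignored by tau-b
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1

    # tau-b normalizes by the pairs not tied in x and the pairs not tied in y.
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    if denom == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denom
```

A call such as `kendall_tau_b(scores, ratings) * 100` would put the value on the same apparent scale as the listing. The pairwise loop is quadratic in the number of items, so for large benchmarks an established implementation such as `scipy.stats.kendalltau` is likely the more practical choice.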
