BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

About

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.

Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara • 2024

Related benchmarks

Task	Dataset	Result
Image Captioning Evaluation	Composite	Kendall tau-c 57.2	131
Image Captioning Evaluation	Flickr8K-CF	Kendall tau-b 36.3	115
Image Captioning Evaluation	Flickr8k Expert	Kendall tau-c 55.8	82
Image Captioning Evaluation	Pascal-50S	Accuracy 82.9	44
Image Captioning Evaluation	FOIL	--	10
Image-Text Alignment Evaluation	Flickr8k Expert 36 (test)	Kendall tau-c 55.8	9
Image-Text Alignment Evaluation	Composite 37 (test)	Kendall tau-c 57.2	9
Image-Text Alignment Evaluation	Pascal-50S 14 (test)	HC 61.2	9
Image-Text Alignment Evaluation	Flickr8k CrowdFlower 36	Kendall tau-b 36.3	8
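
The Kendall figures in the table measure how well a metric's scores rank captions in the same order as human ratings. As a rough illustration only, the sketch below computes Kendall's tau-b in plain Python. The function name `kendall_tau_b` and the toy scores are assumptions made for this example and are not taken from the BRIDGE code base; the table values are presumably the same kind of correlation scaled by 100.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (hypothetical helper)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_only_x = ties_only_y = 0
    # Compare every unordered pair of items once.
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to no term
        if dx == 0:
            ties_only_x += 1
        elif dy == 0:
            ties_only_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_only_x) * (untied + ties_only_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data (made up): human ratings and metric scores for four captions.
human_ratings = [1, 2, 2, 3]
metric_scores = [0.1, 0.4, 0.3, 0.9]
tau = kendall_tau_b(human_ratings, metric_scores)
print(f"tau-b = {tau:.4f} (x100 = {100 * tau:.1f})")
```

Tau-b corrects for ties in either list, which matters when human ratings come from a small discrete scale. Tau-c uses a different normalisation based on the number of distinct rating levels and is not covered by this sketch.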

