Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation

About

The CLIP model has been recently proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), that in a novel way unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics like CIDEr and SPICE and reference-free metrics like CLIP-Score. Finally, we test the system-level correlation of the proposed metric when considering popular image captioning approaches, and assess the impact of employing different cross-modal features. Our source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.

Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara• 2023

Related benchmarks

Task	Dataset	Result
Image Captioning Evaluation	Composite	Kendall-c Tau_c57.3	131
Image Captioning Evaluation	Flickr8K-CF	Kendall-b Correlation (tau_b)53.9	115
Image Captioning Evaluation	Flickr8k Expert	Kendall Tau-c (tau_c)55.9	82
Image Captioning Evaluation	Flickr8K Expert (test)	Kendall tau_c55.9	76
Image Captioning Evaluation	Pascal-50S (test)	HC67.7	66
Image Captioning Evaluation	Flickr8K-CF (test)	Kendall tau_b37.6	65
Correlation with human judgment	Flickr8K-CF	Tau B37.6	48
Image Captioning Evaluation	Nebula	Kendall tau_c51.9	47
Compositional Reasoning	VALSE	Average Score69.5	44
Image Captioning Evaluation	Pascal-50S	--	44

Showing 10 of 66 rows

Other info

Code

Follow for update

@wizwand_team Discord