Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

About

Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for captions evaluation but also for the generation phase. Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically employed to fine-tune captioning models. Extensive experiments on different image and video datasets highlight the effectiveness of PAC-S++ compared to popular metrics for the task, including its sensitivity to object hallucinations. Furthermore, we show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors. Evaluations on out-of-domain benchmarks further demonstrate the efficacy of our fine-tuning approach in enhancing model capabilities. Source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.

Sara Sarto, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara• 2024

Related benchmarks

Task	Dataset	Result
Image Captioning Evaluation	Composite	Kendall-c Tau_c62	131
Image Captioning Evaluation	Flickr8K-CF	Kendall-b Correlation (tau_b)38.8	115
Image Captioning Evaluation	Flickr8k Expert	Kendall Tau-c (tau_c)57.9	82
Correlation with human judgment	Flickr8K-CF	Tau B38.8	48
Image Captioning Evaluation	Nebula	Kendall tau_c50.6	47
Compositional Reasoning	VALSE	Average Score73.8	44
Image Captioning Evaluation	Pascal-50S	Accuracy84.7	44
Hallucination Detection	FOIL	Accuracy (4 Refs)94.1	32
Vision-Language Compositional Reasoning	Winoground 1.0 (test)	Text Score31.8	23
Correlation with Human Judgments	Flickr8k Expert	Tau-b Correlation55.3	19

Showing 10 of 30 rows

Other info

Follow for update

@wizwand_team Discord