Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

About

Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models. Recent data-driven metrics have demonstrated a stronger correlation with human judgments than classic metrics such as CIDEr; however, they lack sufficient capabilities to handle hallucinations and to generalize across diverse images and texts, partially because they compute scalar similarities merely using embeddings learned from tasks unrelated to image captioning evaluation. In this study, we propose Polos, a supervised automatic evaluation metric for image captioning models. Polos computes scores from multimodal inputs, using a parallel feature extraction mechanism that leverages embeddings trained through large-scale contrastive learning. To train Polos, we introduce Multimodal Metric Learning from Human Feedback (M$^2$LHF), a framework for developing metrics based on human feedback. We constructed the Polaris dataset, which comprises 131K human judgments from 550 evaluators and is approximately ten times larger than standard datasets. Our approach achieved state-of-the-art performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and the Polaris dataset, thereby demonstrating its effectiveness and robustness.

Yuiga Wada, Kanta Kaneda, Daichi Saito, Komei Sugiura • 2024

Related benchmarks

Task	Dataset	Metric	Value
Image Captioning Evaluation	Composite	Kendall-c Tau_c	57.6	161
Image Captioning Evaluation	Flickr8K-CF	Kendall-b Correlation (tau_b)	37.8	145
Image Captioning Evaluation	Flickr8k Expert	Kendall Tau-c (tau_c)	56.4	114
Image Captioning Evaluation	Flickr8K Expert (test)	Kendall tau_c	56.4	76
Image Captioning Evaluation	Pascal-50S (test)	HC	70	66
Image Captioning Evaluation	Nebula	Kendall tau_c	55	66
Image Captioning Evaluation	Flickr8K-CF (test)	Kendall tau_b	37.8	65
Correlation with human judgment	Flickr8K-CF	Tau B	37.8	48
Image Captioning Evaluation	Pascal-50S	Accuracy	86.5	44
Image Captioning Evaluation	FOIL	Accuracy (4-ref)	95.4	33

Showing 10 of 31 rows
