Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO
About
By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress on representation learning. Unfortunately, these datasets have limited cross-modal associations: images are not paired with other images, captions are paired only with other captions of the same image, there are no negative associations, and many positive cross-modal associations are missing. This undermines research into how inter-modality learning impacts intra-modality tasks. We address this gap with Crisscrossed Captions (CxC), an extension of the MS-COCO dataset with human semantic similarity judgments for 267,095 intra- and inter-modality pairs. We report baseline results on CxC for strong existing unimodal and multimodal models. We also evaluate a multitask dual encoder trained on both image-caption and caption-caption pairs, which crucially demonstrates CxC's value for measuring the influence of intra- and inter-modality learning.
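Because a dual encoder maps images and captions into one shared embedding space, a single similarity function can score all of CxC's pair types: caption-caption (STS), image-image (SIS), and image-caption (SITS). Below is a minimal sketch of that scoring step, assuming precomputed encoder outputs; the array names and embedding size are illustrative, not the paper's code:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Stand-ins for the image and caption encoder outputs.
image_emb = np.random.randn(4, 512)
caption_emb = np.random.randn(4, 512)

sits_scores = cosine_similarity(image_emb, caption_emb)    # intermodal (image-text)
sis_scores = cosine_similarity(image_emb, image_emb)       # intramodal (image-image)
sts_scores = cosine_similarity(caption_emb, caption_emb)   # intramodal (text-text)
```

Model scores like these are then compared against CxC's human similarity judgments, or ranked for the retrieval tasks.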
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic Similarity Ranking | Crisscrossed Captions (test) | SITS | 61.9 | 11 |
| Semantic Similarity | Crisscrossed Captions (CxC) | Mean Average | 74.5 | 10 |
| Image-to-Text Retrieval | Crisscrossed Captions (CxC) | R@1 | 55.9 | 10 |
| Text-to-Text Retrieval | Crisscrossed Captions (CxC) | R@1 | 42.4 | 10 |
| Image-to-Image Retrieval | Crisscrossed Captions (CxC) | R@1 | 38.5 | 10 |
| Semantic Image Similarity | CxC | Average Similarity Score | 74.5 | 8 |
| Semantic Image-Text Similarity | CxC | Average Score | 61.9 | 8 |
| Text-to-Image Retrieval | Crisscrossed Captions (CxC) | R@1 | 41.7 | 5 |
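The retrieval rows above report R@1: the fraction of queries whose single highest-scoring candidate is a correct match. Since CxC adds positive associations beyond MS-COCO's original five captions per image, a query can have multiple gold candidates. A minimal sketch of the metric, assuming a precomputed similarity matrix and a boolean relevance matrix (both names are illustrative):

```python
import numpy as np

def recall_at_1(sim: np.ndarray, gold: np.ndarray) -> float:
    """Fraction of queries whose top-ranked candidate is a gold match.

    sim:  (num_queries, num_candidates) similarity scores
    gold: (num_queries, num_candidates) boolean matrix of positive pairs
    """
    top1 = sim.argmax(axis=1)                     # best candidate per query
    hits = gold[np.arange(sim.shape[0]), top1]    # was it a positive pair?
    return float(hits.mean())
```

The same similarity matrix feeds both evaluations: ranked for retrieval here, and correlated with the human judgments for the semantic similarity tasks.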