Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO
About
By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress on representation learning. Unfortunately, these datasets have limited cross-modal associations: images are not paired with other images, captions are paired only with other captions of the same image, there are no negative associations, and many positive cross-modal associations are missing. This undermines research into how inter-modality learning impacts intra-modality tasks. We address this gap with Crisscrossed Captions (CxC), an extension of the MS-COCO dataset with human semantic similarity judgments for 267,095 intra- and inter-modality pairs. We report baseline results on CxC for strong existing unimodal and multimodal models. We also evaluate a multitask dual encoder trained on both image-caption and caption-caption pairs, which crucially demonstrates CxC's value for measuring the influence of intra- and inter-modality learning.
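Because a dual encoder maps images and captions into one shared embedding space, a single similarity function can score all of CxC's pair types: caption-caption (STS), image-image (SIS), and image-caption (SITS). Below is a minimal sketch of that scoring step, assuming precomputed encoder outputs; the array names and embedding size are illustrative, not the paper's code:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Stand-ins for the image and caption encoder outputs.
image_emb = np.random.randn(4, 512)
caption_emb = np.random.randn(4, 512)

sits_scores = cosine_similarity(image_emb, caption_emb)    # intermodal (image-text)
sis_scores = cosine_similarity(image_emb, image_emb)       # intramodal (image-image)
sts_scores = cosine_similarity(caption_emb, caption_emb)   # intramodal (text-text)
```

Model scores like these are then compared against CxC's human similarity judgments, or ranked for the retrieval tasks.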
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic Similarity Ranking | Crisscrossed Captions (test) | SITS | 61.9 | 11 |
| Semantic Similarity | Crisscrossed Captions (CxC) | Mean Average | 74.5 | 10 |
| Image-to-Text Retrieval | Crisscrossed Captions (CxC) | R@1 | 55.9 | 10 |
| Text-to-Text Retrieval | Crisscrossed Captions (CxC) | R@1 | 42.4 | 10 |
| Image-to-Image Retrieval | Crisscrossed Captions (CxC) | R@1 | 38.5 | 10 |
| Semantic Image Similarity | CxC | Average Similarity Score | 74.5 | 8 |
| Semantic Image-Text Similarity | CxC | Average Score | 61.9 | 8 |
| Text-to-Image Retrieval | Crisscrossed Captions (CxC) | R@1 | 41.7 | 5 |
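The retrieval rows above report R@1: the fraction of queries whose single highest-scoring candidate is a correct match. Since CxC adds positive associations beyond MS-COCO's original five captions per image, a query can have multiple gold candidates. A minimal sketch of the metric, assuming a precomputed similarity matrix and a boolean relevance matrix (both names are illustrative):

```python
import numpy as np

def recall_at_1(sim: np.ndarray, gold: np.ndarray) -> float:
    """Fraction of queries whose top-ranked candidate is a gold match.

    sim:  (num_queries, num_candidates) similarity scores
    gold: (num_queries, num_candidates) boolean matrix of positive pairs
    """
    top1 = sim.argmax(axis=1)                     # best candidate per query
    hits = gold[np.arange(sim.shape[0]), top1]    # was it a positive pair?
    return float(hits.mean())
```

The same similarity matrix feeds both evaluations: ranked for retrieval here, and correlated with the human judgments for the semantic similarity tasks.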