MCSE: Multimodal Contrastive Learning of Sentence Embeddings
About
Learning semantically meaningful sentence embeddings is an open problem in natural language processing. In this work, we propose a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective. Through experiments on a variety of semantic textual similarity tasks, we demonstrate that our approach consistently improves the performance across various datasets and pre-trained encoders. In particular, combining a small amount of multimodal data with a large text-only corpus, we improve the state-of-the-art average Spearman's correlation by 1.7%. By analyzing the properties of the textual embedding space, we show that our model excels in aligning semantically similar sentences, providing an explanation for its improved performance.
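The multimodal contrastive objective described above pairs sentences with both textual and visual positives. Below is a minimal NumPy sketch of an InfoNCE-style loss of that kind, assuming pre-computed sentence and image feature matrices; the function names, the `temperature` value, and the weighting parameter `lam` are illustrative and the paper's exact formulation may differ:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.05):
    """InfoNCE loss over a batch: each anchor's positive is the
    same-index row of `positives`; all other rows act as in-batch negatives."""
    # L2-normalize so dot products become cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sim = a @ p.T / temperature                        # (N, N) similarity matrix
    # Log-softmax over each row, then cross-entropy on the diagonal.
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def multimodal_loss(text_a, text_b, image_feats, lam=1.0):
    """Combine a text-text contrastive term (e.g. two augmented views of the
    same sentence) with a sentence-image contrastive term."""
    return info_nce(text_a, text_b) + lam * info_nce(text_a, image_feats)
```

Because the loss is a cross-entropy on the similarity matrix's diagonal, perfectly aligned pairs (anchor equals positive) drive it toward zero, while random pairings yield a loss near log N for a batch of size N.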
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 22.5 | 423 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1 | 16.7 | 370 |
| Semantic Textual Similarity | STS tasks (STS12, STS13, STS14, STS15, STS16, STS-B, SICK-R) | STS12 Score | 71.7 | 195 |
| Transfer Learning | SentEval Transfer Learning Tasks (test) | MR | 82.82 | 52 |
| Sentence Embedding Evaluation | MTEB (test) | Re-Rank Score | 46.92 | 48 |