
Contrastive Learning of Sentence Embeddings from Scratch

About

Contrastive learning has been the dominant approach to training state-of-the-art sentence embeddings. Previous studies have typically learned sentence embeddings either from human-annotated natural language inference (NLI) data or from large-scale unlabeled sentences in an unsupervised manner. However, even unlabeled sentences can be difficult to acquire in certain domains. To address these issues, we present SynCSE, a contrastive learning framework that trains sentence embeddings with synthesized data. Specifically, we explore utilizing large language models to synthesize the data samples required for contrastive learning, including (1) producing positive and negative annotations for given unlabeled sentences (SynCSE-partial), and (2) generating sentences along with their corresponding annotations from scratch (SynCSE-scratch). Experimental results on sentence similarity and reranking tasks indicate that both SynCSE-partial and SynCSE-scratch greatly outperform unsupervised baselines, and SynCSE-partial even achieves performance comparable to supervised models in most settings.

Junlei Zhang, Zhenzhong Lan, Junxian He • 2023

Related benchmarks

Task                        | Dataset                                                       | Result               | Rank
Semantic Textual Similarity | STS tasks (STS12, STS13, STS14, STS15, STS16, STS-B, SICK-R), various (test) | STS12 Score: 76.15 | 412
Sentence Classification     | SentEval Transfer tasks (test)                                | MR: 87.42            | 73
Reranking                   | AskUbuntu (test)                                              | MAP: 55.22           | 16
Reranking                   | MindSmall (test)                                              | MAP: 30.56           | 16
Reranking                   | SCIDOCS (test)                                                | MAP: 71.33           | 16
Reranking                   | StackOverflow (test)                                          | MAP: 40.06           | 16
Semantic Textual Similarity | Multilingual STS, low-resource languages                      | STS Score (Afr): 81.4 | 12
QA retrieval                | Indic QA Retrieval                                            | nDCG@10 (Hin): 52.3  | 12
Retrieval                   | MIRACL Hard Negative                                          | nDCG@10 (Hindi): 16.7 | 12
Retrieval                   | Belebele                                                      | Afr Score: 84.1      | 12

Showing 10 of 11 rows
