Text Embeddings by Weakly-Supervised Contrastive Pre-training
About
This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any task requiring a single-vector representation of text, such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. In the zero-shot setting, E5 is the first model to outperform the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
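To illustrate the contrastive training signal described above, here is a minimal NumPy sketch of an InfoNCE-style loss with in-batch negatives, the standard objective for training embedding models on text pairs. The function name, batch size, embedding dimension, and temperature value are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def info_nce_loss(q, p, temperature=0.05):
    """Contrastive loss with in-batch negatives: query q[i] should match
    its paired passage p[i]; the other passages in the batch act as negatives.
    (Temperature 0.05 is an assumed value for illustration.)"""
    # L2-normalize so dot products become cosine similarities
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = q @ p.T / temperature               # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the positive passage for query i sits on the diagonal
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
loss_random = info_nce_loss(q, rng.normal(size=(4, 8)))  # unrelated pairs
loss_aligned = info_nce_loss(q, q)                       # perfectly matched pairs
assert loss_aligned < loss_random
```

Minimizing this loss pulls each query toward its paired passage and pushes it away from the other passages in the batch, which is what lets the resulting single-vector embeddings serve retrieval, clustering, and classification alike.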
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 75 | 1460 |
| Natural Language Inference | RTE | Accuracy | 68.5 | 367 |
| Multi-hop Question Answering | 2WikiMultihopQA | EM | 46.33 | 278 |
| Reading Comprehension | BoolQ | Accuracy | 71 | 219 |
| Natural Language Inference | SNLI | Accuracy | 53.7 | 174 |
| Topic Classification | AG-News | Accuracy | 90.6 | 173 |
| Sentiment Analysis | SST-2 | Accuracy | 92.4 | 156 |
| Commonsense Reasoning | COPA | Accuracy | 84 | 138 |
| Multi-hop Question Answering | MuSiQue | EM | 21.39 | 106 |
| Multi-hop Question Answering | Bamboogle | Exact Match | 44 | 97 |