Text Embeddings by Weakly-Supervised Contrastive Pre-training
About
This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any task requiring a single-vector representation of text, such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. In the zero-shot setting, E5 is the first model to outperform the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
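To illustrate the contrastive training signal described above, here is a minimal NumPy sketch of an InfoNCE-style loss with in-batch negatives, the standard objective for training embedding models on text pairs. The function name, batch size, embedding dimension, and temperature value are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def info_nce_loss(q, p, temperature=0.05):
    """Contrastive loss with in-batch negatives: query q[i] should match
    its paired passage p[i]; the other passages in the batch act as negatives.
    (Temperature 0.05 is an assumed value for illustration.)"""
    # L2-normalize so dot products become cosine similarities
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = q @ p.T / temperature               # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the positive passage for query i sits on the diagonal
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
loss_random = info_nce_loss(q, rng.normal(size=(4, 8)))  # unrelated pairs
loss_aligned = info_nce_loss(q, q)                       # perfectly matched pairs
assert loss_aligned < loss_random
```

Minimizing this loss pulls each query toward its paired passage and pushes it away from the other passages in the batch, which is what lets the resulting single-vector embeddings serve retrieval, clustering, and classification alike.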
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 75 | 1460 |
| Natural Language Inference | RTE | Accuracy | 68.5 | 367 |
| Multi-hop Question Answering | 2WikiMultihopQA | EM | 46.33 | 278 |
| Reading Comprehension | BoolQ | Accuracy | 71 | 219 |
| Natural Language Inference | SNLI | Accuracy | 53.7 | 174 |
| Topic Classification | AG-News | Accuracy | 90.6 | 173 |
| Sentiment Analysis | SST-2 | Accuracy | 92.4 | 156 |
| Commonsense Reasoning | COPA | Accuracy | 84 | 138 |
| Multi-hop Question Answering | MuSiQue | EM | 21.39 | 106 |
| Multi-hop Question Answering | Bamboogle | Exact Match | 44 | 97 |