Multilingual E5 Text Embeddings: A Technical Report
About
This technical report presents the training methodology and evaluation results for the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a trade-off between inference efficiency and embedding quality. The training procedure follows the English E5 model recipe: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model whose performance is on par with state-of-the-art English-only models of similar size. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5.
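For concreteness, the contrastive pre-training stage described above can be sketched as a standard InfoNCE objective over text pairs with in-batch negatives. The snippet below is a minimal illustration of that technique, not the report's exact configuration: the temperature value and the negative-sampling scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: row i of `passage_emb` is the
    positive for row i of `query_emb`; every other row is a negative."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature                      # (B, B) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)

# Toy usage with random "embeddings" for a batch of 8 text pairs:
loss = info_nce_loss(torch.randn(8, 384), torch.randn(8, 384))
```

At inference time, the released checkpoints follow the E5 convention of prefixing inputs with `query: ` or `passage: `, mean-pooling the last hidden states, and L2-normalizing. A minimal sketch using the Hugging Face `transformers` library with the `intfloat/multilingual-e5-small` checkpoint from the release linked above:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool the token embeddings, ignoring padding positions."""
    masked = last_hidden.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# E5-style prefixes: queries and passages are marked explicitly.
texts = [
    "query: how much protein should a female eat",
    "passage: The average protein requirement for women ages 19 to 70 is 46 grams per day.",
]

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-small")

batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)
score = embeddings[0] @ embeddings[1]   # cosine similarity of query vs. passage
print(float(score))
```

The instruction-tuned variant instead expects the query to carry a natural-language task description rather than the plain `query: ` prefix; see the release page for the exact prompt format.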
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Information Retrieval | BEIR (test) | -- | -- | 76 |
| Information Retrieval | BEIR | -- | -- | 59 |
| Information Retrieval | BEIR v1.0.0 (test) | ArguAna | 54.4 | 55 |
| Text Embedding | MTEB English v2 | Mean Score | 65.5 | 50 |
| Multilingual Text Embedding | MTEB Multilingual | Mean Score (Task) | 63.2 | 29 |
| Multilingual Information Retrieval | XQuAD | -- | -- | 21 |
| LoRA Retrieval | CARLoS LoRA Retrieval Evaluation Set (test) | Top-1 Accuracy | 57.5 | 20 |
| Retrieval | MTEB-E English v2 | MTEB-E Retrieval Score | 53.47 | 16 |
| Passage retrieval | MS MARCO passage (dev) | NDCG@10 | 0.4134 | 14 |
| Passage retrieval | TREC DL 2019 (evaluation) | NDCG@10 | 0.6943 | 14 |
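Two of the rows above report NDCG@10. For readers unfamiliar with the metric, the sketch below computes it with the common exponential-gain formulation; official TREC evaluations typically use `trec_eval` rather than hand-rolled code, and the relevance labels here are made up for illustration.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum((2**rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by the ideal DCG."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded relevance labels for one query's top-10 ranking:
print(round(ndcg_at_k([3, 2, 0, 1, 0, 0, 2, 0, 0, 1]), 4))
```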