Multilingual E5 Text Embeddings: A Technical Report

About

This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a trade-off between inference efficiency and embedding quality. The training procedure follows the English E5 model recipe: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model whose performance is on par with state-of-the-art English-only models of similar size. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5.

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei • 2024
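
The contrastive pre-training stage mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the exact objective, so this assumes an InfoNCE-style loss with in-batch negatives, a common choice for text-pair training. The function names, the temperature value, and the toy vectors are illustrative assumptions, not details taken from the report.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(query_embs, passage_embs, temperature=0.05):
    """Mean InfoNCE loss over a batch of (query, passage) embedding pairs.

    Assumption: passage i is the positive for query i, and every other
    passage in the batch serves as a negative. The temperature is a
    hypothetical value chosen for illustration.
    """
    queries = [l2_normalize(q) for q in query_embs]
    passages = [l2_normalize(p) for p in passage_embs]

    total = 0.0
    for i, q in enumerate(queries):
        # Temperature-scaled cosine similarity to every passage in the batch.
        logits = [sum(a * b for a, b in zip(q, p)) / temperature for p in passages]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching passage as the target class.
        total += log_denom - logits[i]
    return total / len(queries)


# Toy batch: two orthogonal "queries" paired with matching or swapped "passages".
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each query sits nearest its own passage and large when the pairs are swapped, which is the signal that pulls matching text pairs together during pre-training.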

Related benchmarks

Task	Dataset	Result
Information Retrieval	BEIR (test)	--	126
Information Retrieval	BEIR	--	120
Text Embedding	MTEB English v2	Mean Score 65.5	107
Information Retrieval	BRIGHT	Mean nDCG@10 17.9	94
Multilingual Information Retrieval	XQuAD	Completion@10 58.99	80
Cross-lingual Information Retrieval	Belebele	Comp@10 0.6867	80
Information Retrieval	BEIR v1.0.0 (test)	ArguAna 54.4	75
Text Classification	N24News (test)	Macro F1 49.3	52
Multilingual Information Retrieval	MIRACL (dev)	Avg Score 66.6	51
Information Retrieval	FIQA BEIR (test)	nDCG@10 48.4	44

Showing 10 of 87 rows
