Arctic-Embed 2.0: Multilingual Retrieval Without Compromise

About

This paper presents the training methodology of Arctic-Embed 2.0, a set of open-source text embedding models built for accurate and efficient multilingual retrieval. While prior works have suffered from degraded English retrieval quality, Arctic-Embed 2.0 delivers competitive retrieval quality on multilingual and English-only benchmarks, and supports Matryoshka Representation Learning (MRL) for efficient embedding storage with significantly lower compressed quality degradation compared to alternatives. We detail the design and implementation, presenting several important open research questions that arose during model development. We conduct experiments exploring these research questions and include extensive discussion aimed at fostering further discussion in this field.

Puxuan Yu, Luke Merrick, Gaurav Nuti, Daniel Campos• 2024

Related benchmarks

Task	Dataset	Result
Information Retrieval	BEIR (test)	--	126
Information Retrieval	BEIR	--	120
Text Embedding	MTEB English v2	Mean Score63.6	107
Text Classification	N24News (test)	Macro F157.21	52
Multilingual Retrieval	MTEB Multilingual v2	nDCG@1054.8	40
Retrieval	MTEB eng v2	nDCG@1058.4	31
Multilingual Text Embedding	MTEB Multilingual	Mean Score (Task)57	29
Multilingual Long-context Retrieval	MLDR	nDCG@1034	28
Information Retrieval	LongEmbed	NDCG@1055.4	26
Code Retrieval	MTEB Code	nDCG@1055.2	21

Showing 10 of 31 rows

Other info

Follow for update

@wizwand_team Discord