
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

About

In this paper, we introduce a new embedding model called M3-Embedding, which is distinguished by its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It provides uniform support for semantic retrieval in more than 100 working languages. It can simultaneously perform the three common retrieval functionalities: dense retrieval, multi-vector retrieval, and sparse retrieval. It can also process inputs of different granularities, ranging from short sentences to long documents of up to 8,192 tokens. The effective training of M3-Embedding rests on a series of technical contributions. Notably, we propose a novel self-knowledge distillation approach, in which the relevance scores from the different retrieval functionalities are integrated into a teacher signal to enhance training quality. We also optimize the batching strategy, which enables a large batch size and high training throughput and thereby improves the discriminativeness of the embeddings. M3-Embedding exhibits superior performance in our experiments, leading to new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
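The abstract does not give the scoring formulas, so the following is only a minimal sketch of how the three retrieval scores and an integrated teacher signal might be computed. It assumes an inner product for dense retrieval, a sum of term-weight products over shared terms for sparse retrieval, and a late-interaction (max-then-average) score for multi-vector retrieval. All function names, the equal weights, and the cross-entropy form of the distillation loss are illustrative assumptions, not the authors' implementation.

```python
import math


def dense_score(q_vec, p_vec):
    """Inner product of one query vector and one passage vector (assumed form)."""
    return sum(a * b for a, b in zip(q_vec, p_vec))


def sparse_score(q_weights, p_weights):
    """Sum of weight products over terms present in both query and passage (assumed form)."""
    return sum(w * p_weights[t] for t, w in q_weights.items() if t in p_weights)


def multi_vector_score(q_vecs, p_vecs):
    """Late interaction: best passage match per query token vector, then averaged (assumed form)."""
    best = [max(dense_score(q, p) for p in p_vecs) for q in q_vecs]
    return sum(best) / len(best)


def integrated_scores(dense, sparse, multi, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three per-candidate score lists; the weights are hypothetical."""
    w_d, w_s, w_m = weights
    return [w_d * d + w_s * s + w_m * m for d, s, m in zip(dense, sparse, multi)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_scores, student_scores):
    """Cross-entropy of the student distribution against the softmaxed teacher scores."""
    teacher = softmax(teacher_scores)
    student = softmax(student_scores)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


# Toy scores for two candidate passages under each retrieval function.
dense = [1.0, 2.0]
sparse = [0.0, 1.0]
multi = [0.5, 0.5]

teacher = integrated_scores(dense, sparse, multi)
# Each single-function score list is pulled toward the integrated teacher signal.
losses = [distillation_loss(teacher, s) for s in (dense, sparse, multi)]
```

In this sketch the integrated score acts as the teacher for each individual retrieval function, which is one plausible reading of "self-knowledge distillation" as described above; the published method may weight or combine the terms differently.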

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu • 2024

Related benchmarks

Task                                            Dataset                 Result              Rank
Commonsense Reasoning                           HellaSwag               Accuracy 86.67      1891
Multi-hop Question Answering                    HotpotQA (test)         F1 51.5             255
Multi-hop Question Answering                    2WikiMultiHopQA (test)  EM 1                195
Multi-hop Question Answering                    2WikiMQA                F1 Score 42.3       161
Question Answering                              2Wiki                   F1 24.2             152
Sentiment Analysis                              SST-5                   Accuracy 46.41      106
Composed Image Retrieval (Image-Text to Image)  CIRR                    Recall@5 34         93
Information Retrieval                           BEIR (test)             --                  90
Commonsense Question Answering                  CommonsenseQA           Accuracy 87.71      83
Multilingual Information Retrieval              XQuAD                   Completion@10 77.9  80
Showing 10 of 229 rows
