
Language Models are Universal Embedders

About

In the large language model (LLM) revolution, embedding is a key component of many systems, such as retrieving knowledge or memories for LLMs or building content moderation filters. Because such use cases span from English to other natural and programming languages, and from retrieval to classification and beyond, it is advantageous to build a single unified embedding model rather than a dedicated one for each scenario. In this context, pre-trained multilingual decoder-only large language models, e.g., BLOOM, emerge as a viable backbone option. To assess their potential, we propose straightforward strategies for constructing embedders and introduce a universal evaluation benchmark. Experimental results show that our trained model generates good embeddings across languages and tasks, even extending to languages and tasks for which no finetuning/pretraining data is available. We also present detailed analyses and additional evaluations. We hope this work encourages the development of more robust open-source universal embedders.
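
As a concrete illustration of the general recipe, the sketch below derives sentence embeddings from a pre-trained decoder-only model by mean-pooling its final hidden states. The choice of the `bigscience/bloom-560m` checkpoint and of mean pooling as the aggregation strategy are illustrative assumptions, not the paper's confirmed construction; the authors' own embedder and training details may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative backbone: a small multilingual decoder-only LM.
# The checkpoint and the pooling strategy below are assumptions
# made for demonstration, not the paper's exact configuration.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModel.from_pretrained("bigscience/bloom-560m")
model.eval()

def embed(texts):
    """Mean-pool the final hidden states over non-padding tokens."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    summed = (hidden * mask).sum(dim=1)                    # ignore padding
    counts = mask.sum(dim=1).clamp(min=1)
    emb = summed / counts                                  # (B, H)
    return torch.nn.functional.normalize(emb, p=2, dim=1)  # unit length

vectors = embed(["Language models are universal embedders.",
                 "Les modèles de langue sont des encodeurs universels."])
similarity = vectors[0] @ vectors[1]  # cosine similarity of the pair
```

Because the vectors are L2-normalised, a plain dot product gives cosine similarity, which is the standard scoring function for the retrieval and STS tasks listed below.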

Xin Zhang, Zehan Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Min Zhang • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Semantic Textual Similarity | STS-B | Spearman's Rho (x100) | 66.01 | 70 |
| Text Embedding | MTEB | MTEB Score | 60.63 | 45 |
| Semantic Textual Similarity (STS) | MTEB English 2023 (test) | BIO | 85.52 | 19 |
| Semantic Textual Similarity | BQ | Pearson Correlation | 0.5389 | 11 |
| Semantic Textual Similarity | PAWSX | Pearson Correlation | 0.2041 | 11 |
| Semantic Textual Similarity | AFQMC | Pearson Correlation | 0.3177 | 11 |
| Semantic Textual Similarity | ATEC | Pearson Correlation | 0.3328 | 11 |
| Semantic Textual Similarity | LCQMC | Pearson Correlation | 0.5369 | 11 |
| Semantic Textual Similarity | QBQTC | Pearson Correlation | 0.2088 | 11 |
| NLI | CMNLI (test) | Acc | 72.34 | 9 |

(Showing 10 of 17 rows.)
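
The STS rows above report Spearman's rho and Pearson correlation between predicted and gold similarity scores. A minimal sketch of that evaluation loop follows; the sentence pairs and gold labels are hypothetical, and it reuses the `embed` helper from the sketch above (mean-pooled, L2-normalised vectors, so a dot product is cosine similarity).

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical evaluation data: sentence pairs with gold similarity
# labels, in the style of STS-B / BQ / LCQMC benchmarks.
pairs = [("A man is playing a guitar.", "A person plays guitar."),
         ("A dog runs in the park.", "The stock market fell today.")]
gold = np.array([4.8, 0.2])

# Cosine similarity of each pair via the embed() helper defined above.
left = embed([a for a, _ in pairs])
right = embed([b for _, b in pairs])
predicted = (left * right).sum(dim=1).numpy()

print("Spearman's rho (x100):", spearmanr(predicted, gold)[0] * 100)
print("Pearson correlation:  ", pearsonr(predicted, gold)[0])
```

Spearman's rho compares only the rankings of the two score lists, so it is robust to monotone rescaling of the model's similarities, whereas Pearson correlation measures linear agreement with the gold scores directly.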
