KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model

About

As retrieval-augmented generation prevails in large language models, embedding models are becoming increasingly crucial. Despite the growing number of general embedding models, prior work often overlooks the critical role of training data quality. In this work, we introduce KaLM-Embedding, a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance: (1) persona-based synthetic data to create diversified examples distilled from LLMs, (2) ranking consistency filtering to remove less informative samples, and (3) semi-homogeneous task batch sampling to improve training efficacy. Departing from traditional BERT-like architectures, we adopt Qwen2-0.5B as the pre-trained model, facilitating the adaptation of auto-regressive language models for general embedding tasks. Extensive evaluations on the MTEB benchmark across multiple languages show that our model outperforms others of comparable size, setting a new standard for multilingual embedding models with <1B parameters.
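The abstract notes that Qwen2-0.5B, an auto-regressive decoder, is adapted for general embedding tasks. One common way to turn a decoder into an embedder (the abstract does not specify the pooling, so this is an illustrative assumption, not the paper's confirmed method) is last-token pooling: take the hidden state at the final non-padding position of each sequence and L2-normalize it so cosine similarity reduces to a dot product. A minimal NumPy sketch of that pooling step:

```python
import numpy as np

def last_token_pool(hidden_states, attention_mask):
    """Pool a decoder's outputs into one embedding per sequence.

    hidden_states: (batch, seq_len, dim) array of final-layer hidden states.
    attention_mask: (batch, seq_len) array of 1s for real tokens, 0s for padding.
    """
    # Index of the last non-padding token in each sequence.
    last_idx = attention_mask.sum(axis=1) - 1
    emb = hidden_states[np.arange(hidden_states.shape[0]), last_idx]
    # L2-normalize so that cosine similarity is a plain dot product.
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Toy usage with dummy hidden states (2 sequences, 3 positions, 2 dims).
hs = np.arange(12, dtype=float).reshape(2, 3, 2)
mask = np.array([[1, 1, 0],   # second sequence is one token longer
                 [1, 1, 1]])
embeddings = last_token_pool(hs, mask)  # shape (2, 2), unit-norm rows
```

In practice the hidden states would come from the fine-tuned Qwen2-0.5B forward pass; the pooled vectors are then trained with a contrastive objective.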

Xinshuo Hu, Zifei Shan, Xinping Zhao, Zetian Sun, Zhenyu Liu, Dongfang Li, Shaolin Ye, Xinyuan Wei, Qian Chen, Baotian Hu, Haofen Wang, Jun Yu, Min Zhang• 2025
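The ranking consistency filtering mentioned in the abstract drops training pairs that a scoring model cannot rank correctly. The paper's exact criterion is not given here, so the following is a hedged sketch of one plausible formulation: keep a (query, positive) pair only if the positive scores within the top-k against its mined negatives.

```python
def ranking_consistency_filter(sim_pos, sim_negs, top_k=1):
    """Flag which training pairs to keep.

    sim_pos: list of similarity scores between each query and its positive.
    sim_negs: list of lists; negatives' similarity scores for each query.
    A pair is kept when the positive ranks within top_k of its candidate set.
    """
    keep = []
    for p, negs in zip(sim_pos, sim_negs):
        # Rank 1 means the positive outscores every negative.
        rank = 1 + sum(n >= p for n in negs)
        keep.append(rank <= top_k)
    return keep

# Toy usage: the first pair is consistent, the second is not.
flags = ranking_consistency_filter([0.9, 0.2], [[0.5, 0.4], [0.8, 0.3]])
```

Pairs whose positive is outranked by negatives are treated as noisy or uninformative and removed before contrastive training.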

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| User Behavior Prediction | Alipay Industrial User Cognition Benchmark downstream tasks | Concert Performance | 53.59 | 16 |
| Text Embedding | AfriMTEB (AMH, GAZ, HAU, IBO, KIN, SWA, XHO, YOR, ZUL) Lite (test) | Hate Speech Score | 48.5 | 16 |
| Text Embedding | AfriMTEB Full | Btxt Score | 49.9 | 15 |
| Column matching | CovidKG | Recall@10 | 76 | 10 |
| Column matching | CIUS | Recall@10 | 86 | 10 |
| Column matching | SAUS | Recall@10 | 85.6 | 10 |
| Tuple Matching | CancerKG | Recall@10 | 81 | 10 |
| Tuple Matching | CovidKG | Recall@10 | 75 | 10 |
| Tuple Matching | Webtable | Recall@10 | 85.8 | 10 |
| Tuple Matching | CIUS | Recall@10 | 86 | 10 |

Showing 10 of 20 rows.
