KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model
About
Recent advances in text embedding models based on Large Language Models (LLMs) focus primarily on data scaling or synthesis, with limited exploration of training techniques and data quality, which constrains performance. In this work, we propose KaLM-Embedding-V2, a series of versatile and compact embedding models that systematically incentivize advanced embedding capability in LLMs through superior training techniques and high-quality data. For model architecture, we implement the models at a compact 0.5B size, use simple mean-pooling to produce fixed-length embeddings, and remove the causal attention mask to enable fully bidirectional representation learning. For training techniques, we propose a progressive multi-stage training pipeline: pre-training on weakly supervised large-scale datasets, fine-tuning on supervised high-quality datasets, and contrastive distillation with fine-grained soft signals. The pipeline is integrated with focal-style reweighting, which emphasizes difficult samples, and online hard-negative mixing, which enriches hard negatives. For training data, we curate over 20 categories for pre-training and 100 categories for fine-tuning and contrastive distillation to improve both performance and generalization, leveraging task-specific instructions, hard-negative mining, and example-based multi-class labeling to ensure high quality. Combining these techniques, our KaLM-Embedding-V2 series achieves state-of-the-art performance on the Massive Text Embedding Benchmark, outperforming models of comparable size and rivaling models 3-26x larger, setting a new standard for versatile and compact embedding models under 1B parameters.
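Two of the ingredients named in the abstract, mean-pooling and focal-style reweighting, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the weight form `(1 - p_pos) ** gamma`, and the default `temperature` and `gamma` values are assumptions chosen for illustration, since the abstract does not give the exact formulation.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def focal_info_nce(query, positive, negatives, temperature=0.05, gamma=0.5):
    """InfoNCE loss for one query, scaled by an assumed focal-style weight.

    Returns (weighted_loss, p_pos), where p_pos is the softmax probability
    assigned to the positive. The weight (1 - p_pos) ** gamma is an assumption:
    it shrinks the loss of samples the model already ranks correctly.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    p_pos = exps[0] / sum(exps)
    weight = (1.0 - p_pos) ** gamma
    return weight * (-math.log(p_pos)), p_pos

# Toy data: the third token is padding and must not affect the embedding.
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
query = mean_pool(tokens, mask)

# An easy sample (negative points away) versus a hard one (negative is closer
# to the query than the positive is).
easy_loss, easy_p = focal_info_nce(query, [1.0, 1.0], [[-1.0, -1.0]])
hard_loss, hard_p = focal_info_nce(query, [1.0, 0.0], [[0.9, 1.0]])
```

In this toy setup the easy sample contributes almost nothing to the total, while the hard sample keeps nearly its full loss, which is the behavior "emphasize difficult samples" refers to. A real training loop would compute the same quantities in batch on the model's hidden states.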
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Information Retrieval | BEIR | -- | 59 |
| Text Embedding | MTEB English v2 | Mean Score: 71.3 | 50 |
| Multilingual Text Embedding | MTEB Multilingual | Mean Score (Task): 72.3 | 29 |
| Retrieval | MTEB-E English v2 | MTEB-E Retrieval Score: 58.45 | 16 |
| Retrieval | RTEB Multilingual Public | RTEB: 56.51 | 11 |
| Multilingual Retrieval | MTEB Multilingual v2 | MTEB-M Score: 57.9 | 11 |
| Retrieval | LongEmbed | Long Task Score: 43.35 | 11 |
| Question Answering | Locomo | BLEU-1: 44.4 | 6 |
| Question Answering | LongMemEval | Accuracy: 55.6 | 6 |