EmbeddingGemma: Powerful and Lightweight Text Representations
About
We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and geometric embedding distillation. We improve model robustness and expressiveness with a spread-out regularizer, and ensure generalizability by merging checkpoints from varied, optimized mixtures. Evaluated on the Massive Text Embedding Benchmark (MTEB) across multilingual, English, and code domains, EmbeddingGemma (300M) achieves state-of-the-art results. Notably, it outperforms prior top models, both proprietary and open, with fewer than 500M parameters, and provides performance comparable to models double its size, offering an exceptional performance-to-cost ratio. Remarkably, this lead persists when quantizing model weights or truncating embedding outputs. This makes EmbeddingGemma particularly well-suited for low-latency and high-throughput use cases such as on-device applications. We provide ablation studies exploring our key design choices. We release EmbeddingGemma to the community to promote further research.
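
The abstract notes that the model's lead persists when embedding outputs are truncated. As a rough illustration of what truncation usually means in practice, the sketch below keeps the first `k` dimensions of a vector and rescales the result to unit length before comparing by cosine similarity. It assumes prefix truncation followed by L2 renormalization; the function names and the toy 8-dimensional vectors are hypothetical and are not taken from the paper or from any EmbeddingGemma API.

```python
import math

def truncate_and_normalize(vec, k):
    """Keep the first k dimensions and rescale to unit L2 norm."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy vectors standing in for real model outputs (assumption, not real data).
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, -0.1, 0.02]
doc = [0.8, 0.2, 0.25, -0.1, 0.0, 0.1, -0.05, 0.01]

full_sim = cosine(query, doc)
short_sim = cosine(truncate_and_normalize(query, 4),
                   truncate_and_normalize(doc, 4))
```

Comparing `full_sim` with `short_sim` on real embeddings is one simple way to check how much similarity structure survives at a smaller dimension, which matters when storage or latency is constrained.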
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Information Retrieval | BEIR | -- | -- | 120 |
| Text Embedding | MTEB English v2 | Mean Score | 69.7 | 107 |
| Semantic Textual Similarity | BIOSSES | Spearman Correlation | 80.46 | 55 |
| Text Classification | N24News (test) | Macro F1 | 71.09 | 52 |
| Information Retrieval | COVID | nDCG@10 | 50.36 | 50 |
| Multilingual Retrieval | MTEB Multilingual v2 | nDCG@10 | 62.5 | 40 |
| Information Retrieval | NFCorpus | nDCG@10 | 31.42 | 33 |
| Retrieval | MTEB eng v2 | nDCG@10 | 54.6 | 31 |
| Multilingual Text Embedding | MTEB Multilingual | Mean Score (Task) | 61.1 | 29 |
| Information Retrieval | LongEmbed | nDCG@10 | 55.4 | 26 |
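
Several rows above report nDCG@10. For readers unfamiliar with the metric, here is a minimal sketch of one common formulation: linear gain with a log2 rank discount, normalized by the score of the ideal ordering. Whether each benchmark uses exactly this variant is an assumption; the function names and the `all_relevances` parameter are illustrative and do not come from any evaluation library.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_relevances, all_relevances=None, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: relevance labels of retrieved documents, in rank order.
    all_relevances: labels of every judged document for the query; defaults
    to the retrieved labels when omitted (an assumption of this sketch).
    """
    if all_relevances is None:
        all_relevances = ranked_relevances
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal

# A single relevant document placed at rank 2 instead of rank 1.
score = ndcg_at_k([0, 1])
```

The function returns a fraction between 0 and 1; the scores in the table appear to be on a 0 to 100 scale, so they would correspond to this fraction multiplied by 100 and averaged over queries.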