Learning Retrieval Models with Sparse Autoencoders
About
Sparse autoencoders (SAEs) provide a powerful mechanism for decomposing the dense representations produced by Large Language Models (LLMs) into interpretable latent features. We posit that SAEs constitute a natural foundation for Learned Sparse Retrieval (LSR), whose objective is to encode queries and documents into high-dimensional sparse representations optimized for efficient retrieval. In contrast to existing LSR approaches that project input sequences into the vocabulary space, SAE-based representations offer the potential to produce more semantically structured, expressive, and language-agnostic features. Building on this insight, we introduce SPLARE, a method to train SAE-based LSR models. Our experiments, relying on recently released open-source SAEs, demonstrate that this technique consistently outperforms vocabulary-based LSR in multilingual and out-of-domain settings. SPLARE-7B, a multilingual retrieval model capable of producing generalizable sparse latent embeddings for a wide range of languages and domains, achieves top results on MMTEB's multilingual and English retrieval tasks. We also developed a 2B-parameter variant with a significantly lighter footprint.
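The core mechanism described above can be sketched in a few lines: an SAE encoder maps a dense LLM embedding to a high-dimensional sparse latent vector, and relevance is a dot product between the sparse query and document vectors. The snippet below is a minimal illustration, not the SPLARE implementation; the dimensions, the top-k sparsification scheme, and the randomly initialized weights are all assumptions for the sake of the example (in practice the encoder weights come from a pretrained open-source SAE).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; real SAEs over LLM residual
# streams use far larger dense and latent dimensions.
D_MODEL, D_LATENT, TOP_K = 64, 512, 16

# SAE encoder parameters (random here; in practice taken from a
# pretrained SAE and optionally fine-tuned for retrieval).
W_enc = rng.standard_normal((D_LATENT, D_MODEL)) / np.sqrt(D_MODEL)
b_enc = np.zeros(D_LATENT)

def sae_encode(dense_vec, k=TOP_K):
    """Map a dense embedding to a sparse latent vector:
    ReLU pre-activations, then keep only the top-k features
    (a common SAE sparsification scheme)."""
    acts = np.maximum(W_enc @ dense_vec + b_enc, 0.0)
    if k < len(acts):
        cutoff = np.partition(acts, -k)[-k]
        acts[acts < cutoff] = 0.0
    return acts

def score(query_dense, doc_dense):
    """Relevance = dot product of sparse latent vectors, so retrieval
    can be served from an inverted index over active features."""
    return float(sae_encode(query_dense) @ sae_encode(doc_dense))

# Toy query/document embeddings standing in for LLM outputs.
q = rng.standard_normal(D_MODEL)
d = rng.standard_normal(D_MODEL)
s = score(q, d)
```

Because each encoded vector has at most `TOP_K` active features, documents can be indexed by their active latent dimensions exactly as vocabulary-based LSR indexes terms; the difference is that the dimensions are SAE latents rather than vocabulary entries.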
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multilingual retrieval | MIRACL (dev) | Avg score | 71.7 | 43 |
| Multilingual retrieval | MTEB Multilingual v2 | nDCG@10 | 63.8 | 28 |
| Retrieval | XTREME-UP | MRR@10 | 61.4 | 23 |
| Retrieval | MTEB eng v2 | nDCG@10 | 61.4 | 19 |
| Information retrieval | MTEB English | nDCG@10 | 59.3 | 6 |
| Information retrieval | MTEB Multilingual | nDCG@10 | 62.3 | 6 |
| Information retrieval | MTEB Medical | nDCG@10 | 67.7 | 6 |
| Information retrieval | MTEB Law | nDCG@10 | 60.8 | 6 |
| Information retrieval | MTEB ChemTEB | nDCG@10 | 78.1 | 6 |
| Information retrieval | MTEB Code | nDCG@10 | 63.0 | 6 |