Learning Retrieval Models with Sparse Autoencoders
About
Sparse autoencoders (SAEs) provide a powerful mechanism for decomposing the dense representations produced by Large Language Models (LLMs) into interpretable latent features. We posit that SAEs constitute a natural foundation for Learned Sparse Retrieval (LSR), whose objective is to encode queries and documents into high-dimensional sparse representations optimized for efficient retrieval. In contrast to existing LSR approaches that project input sequences into the vocabulary space, SAE-based representations offer the potential to produce more semantically structured, expressive, and language-agnostic features. Building on this insight, we introduce SPLARE, a method to train SAE-based LSR models. Our experiments, relying on recently released open-source SAEs, demonstrate that this technique consistently outperforms vocabulary-based LSR in multilingual and out-of-domain settings. SPLARE-7B, a multilingual retrieval model capable of producing generalizable sparse latent embeddings for a wide range of languages and domains, achieves top results on MMTEB's multilingual and English retrieval tasks. We also developed a 2B-parameter variant with a significantly lighter footprint.
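The core mechanism described above can be sketched in a few lines: an SAE encoder maps a dense LLM embedding to a high-dimensional sparse latent vector, and relevance is a dot product between the sparse query and document vectors. The snippet below is a minimal illustration, not the SPLARE implementation; the dimensions, the top-k sparsification scheme, and the randomly initialized weights are all assumptions for the sake of the example (in practice the encoder weights come from a pretrained open-source SAE).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; real SAEs over LLM residual
# streams use far larger dense and latent dimensions.
D_MODEL, D_LATENT, TOP_K = 64, 512, 16

# SAE encoder parameters (random here; in practice taken from a
# pretrained SAE and optionally fine-tuned for retrieval).
W_enc = rng.standard_normal((D_LATENT, D_MODEL)) / np.sqrt(D_MODEL)
b_enc = np.zeros(D_LATENT)

def sae_encode(dense_vec, k=TOP_K):
    """Map a dense embedding to a sparse latent vector:
    ReLU pre-activations, then keep only the top-k features
    (a common SAE sparsification scheme)."""
    acts = np.maximum(W_enc @ dense_vec + b_enc, 0.0)
    if k < len(acts):
        cutoff = np.partition(acts, -k)[-k]
        acts[acts < cutoff] = 0.0
    return acts

def score(query_dense, doc_dense):
    """Relevance = dot product of sparse latent vectors, so retrieval
    can be served from an inverted index over active features."""
    return float(sae_encode(query_dense) @ sae_encode(doc_dense))

# Toy query/document embeddings standing in for LLM outputs.
q = rng.standard_normal(D_MODEL)
d = rng.standard_normal(D_MODEL)
s = score(q, d)
```

Because each encoded vector has at most `TOP_K` active features, documents can be indexed by their active latent dimensions exactly as vocabulary-based LSR indexes terms; the difference is that the dimensions are SAE latents rather than vocabulary entries.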
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multilingual retrieval | MIRACL (dev) | Avg score | 71.7 | 43 |
| Multilingual retrieval | MTEB Multilingual v2 | nDCG@10 | 63.8 | 28 |
| Retrieval | XTREME-UP | MRR@10 | 61.4 | 23 |
| Retrieval | MTEB eng v2 | nDCG@10 | 61.4 | 19 |
| Information retrieval | MTEB English | nDCG@10 | 59.3 | 6 |
| Information retrieval | MTEB Multilingual | nDCG@10 | 62.3 | 6 |
| Information retrieval | MTEB Medical | nDCG@10 | 67.7 | 6 |
| Information retrieval | MTEB Law | nDCG@10 | 60.8 | 6 |
| Information retrieval | MTEB ChemTEB | nDCG@10 | 78.1 | 6 |
| Information retrieval | MTEB Code | nDCG@10 | 63.0 | 6 |