MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector

About

Learned Sparse Retrieval (LSR) combines the efficiency of bi-encoders with the transparency of lexical matching, but existing approaches struggle to scale beyond English. We introduce MILCO, an LSR architecture that maps queries and documents from different languages into a shared English lexical space via a multilingual connector. MILCO is trained with a specialized two-stage regime that combines Sparse Alignment Pretraining with contrastive training to provide representation transparency and effectiveness while mitigating semantic collapse. Motivated by the observation that uncommon entities are often lost when projected into English, we propose a new LexEcho head, which enhances robustness by augmenting the English lexical representation with a source-language view obtained through a special [ECHO] token. MILCO achieves state-of-the-art multilingual and cross-lingual LSR performance, outperforming leading dense, sparse, and multi-vector baselines such as BGE-M3 and Qwen3-Embed on standard multilingual benchmarks, while supporting dynamic efficiency through post-hoc pruning. Notably, when using mass-based pruning to reduce document representations to only 30 active dimensions on average, MILCO 560M outperforms the similarly-sized Qwen3-Embed 0.6B with 1024 dimensions, while achieving 3× lower retrieval latency and 10× smaller index size.
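
The abstract does not define mass-based pruning, so the following is only a minimal sketch under one plausible reading: keep the highest-weighted terms of a sparse vector until they account for a chosen fraction of its total weight, then score with a sparse dot product. The names `prune_by_mass` and `score`, the `mass` parameter, and the toy weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict

# A sparse lexical representation: vocabulary term -> non-negative weight.
SparseVec = Dict[str, float]


def prune_by_mass(vec: SparseVec, mass: float = 0.9) -> SparseVec:
    """Keep the heaviest terms until they cover `mass` of the total weight.

    Assumption: this is an illustrative interpretation of "mass-based
    pruning", not the procedure used by MILCO.
    """
    if not 0.0 < mass <= 1.0:
        raise ValueError("mass must be in (0, 1]")
    total = sum(vec.values())
    if total <= 0.0:
        return {}
    kept: SparseVec = {}
    running = 0.0
    # Visit terms from heaviest to lightest and stop once the target is met.
    for term, weight in sorted(vec.items(), key=lambda kv: kv[1], reverse=True):
        if running >= mass * total:
            break
        kept[term] = weight
        running += weight
    return kept


def score(query: SparseVec, doc: SparseVec) -> float:
    """Sparse dot product over the terms shared by query and document."""
    return sum(w * doc[t] for t, w in query.items() if t in doc)


# Toy weights, invented for illustration only.
doc = {"paris": 2.0, "capital": 1.5, "france": 1.0, "city": 0.3, "the": 0.2}
query = {"capital": 1.0, "france": 1.0}

pruned = prune_by_mass(doc, mass=0.8)
print(sorted(pruned))        # terms that survive pruning
print(score(query, pruned))  # score against the pruned document
```

In this toy case the pruned document keeps three of five terms and the query score is unchanged, which is the kind of trade-off between index size and effectiveness that post-hoc pruning is meant to offer.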

Thong Nguyen, Yibin Lei, Jia-Huei Ju, Eugene Yang, Andrew Yates • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Information Retrieval | BEIR | Average NDCG@10: 0.544 | 62 |
| Information Retrieval | LIMIT benchmark (test) | Recall@2: 26.2 | 46 |
| Multi-lingual retrieval | MIRACL (dev) | Avg Score: 72.3 | 43 |
| Information Retrieval | FIQA BEIR (test) | nDCG@10: 42.7 | 32 |
| Multilingual Long-context Retrieval | MLDR | nDCG@10: 74.4 | 28 |
| Multilingual Retrieval | MTEB Multilingual v2 | -- | 28 |
| Cross-lingual retrieval | MKQA | Avg. Recall@100: 76.6 | 27 |
| Information Retrieval | NFCorpus BEIR | nDCG: 36.3 | 22 |
| Information Retrieval | FEVER BEIR | nDCG: 0.834 | 22 |
| Nugget Coverage Reranking | CRUX-MDS DUC 2004 (test) | nDCG: 70.4 | 18 |
Showing 10 of 24 rows
