No More K-means: Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval

About

Multi-vector retrieval (MVR) models, exemplified by ColBERT, have established new benchmarks in retrieval accuracy by preserving fine-grained token-level interactions. However, this granularity imposes prohibitive storage and retrieval efficiency bottlenecks: to manage the immense memory footprint and computational overhead of billion-scale token vectors, state-of-the-art systems are forced to rely on aggressive dimension reduction and complex clustering (e.g., K-means). This compromise introduces two critical limitations: excessive indexing latency from clustering large-scale corpora, and semantic information loss inherent to compression. In this paper, we propose Single-stage Sparse Retrieval (SSR), a paradigm shift that replaces expensive clustering with efficient sparse coding. Instead of compressing features into low-dimensional dense vectors, we use a Sparse Autoencoder (SAE) to project token embeddings into a high-dimensional but highly sparse representation. This transformation enables us to bypass vector clustering entirely and leverage inverted indexing for precise, high-throughput retrieval. Extensive experiments on the BEIR benchmark demonstrate that SSR achieves a "trifecta" of improvements: it reduces indexing time by 15x compared to ColBERTv2, halves retrieval latency, and simultaneously improves retrieval performance over leading baselines.
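
The pipeline described in the abstract (sparse token codes stored in an inverted index rather than clustered dense vectors) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `top_k_sparse`, `build_inverted_index`, and `search` are hypothetical, the activations stand in for the output of a trained SAE encoder, and the scoring rule (a sum of query weight times the best document weight per active dimension) is a simplifying assumption chosen for brevity.

```python
from collections import defaultdict


def top_k_sparse(activations, k):
    """Keep the k largest positive activations as a {dimension: weight} dict.

    `activations` stands in for the output of an SAE encoder (assumed here).
    """
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i], reverse=True)[:k]
    return {i: activations[i] for i in ranked if activations[i] > 0.0}


def build_inverted_index(docs):
    """Map each active dimension to the documents that use it.

    docs: {doc_id: [sparse_token, ...]}
    returns: {dimension: {doc_id: max weight over that document's tokens}}
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for token in tokens:
            for dim, weight in token.items():
                if weight > index[dim].get(doc_id, 0.0):
                    index[dim][doc_id] = weight
    return index


def search(index, query_tokens):
    """Score only documents sharing an active dimension with the query.

    Scoring rule is an assumption for illustration, not the paper's.
    """
    scores = defaultdict(float)
    for token in query_tokens:
        for dim, q_weight in token.items():
            for doc_id, d_weight in index.get(dim, {}).items():
                scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpus: two documents, 4-dimensional codes, at most 2 active dims per token.
docs = {
    "d1": [top_k_sparse([0.9, 0.0, 0.1, 0.0], 2),
           top_k_sparse([0.0, 0.8, 0.0, 0.0], 2)],
    "d2": [top_k_sparse([0.0, 0.0, 0.0, 0.7], 2)],
}
index = build_inverted_index(docs)
query = [top_k_sparse([1.0, 0.0, 0.0, 0.0], 2)]
results = search(index, query)
print(results)
```

In this sketch no clustering step is needed at indexing time: building the index is a single pass over the sparse codes, and a query touches only the posting lists of its active dimensions.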

Lixuan Guo, Yifei Wang, Tiansheng Wen, Aosong Feng, Stefanie Jegelka, Chenyu You • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Information Retrieval | MS-MARCO (test) | -- | 56 |
| Zero-shot Information Retrieval | BEIR | NFCorpus NDCG@10 (Zero-shot): 39.1 | 38 |
| Information Retrieval | LoTTE Search (test) | Lifestyle Score: 87.6 | 9 |
| Information Retrieval | LoTTE Forum (test) | IR Score (Lifestyle): 79.7 | 9 |
| Information Retrieval | BEIR and MSMARCO (test) | MS: 62 | 9 |
| Information Retrieval | MS MARCO Passage | nDCG@10: 0.455 | 7 |
| Information Retrieval | LIMIT diagnostic benchmark | Recall@5: 78.6 | 6 |
| Information Retrieval | MS MARCO Document Ranking | nDCG@10: 48.8 | 5 |
| Passage Ranking | MS-MARCO passage ranking | Peak Memory (GB): 34.6 | 4 |
