SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval
About
In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress the passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, inspired by ELECTRA, to improve sample efficiency and reduce the mismatch of the input distribution between pre-training and fine-tuning. SimLM only requires access to an unlabeled corpus, making it more broadly applicable when labeled data or queries are unavailable. We conduct experiments on several large-scale passage retrieval datasets and show substantial improvements over strong baselines under various settings. Remarkably, SimLM even outperforms multi-vector approaches such as ColBERTv2, which incur significantly more storage cost. Our code and model checkpoints are available at https://github.com/microsoft/unilm/tree/master/simlm .
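Once pre-training and fine-tuning are done, SimLM represents each query and passage as a single dense vector, so retrieval reduces to ranking passages by inner-product similarity. A minimal sketch of that scoring step (the toy vectors below are hypothetical stand-ins for the bottleneck embeddings a real SimLM encoder would produce):

```python
# Toy illustration of single-vector dense retrieval scoring.
# The vectors stand in for the dense bottleneck embeddings that a
# trained SimLM encoder would output; only the ranking logic is real.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted by descending inner-product score."""
    scores = [(dot(query_vec, p), i) for i, p in enumerate(passage_vecs)]
    return [i for _, i in sorted(scores, reverse=True)]

query = [0.1, 0.9, 0.2]
passages = [
    [0.8, 0.1, 0.1],  # off-topic
    [0.1, 0.8, 0.3],  # relevant
    [0.0, 0.5, 0.1],  # partially relevant
]
print(rank_passages(query, passages))  # → [1, 2, 0]
```

In practice the passage vectors are pre-computed and indexed (e.g. with an approximate nearest-neighbor library), which is what makes single-vector models cheaper to serve than multi-vector ones like ColBERTv2.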
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Passage retrieval | MS MARCO (dev) | MRR@10 | 41.1 | 116 |
| Passage retrieval | Natural Questions (NQ) (test) | Top-20 Accuracy | 85.2 | 45 |
| Passage Ranking | TREC DL 2019 (test) | NDCG@10 | 69.8 | 33 |
| Passage Ranking | TREC DL 2019 | NDCG@10 | 0.714 | 24 |
| Web Search Retrieval | TREC DL 20 | nDCG@10 | 69.7 | 22 |
| Web Search Retrieval | TREC DL 19 | nDCG@10 | 71.4 | 22 |
| Information Retrieval | SciFact BEIR (test) | nDCG@10 | 62.4 | 22 |
| Information Retrieval | DBPedia BEIR (test) | nDCG@10 | 34.9 | 18 |
| Web Search Retrieval | MS MARCO (dev) | MRR@10 | 0.411 | 17 |
| Passage retrieval | MS MARCO (dev) | MRR@10 | 41.1 | 17 |
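MRR@10, reported above for MS MARCO, is the mean over queries of the reciprocal rank of the first relevant passage, counting only the top 10 results. A small self-contained illustration (the toy ranks below are made up, not leaderboard data):

```python
def mrr_at_k(first_relevant_ranks, k=10):
    """Mean reciprocal rank truncated at k.

    first_relevant_ranks: the 1-based rank of the first relevant
    passage for each query, or None if no relevant passage was
    retrieved. Ranks beyond k contribute 0.
    """
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)

# Three toy queries: first relevant hit at rank 1, at rank 4,
# and missing from the top 10 entirely.
print(mrr_at_k([1, 4, None]))  # → 0.4166... (i.e. (1 + 1/4 + 0) / 3)
```

nDCG@10, used for the TREC DL and BEIR rows, additionally weights graded relevance judgments logarithmically by position rather than scoring only the first hit.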