SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval
About
In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress the passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, inspired by ELECTRA, to improve sample efficiency and reduce the mismatch of the input distribution between pre-training and fine-tuning. SimLM only requires access to an unlabeled corpus, making it more broadly applicable when labeled data or queries are unavailable. We conduct experiments on several large-scale passage retrieval datasets and show substantial improvements over strong baselines under various settings. Remarkably, SimLM even outperforms multi-vector approaches such as ColBERTv2, which incur significantly more storage cost. Our code and model checkpoints are available at https://github.com/microsoft/unilm/tree/master/simlm .
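Once pre-training and fine-tuning are done, SimLM represents each query and passage as a single dense vector, so retrieval reduces to ranking passages by inner-product similarity. A minimal sketch of that scoring step (the toy vectors below are hypothetical stand-ins for the bottleneck embeddings a real SimLM encoder would produce):

```python
# Toy illustration of single-vector dense retrieval scoring.
# The vectors stand in for the dense bottleneck embeddings that a
# trained SimLM encoder would output; only the ranking logic is real.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted by descending inner-product score."""
    scores = [(dot(query_vec, p), i) for i, p in enumerate(passage_vecs)]
    return [i for _, i in sorted(scores, reverse=True)]

query = [0.1, 0.9, 0.2]
passages = [
    [0.8, 0.1, 0.1],  # off-topic
    [0.1, 0.8, 0.3],  # relevant
    [0.0, 0.5, 0.1],  # partially relevant
]
print(rank_passages(query, passages))  # → [1, 2, 0]
```

In practice the passage vectors are pre-computed and indexed (e.g. with an approximate nearest-neighbor library), which is what makes single-vector models cheaper to serve than multi-vector ones like ColBERTv2.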
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Passage retrieval | MS MARCO (dev) | MRR@10 | 41.1 | 116 |
| Passage retrieval | Natural Questions (NQ) (test) | Top-20 Accuracy | 85.2 | 45 |
| Passage Ranking | TREC DL 2019 (test) | NDCG@10 | 69.8 | 33 |
| Passage Ranking | TREC DL 2019 | NDCG@10 | 0.714 | 24 |
| Web Search Retrieval | TREC DL 20 | nDCG@10 | 69.7 | 22 |
| Web Search Retrieval | TREC DL 19 | nDCG@10 | 71.4 | 22 |
| Information Retrieval | SciFact BEIR (test) | nDCG@10 | 62.4 | 22 |
| Information Retrieval | DBPedia BEIR (test) | nDCG@10 | 34.9 | 18 |
| Web Search Retrieval | MS MARCO (dev) | MRR@10 | 0.411 | 17 |
| Passage retrieval | MS MARCO (dev) | MRR@10 | 41.1 | 17 |
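MRR@10, reported above for MS MARCO, is the mean over queries of the reciprocal rank of the first relevant passage, counting only the top 10 results. A small self-contained illustration (the toy ranks below are made up, not leaderboard data):

```python
def mrr_at_k(first_relevant_ranks, k=10):
    """Mean reciprocal rank truncated at k.

    first_relevant_ranks: the 1-based rank of the first relevant
    passage for each query, or None if no relevant passage was
    retrieved. Ranks beyond k contribute 0.
    """
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)

# Three toy queries: first relevant hit at rank 1, at rank 4,
# and missing from the top 10 entirely.
print(mrr_at_k([1, 4, None]))  # → 0.4166... (i.e. (1 + 1/4 + 0) / 3)
```

nDCG@10, used for the TREC DL and BEIR rows, additionally weights graded relevance judgments logarithmically by position rather than scoring only the first hit.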