# Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval
## About
Recent research demonstrates the effectiveness of using fine-tuned language models (LM) for dense retrieval. However, dense retrievers are hard to train, typically requiring heavily engineered fine-tuning pipelines to realize their full potential. In this paper, we identify and address two underlying problems of dense retrievers: i) fragility to training data noise and ii) requiring large batches to robustly learn the embedding space. We use the recently proposed Condenser pre-training architecture, which learns to condense information into the dense vector through LM pre-training. On top of it, we propose coCondenser, which adds an unsupervised corpus-level contrastive loss to warm up the passage embedding space. Retrieval experiments on the MS MARCO, Natural Questions, and TriviaQA datasets show that coCondenser removes the need for heavy data engineering such as augmentation, synthesis, or filtering, as well as the need for large batch training. It shows comparable performance to RocketQA, a state-of-the-art, heavily engineered system, using simple small-batch fine-tuning.
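To make the corpus-level contrastive loss concrete, below is a minimal sketch under assumptions not spelled out in the abstract: two spans are sampled from each passage and encoded into dense vectors (the encoder is omitted here), and each span must pick out its passage-mate against all other spans in the batch, an in-batch InfoNCE-style loss. The function name and signature are illustrative, not the released implementation, and in the actual model this loss is trained jointly with the Condenser pre-training objective.

```python
import torch
import torch.nn.functional as F

def contrastive_span_loss(span_a: torch.Tensor,
                          span_b: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """In-batch contrastive loss over span embeddings (illustrative sketch).

    span_a, span_b: [batch, dim] embeddings of two spans sampled from the
    same passage; row i of each tensor forms a positive pair, and every
    other span in the batch serves as a negative.
    """
    # Similarity of every span in `span_a` against every span in `span_b`.
    scores = span_a @ span_b.t() / temperature  # [batch, batch]
    # The matching span (from the same passage) sits on the diagonal.
    targets = torch.arange(scores.size(0), device=scores.device)
    # Symmetric InfoNCE: match a -> b and b -> a.
    return 0.5 * (F.cross_entropy(scores, targets)
                  + F.cross_entropy(scores.t(), targets))
```

Because negatives come for free from the rest of the batch, this objective organizes the embedding space over the whole corpus before any supervised fine-tuning, which is what lets the fine-tuning stage get away with small batches.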
## Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Passage Retrieval | MS MARCO (dev) | MRR@10 | 38.2 | 116 |
| Retrieval | MS MARCO (dev) | MRR@10 | 0.386 | 84 |
| Information Retrieval | BEIR (test) | TREC-COVID Score | 71.2 | 76 |
| Passage Ranking | MS MARCO (dev) | MRR@10 | 38.2 | 73 |
| Retrieval | Natural Questions (test) | Top-5 Recall | 75.8 | 62 |
| Information Retrieval | BEIR | TREC-COVID | 0.173 | 59 |
| Passage Retrieval | Natural Questions (test) | Top-20 Accuracy | 84.3 | 45 |
| Information Retrieval | MS MARCO DL2019 | nDCG@10 | 71.5 | 26 |
| Information Retrieval | Natural Questions (test) | Recall@20 | 84.3 | 25 |
| Passage Ranking | TREC DL 2019 | nDCG@10 | 0.717 | 24 |
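For context on the headline numbers, MRR@10 (the MS MARCO dev metric above) averages the reciprocal rank of the first relevant passage per query, counting zero when no relevant passage appears in the top 10. A small illustrative sketch; the function name and input layout are assumptions, not taken from any released evaluation script:

```python
def mrr_at_10(ranked_ids: list[list[str]],
              relevant_ids: list[set[str]]) -> float:
    """Mean reciprocal rank with a rank-10 cutoff (illustrative).

    ranked_ids[i]  : passage ids returned for query i, best first.
    relevant_ids[i]: passage ids judged relevant for query i.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        for rank, pid in enumerate(ranking[:10], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids)
```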