RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder

About

Despite pre-training's progress in many important NLP tasks, it remains to explore effective pre-training strategies for dense retrieval. In this paper, we propose RetroMAE, a new retrieval-oriented pre-training paradigm based on Masked Auto-Encoder (MAE). RetroMAE is highlighted by three critical designs. 1) A novel MAE workflow, where the input sentence is polluted for the encoder and the decoder with different masks. The sentence embedding is generated from the encoder's masked input; then, the original sentence is recovered based on the sentence embedding and the decoder's masked input via masked language modeling. 2) Asymmetric model structure, with a full-scale BERT-like transformer as the encoder and a one-layer transformer as the decoder. 3) Asymmetric masking ratios, with a moderate ratio for the encoder (15~30%) and an aggressive ratio for the decoder (50~70%). Our framework is simple to realize and empirically competitive: the pre-trained models dramatically improve the SOTA performances on a wide range of dense retrieval benchmarks, like BEIR and MS MARCO. The source code and pre-trained models are made publicly available at https://github.com/staoxiao/RetroMAE so as to inspire more interesting research.
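The asymmetric masking described in the abstract can be illustrated with a minimal sketch. It only shows how one sentence might be masked twice at different ratios; it does not model the encoder, the sentence embedding, or the one-layer decoder, and it is not taken from the authors' repository. The helper names (`mask_tokens`, `asymmetric_masks`), the whitespace tokenization, the independent sampling of the two masks, and the specific ratios 0.3 and 0.5 (picked from the quoted 15~30% and 50~70% ranges) are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # placeholder mask token, assumed for this sketch


def mask_tokens(tokens, ratio, rng):
    """Replace round(ratio * len(tokens)) randomly chosen tokens with MASK."""
    n_mask = round(ratio * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def asymmetric_masks(tokens, enc_ratio=0.3, dec_ratio=0.5, seed=0):
    """Build two differently masked views of the same sentence.

    enc_ratio: moderate ratio for the encoder input (hypothetical default).
    dec_ratio: aggressive ratio for the decoder input (hypothetical default).
    The two masks are sampled independently, which is an assumption here.
    """
    rng = random.Random(seed)
    enc_input = mask_tokens(tokens, enc_ratio, rng)
    dec_input = mask_tokens(tokens, dec_ratio, rng)
    return enc_input, dec_input


# Toy whitespace tokenization; a real setup would use a subword tokenizer.
tokens = "dense retrieval maps queries and passages into one embedding space".split()
enc_input, dec_input = asymmetric_masks(tokens)
print(enc_input)
print(dec_input)
```

In this sketch the encoder view keeps most of the sentence, while the decoder view hides half of it, so a reconstruction from the decoder view would have to rely more heavily on the sentence embedding.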

Shitao Xiao, Zheng Liu, Yingxia Shao, Zhao Cao • 2022

Related benchmarks

Task	Dataset	Result
Information Retrieval	BEIR	SciFact 0.531	120
Passage retrieval	MsMARCO (dev)	MRR@10 41.6	116
Retrieval	MS MARCO (dev)	MRR@10 0.3553	84
Information Retrieval	BEIR v1.0.0 (test)	ArguAna 43.3	75
Passage Ranking	TREC DL 2019	NDCG@10 0.681	32
Information Retrieval	MS MARCO DL2019	nDCG@10 68.8	26
Document Retrieval	MS MARCO Document (dev)	MRR@100 0.432	24
Passage Ranking	TREC DL 2020	NDCG@10 0.706	24
Passage retrieval	MS MARCO (dev)	MRR@10 41.6	17
Dense Retrieval	BEIR zero-shot	TREC-COVID 77.2	13

Showing 10 of 18 rows
