Nonparametric Masked Language Modeling
About
Existing language models (LMs) predict tokens with a softmax over a finite vocabulary, which can make it difficult to predict rare tokens or phrases. We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus. NPM fills in the [MASK] solely from retrieving a token from a text corpus. We show that NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval. Zero-shot evaluation on 16 tasks including classification, fact probing and question answering demonstrates that NPM outperforms significantly larger parametric models, either with or without a retrieve-and-generate approach. It is particularly better at dealing with rare patterns (word senses or facts) and predicting rare or nearly unseen words (e.g., non-Latin script). We release the model and code at github.com/facebookresearch/NPM.
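The core idea above — replacing the vocabulary softmax with a softmax over corpus positions — can be sketched minimally. The snippet below is a simplified illustration, not the released implementation: it retrieves single tokens rather than phrases (NPM proper uses start/end embeddings for phrase spans), and the embeddings and `nonparametric_mask_fill` helper are hypothetical toy constructions.

```python
import numpy as np

def nonparametric_mask_fill(query_vec, corpus_vecs, corpus_tokens):
    """Fill a [MASK] by retrieving a token from a reference corpus.

    Instead of a softmax over a fixed vocabulary, the distribution is a
    softmax over every token occurrence in the corpus (simplified here
    to single tokens; NPM itself retrieves phrases).
    """
    # Similarity of the masked position's encoding to every corpus token.
    scores = corpus_vecs @ query_vec
    # Nonparametric distribution: softmax over corpus positions, not vocab.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Aggregate probability mass per surface token, since the same token
    # may occur at many corpus positions.
    token_prob = {}
    for tok, p in zip(corpus_tokens, probs):
        token_prob[tok] = token_prob.get(tok, 0.0) + p
    return max(token_prob, key=token_prob.get)

# Toy corpus: three token occurrences with 2-d "embeddings" (made up).
corpus_tokens = ["Seattle", "Seattle", "apple"]
corpus_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.0])  # encoding of the [MASK] position
print(nonparametric_mask_fill(query, corpus_vecs, corpus_tokens))
```

Because the candidate set is the corpus itself, the model can output tokens it never saw at training time — which is why the paper reports gains on rare and nearly unseen words.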
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Natural Language Inference | RTE | Accuracy | 61.7 | 367 |
| Subjectivity Classification | Subj | Accuracy | 75.5 | 266 |
| Text Classification | AG-News | Accuracy | 74.5 | 248 |
| Sentiment Classification | SST-2 | Accuracy | 87.2 | 174 |
| Sentiment Classification | MR | Accuracy | 83.7 | 148 |
| Sentiment Classification | CR | Accuracy | 81.2 | 142 |
| Text Classification | AGNews | Accuracy | 74.5 | 28 |
| Topic Classification | Yahoo Answers Topics | Accuracy | 53.9 | 26 |
| Sentiment Analysis | Rotten Tomatoes | Accuracy | 86.0 | 25 |
| Open-set knowledge retrieval | T-REx (All) | Macro-averaged EM | 34.5 | 19 |