Masked Language Model Scoring

About

Pretrained masked language models (MLMs) require finetuning for most NLP tasks. Instead, we evaluate MLMs out of the box via their pseudo-log-likelihood scores (PLLs), which are computed by masking tokens one by one. We show that PLLs outperform scores from autoregressive language models like GPT-2 in a variety of tasks. By rescoring ASR and NMT hypotheses, RoBERTa reduces an end-to-end LibriSpeech model's WER by 30% relative and adds up to +1.7 BLEU on state-of-the-art baselines for low-resource translation pairs, with further gains from domain adaptation. We attribute this success to PLL's unsupervised expression of linguistic acceptability without a left-to-right bias, greatly improving on scores from GPT-2 (+10 points on island effects, NPI licensing in BLiMP). One can finetune MLMs to give scores without masking, enabling computation in a single inference pass. In all, PLLs and their associated pseudo-perplexities (PPPLs) enable plug-and-play use of the growing number of pretrained MLMs; e.g., we use a single cross-lingual model to rescore translations in multiple languages. We release our library for language model scoring at https://github.com/awslabs/mlm-scoring.
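The masking procedure described in the abstract can be sketched in a few lines. This is a minimal illustration, not the API of the released library: `toy_predictor` is a hypothetical stand-in for a real MLM, and `MASK`, `pseudo_log_likelihood`, `pseudo_perplexity`, and `rescore` are names chosen only for this example. The PLL sums log P(w_t | all other tokens) over every position t, and the PPPL is the exponential of the negative PLL per token over a corpus. The weighted-sum form used in `rescore` is an assumption about how a PLL might be combined with a first-pass ASR or NMT score.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

MASK = "[MASK]"

# Hypothetical MLM interface: given a sequence with one masked position,
# return a probability distribution over the vocabulary for that position.
MaskedPredictor = Callable[[Sequence[str], int], Dict[str, float]]


def toy_predictor(masked_tokens: Sequence[str], position: int) -> Dict[str, float]:
    """Toy stand-in for an MLM: prefers 'cat' right after 'the', else uniform."""
    vocab = ["the", "cat", "sat"]
    if position > 0 and masked_tokens[position - 1] == "the":
        return {"the": 0.1, "cat": 0.8, "sat": 0.1}
    return {word: 1.0 / len(vocab) for word in vocab}


def pseudo_log_likelihood(tokens: Sequence[str], predict: MaskedPredictor) -> float:
    """Mask each token in turn and sum the log-probability of the true token."""
    total = 0.0
    for t, target in enumerate(tokens):
        masked = list(tokens)
        masked[t] = MASK  # one masked position per forward pass
        total += math.log(predict(masked, t)[target])
    return total


def pseudo_perplexity(corpus: Sequence[Sequence[str]], predict: MaskedPredictor) -> float:
    """exp of the negative total PLL divided by the number of tokens in the corpus."""
    n_tokens = sum(len(sentence) for sentence in corpus)
    total_pll = sum(pseudo_log_likelihood(sentence, predict) for sentence in corpus)
    return math.exp(-total_pll / n_tokens)


def rescore(
    hypotheses: Sequence[Tuple[List[str], float]],
    predict: MaskedPredictor,
    weight: float,
) -> List[str]:
    """Pick the hypothesis maximizing base_score + weight * PLL (assumed form)."""
    best_tokens, _ = max(
        hypotheses,
        key=lambda h: h[1] + weight * pseudo_log_likelihood(h[0], predict),
    )
    return best_tokens
```

With this toy predictor, "the cat sat" receives a higher PLL than "the sat cat", so a sufficiently large `weight` lets `rescore` overturn a first-pass ranking that preferred the less acceptable word order. Note that scoring a sentence of n tokens this way costs n forward passes, which is the cost that the finetuned, mask-free variant mentioned in the abstract avoids.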

Julian Salazar, Davis Liang, Toan Q. Nguyen, Katrin Kirchhoff • 2019

Related benchmarks

Task	Dataset	Metric	Result
Automatic Speech Recognition	LibriSpeech (dev-other)	WER	16.16	535
Linguistic Minimal Pair Scoring	BLiMP	Overall Accuracy	86.5	49
ASR rescoring	WSJ (test)	WER	6.46	35
ASR rescoring	LibriSpeech (test-other)	WER	10.33	21
ASR rescoring	LibriSpeech clean (test)	WER	5.25	21
ASR rescoring	MTDialogue (test)	WER	0.0905	11
ASR rescoring	ConvAI (test)	WER	5.38	11
ASR rescoring	VoxPopuli (test)	WER	10.33	11
ASR rescoring	SLURP (test)	WER	24.48	11
ASR rescoring	LibriSpeech (dev-clean)	WER	5.03	9

Showing 10 of 22 rows
