Mask-Predict: Parallel Decoding of Conditional Masked Language Models
About
Most machine translation systems generate text autoregressively from left to right. We, instead, use a masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation. This approach allows for efficient iterative decoding, where we first predict all of the target words non-autoregressively, and then repeatedly mask out and regenerate the subset of words that the model is least confident about. By applying this strategy for a constant number of iterations, our model improves state-of-the-art performance levels for non-autoregressive and parallel decoding translation models by over 4 BLEU on average. It is also able to reach within about 1 BLEU point of a typical left-to-right transformer model, while decoding significantly faster.
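The decoding loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `predict_fn` is a hypothetical stand-in for the trained conditional masked language model, and the linear masking schedule n = N·(T−t)/T follows the description in the text.

```python
def mask_predict(predict_fn, length, iterations):
    """Sketch of mask-predict iterative decoding.

    predict_fn(tokens, mask) -> (tokens, probs): a stand-in for the
    conditional masked LM; it fills every masked position in parallel
    and returns a per-position confidence.
    """
    MASK = None
    tokens = [MASK] * length              # start with a fully masked target
    for t in range(iterations):
        mask = [tok is MASK for tok in tokens]
        tokens, probs = predict_fn(tokens, mask)
        # linear schedule: re-mask the n least-confident positions,
        # where n shrinks to 0 over the fixed number of iterations
        n = int(length * (iterations - 1 - t) / iterations)
        if n == 0:
            break
        worst = sorted(range(length), key=lambda i: probs[i])[:n]
        for i in worst:
            tokens[i] = MASK
    return tokens


def toy_predict(tokens, mask):
    # Hypothetical toy "model" for demonstration only: fills each masked
    # slot with its position index and reports rising confidences.
    out = [i if m else tok for i, (tok, m) in enumerate(zip(tokens, mask))]
    probs = [0.5 + 0.1 * i for i in range(len(tokens))]
    return out, probs


print(mask_predict(toy_predict, 5, 3))  # → [0, 1, 2, 3, 4]
```

With a real model, `predict_fn` would condition on the source sentence as well; the key point is that every masked position is filled in parallel at each step, so the cost is a constant number of forward passes rather than one per target token.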
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Machine Translation | WMT En-De 2014 (test) | BLEU | 27.03 | 379 |
| Machine Translation | IWSLT De-En 2014 (test) | BLEU | 33.4 | 146 |
| Machine Translation | WMT 2014 (test) | BLEU | 30.86 | 100 |
| Machine Translation | IWSLT En-De 2014 (test) | BLEU | 22 | 92 |
| Machine Translation | WMT En-De '14 | BLEU | 18.12 | 89 |
| Machine Translation | WMT Ro-En 2016 (test) | BLEU | 33.31 | 82 |
| Machine Translation | WMT14 En-De newstest2014 (test) | BLEU | 27.03 | 65 |
| Machine Translation | WMT De-En 14 (test) | BLEU | 30.53 | 59 |
| Machine Translation | WMT 2016 (test) | BLEU | 33.06 | 58 |
| Machine Translation | WMT16 EN-RO (test) | BLEU | 33.08 | 56 |