On Losses for Modern Language Models
About
BERT set many state-of-the-art results over varied NLU benchmarks by pre-training over two tasks: masked language modelling (MLM) and next sentence prediction (NSP), the latter of which has been highly criticized. In this paper, we 1) clarify NSP's effect on BERT pre-training, 2) explore fourteen possible auxiliary pre-training tasks, of which seven are novel to modern language models, and 3) investigate different ways to include multiple tasks into pre-training. We show that NSP is detrimental to training due to its context splitting and shallow semantic signal. We also identify six auxiliary pre-training tasks -- sentence ordering, adjacent sentence prediction, TF prediction, TF-IDF prediction, a FastSent variant, and a Quick Thoughts variant -- that outperform a pure MLM baseline. Finally, we demonstrate that using multiple tasks in a multi-task pre-training framework provides better results than using any single auxiliary task. Using these methods, we outperform BERT Base on the GLUE benchmark using fewer than a quarter of the training tokens.
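The multi-task pre-training framework combines the MLM loss with one or more auxiliary task losses into a single training objective. A minimal sketch of that combination, assuming a simple weighted sum (the function name, weights, and task names here are illustrative assumptions, not the paper's exact formulation):

```python
def combined_pretraining_loss(task_losses, task_weights=None):
    """Combine per-task scalar losses (e.g. MLM, sentence ordering,
    TF-IDF prediction) into a single scalar for backpropagation.

    task_losses:  dict mapping task name -> scalar loss value
    task_weights: optional dict mapping task name -> weight
                  (tasks without an entry default to weight 1.0)
    """
    task_weights = task_weights or {}
    return sum(loss * task_weights.get(name, 1.0)
               for name, loss in task_losses.items())


# Illustrative values only -- weighting MLM fully and down-weighting
# two auxiliary tasks:
total = combined_pretraining_loss(
    {"mlm": 2.0, "sentence_ordering": 0.5, "tf_idf": 0.25},
    task_weights={"mlm": 1.0, "sentence_ordering": 0.5, "tf_idf": 0.5},
)
```

In practice each per-task loss would come from its own prediction head over a shared encoder; the weighted sum is one common way to trade off the MLM signal against the auxiliary signals.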
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Natural Language Understanding | GLUE (dev) | -- | 504 |
| Natural Language Processing | SuperGLUE 100 samples, excl. ReCoRD (dev) | Macro Avg Score: 59.03 | 13 |
| Natural Language Processing | SuperGLUE 1k samples, excl. ReCoRD (dev) | Macro Avg Score: 65.21 | 13 |
| Natural Language Processing | SuperGLUE Full, excl. ReCoRD (dev) | Macro Avg Score: 69.16 | 13 |
| Natural Language Processing | GLUE 1k samples (dev) | Macro Avg Score: 75.3 | 13 |
| Natural Language Processing | GLUE 100 samples (dev) | Macro Avg Score: 61.39 | 13 |