Muppet: Massive Multi-task Representations with Pre-Finetuning
About
We propose pre-finetuning, an additional large-scale learning stage between language model pre-training and fine-tuning. Pre-finetuning is massively multi-task learning (around 50 datasets, over 4.8 million total labeled examples) and is designed to encourage learning of representations that generalize better to many different tasks. We show that pre-finetuning consistently improves performance for pretrained discriminators (e.g., RoBERTa) and generation models (e.g., BART) on a wide range of tasks (sentence prediction, commonsense reasoning, MRC, etc.), while also significantly improving sample efficiency during fine-tuning. We also show that large-scale multi-tasking is crucial: with few tasks, pre-finetuning can hurt performance, up to a critical point (usually above 15 tasks) after which performance improves linearly in the number of tasks.
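A key practical question in massively multi-task learning is keeping losses from heterogeneous tasks on a comparable scale so that no single task dominates the shared objective. The sketch below illustrates one such scheme, dividing each task's cross-entropy loss by the log of its number of classes before averaging across tasks; the helper names (`scaled_task_loss`, `multitask_step`) are illustrative, not from the paper's released code.

```python
import math

def scaled_task_loss(raw_loss, num_classes):
    # Scale a task's cross-entropy loss by 1 / log(num_classes) so that
    # tasks with many labels (whose random-guess loss is higher) do not
    # dominate the shared multi-task objective. Illustrative sketch only.
    return raw_loss / math.log(num_classes)

def multitask_step(task_batches):
    # task_batches: list of (raw_loss, num_classes) pairs, one per task
    # drawn into the heterogeneous batch. Returns the aggregate loss
    # used for a single multi-task update.
    return sum(scaled_task_loss(l, n) for l, n in task_batches) / len(task_batches)
```

With this scaling, a task at its random-guess loss (`log(num_classes)` under uniform predictions) contributes exactly 1.0 regardless of how many labels it has, which puts binary and many-class tasks on equal footing at the start of training.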
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 86.4 | 1460 |
| Natural Language Inference | RTE | Accuracy | 39.44 | 367 |
| Physical Interaction Question Answering | PIQA | Accuracy | 55.47 | 323 |
| Boolean Question Answering | BoolQ | Accuracy | 74.27 | 307 |
| Question Answering | OBQA | Accuracy | 39.47 | 276 |
| Question Answering | BoolQ | Accuracy | 82.17 | 240 |
| Question Classification | TREC | Accuracy | 96.8 | 205 |
| Topic Classification | AG-News | Accuracy | 89.77 | 173 |
| Natural Language Understanding | GLUE (val) | SST-2 | 97.4 | 170 |
| Common Sense Reasoning | WinoGrande | Accuracy | 55.49 | 156 |