Multitask Prompted Training Enables Zero-Shot Task Generalization
About
Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language task into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts using diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at https://github.com/bigscience-workshop/t-zero and all prompts are available at https://github.com/bigscience-workshop/promptsource.
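The core of the approach is mapping each supervised example into several natural-language prompted forms. The sketch below illustrates that idea in plain Python; the template strings and function names are hypothetical stand-ins, not the actual promptsource templates or API.

```python
# Hypothetical sketch of turning one labeled example into multiple
# human-readable prompts with diverse wording (the template text below
# is illustrative, not taken from promptsource).

def apply_templates(example, templates):
    """Render a single example dict with every template string."""
    return [t.format(**example) for t in templates]

# Two differently worded templates for a natural language inference example.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?",
    '"{premise}" Based on that, is "{hypothesis}" true, false, or inconclusive?',
]

example = {
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is outdoors.",
}
prompts = apply_templates(example, NLI_TEMPLATES)
```

Training on the union of such prompted forms across many datasets is what lets a single model be evaluated zero-shot on tasks whose datasets were held out entirely.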
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multitask Language Understanding | MMLU | Accuracy | 36.9 | 842 |
| Question Answering | ARC Challenge | -- | -- | 749 |
| Reasoning | BBH | Accuracy | 13 | 507 |
| Question Answering | OpenBookQA | Accuracy | 59.11 | 465 |
| Natural Language Inference | RTE | Accuracy | 80.83 | 367 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 67.67 | 329 |
| Multitask Language Understanding | MMLU (test) | Accuracy | 43.2 | 303 |
| Arithmetic Reasoning | MultiArith | Accuracy | 3.2 | 181 |
| Common Sense Reasoning | WinoGrande | Accuracy | 59.94 | 156 |
| Common Sense Reasoning | COPA | Accuracy | 91.5 | 138 |
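Classification-style benchmarks such as RTE, PIQA, WinoGrande, and COPA are commonly scored zero-shot by rank classification: the model scores every candidate answer and the highest-scoring one is taken as the prediction. The sketch below shows that scheme with a toy word-overlap scorer standing in for a real model's log-likelihood; the function names and the scorer itself are illustrative assumptions.

```python
import math

def rank_classify(score_fn, prompt, options):
    """Pick the candidate answer with the highest length-normalized score.
    `score_fn(prompt, option)` plays the role of a model's log-likelihood
    of the option given the prompt."""
    return max(
        options,
        key=lambda o: score_fn(prompt, o) / max(len(o.split()), 1),
    )

def toy_score(prompt, option):
    """Toy stand-in scorer: counts word overlap with the prompt.
    A real setup would query a language model instead."""
    overlap = len(set(prompt.lower().split()) & set(option.lower().split()))
    return math.log1p(overlap)

prompt = "a dog is running in the park"
options = ["a dog is outside", "the stock market fell"]
pred = rank_classify(toy_score, prompt, options)
```

Length normalization keeps longer candidates from being unfairly penalized, since each extra token in an answer typically lowers its total log-likelihood.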