Scaling Instruction-Finetuned Language Models
About
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
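The core idea of the method, rephrasing supervised examples as natural-language instructions before finetuning, can be sketched in a few lines. The templates below are illustrative placeholders, not the paper's actual FLAN template collection, and the function name is hypothetical:

```python
# Minimal sketch of instruction templating: wrap a raw (question, answer)
# pair in an instruction-phrased prompt, producing an (input, target) pair
# suitable for seq2seq finetuning. Real instruction-finetuning mixtures use
# many templates per task to add phrasing diversity.

TEMPLATES = [
    "Answer the following question.\n\nQuestion: {question}",
    "{question}\n\nWhat is the answer?",
]

def to_instruction_example(question: str, answer: str, template_id: int = 0) -> dict:
    """Format one supervised example as an instruction-following example."""
    prompt = TEMPLATES[template_id].format(question=question)
    return {"input": prompt, "target": answer}

example = to_instruction_example(
    "What is the boiling point of water in Celsius?", "100"
)
print(example["input"])
# Answer the following question.
#
# Question: What is the boiling point of water in Celsius?
```

Scaling the number of tasks then amounts to applying such templates across many datasets and mixing the resulting examples into one finetuning corpus.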
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-task Language Understanding | MMLU | Accuracy | 75.2 | 842 |
| Reasoning | BBH | Accuracy | 57.9 | 507 |
| Mathematical Reasoning | GSM8K | Accuracy (GSM8K) | 78.43 | 358 |
| Multi-task Language Understanding | MMLU (test) | Accuracy | 75.2 | 303 |
| Instruction Following | IFEval | Accuracy (0-100) | 36.69 | 292 |
| Multi-hop Question Answering | 2WikiMultihopQA | EM | 25.9 | 278 |
| Question Answering | BoolQ | Accuracy | 89.6 | 240 |
| Summarization | XSum (test) | ROUGE-2 | 17.7 | 231 |
| Question Answering | SciQ | Accuracy | 95.7 | 226 |
| Multi-hop Question Answering | HotpotQA | -- | -- | 221 |