Finetuned Language Models Are Zero-Shot Learners
About
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B-parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves on the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of the 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
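To illustrate what "verbalizing tasks via natural language instruction templates" means, here is a minimal sketch in Python. The template wording, field names, and helper function are hypothetical, for illustration only; they are not FLAN's actual templates or code. The idea is that each raw dataset example (here, an NLI example with a premise and hypothesis) is rendered into several different natural-language instructions, so the model is finetuned on instructions rather than on a fixed task format.

```python
# Hypothetical sketch of verbalizing an NLI example with instruction
# templates. Template text and field names are illustrative assumptions,
# not FLAN's actual templates.

# Multiple phrasings per task increase instruction diversity during
# instruction tuning.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?\nOPTIONS: yes, no, maybe",
    "Read the premise and decide whether the hypothesis follows from it.\n"
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "OPTIONS: yes, no, maybe",
]

def verbalize(example: dict, template: str) -> str:
    """Fill an instruction template with the fields of a raw example."""
    return template.format(**example)

example = {
    "premise": "A soccer game with multiple males playing.",
    "hypothesis": "Some men are playing a sport.",
}

# Each example yields one training prompt per template.
prompts = [verbalize(example, t) for t in NLI_TEMPLATES]
```

During finetuning, one template is sampled per example, and the model is trained to emit the gold answer (e.g. "yes") given the verbalized prompt.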
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | OpenBookQA | Accuracy | 77.4 | 465 |
| Natural Language Inference | RTE | Accuracy | 79.9 | 367 |
| Instruction Following | IFEval | Accuracy (0-100) | 75.9 | 292 |
| Instruction Following | AlpacaEval 2.0 | LC Win Rate | 33.1 | 281 |
| Reading Comprehension | BoolQ | Accuracy | 83.6 | 219 |
| Natural Language Inference | SNLI | Accuracy | 62.3 | 174 |
| General Knowledge | MMLU | Accuracy | 67.7 | 170 |
| Mathematical Problem Solving | MATH | Accuracy | 51.7 | 166 |
| Question Answering | ARC | Accuracy | 71 | 154 |
| Natural Language Inference | MNLI (matched) | Accuracy | 60.8 | 110 |