Instruction Pre-Training: Language Models are Supervised Multitask Learners
About
Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.
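As a rough illustration of what "augmenting raw corpora with instruction-response pairs" could look like at the data level, the sketch below appends already-synthesized pairs to a raw document to form one pre-training example. It is not the authors' implementation: the `InstructionPair` type, the `augment_document` function, and the `Question:`/`Answer:` template are all hypothetical names chosen for this example, and the pairs are hard-coded where the paper would use its instruction synthesizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstructionPair:
    """One synthesized instruction-response pair (hypothetical container)."""
    instruction: str
    response: str


def augment_document(raw_text: str, pairs: List[InstructionPair]) -> str:
    """Append instruction-response pairs to a raw document.

    The "Question:/Answer:" template is an assumption made for this sketch,
    not the format used in the paper.
    """
    parts = [raw_text.strip()]
    for pair in pairs:
        parts.append(
            f"Question: {pair.instruction.strip()}\n"
            f"Answer: {pair.response.strip()}"
        )
    # Blank lines separate the raw text from each appended pair.
    return "\n\n".join(parts)


# Hard-coded pairs stand in for the output of an instruction synthesizer.
example = augment_document(
    "Water boils at 100 degrees Celsius at sea level.",
    [InstructionPair("At what temperature does water boil at sea level?",
                     "100 degrees Celsius.")],
)
print(example)
```

A document with no synthesized pairs passes through unchanged (apart from whitespace stripping), so the same function can handle a corpus in which only some documents are augmented.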
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Code Generation | HumanEval | -- | 1043 |
| Instruction Following | AlpacaEval | -- | 420 |
| Reasoning | MMLU-Pro | Accuracy 25.52 | 241 |
| Question Answering | MedMCQA | Accuracy 42.08 | 98 |
| Reasoning | GPQA | Accuracy 28.28 | 88 |
| Medical Reasoning | MedMCQA | Accuracy 42.08 | 58 |
| Reasoning | MMLU | Accuracy 54.11 | 54 |
| Language Understanding | MMLU (stratified sampling: 50 samples per category) | Accuracy 54.11 | 14 |
| Language Understanding | MMLU-Pro (stratified sampling: 150 samples per category) | Accuracy 25.52 | 14 |
| Knowledge-focused evaluation | MixEval Hard | Accuracy 16.7 | 8 |