Instruction Tuning with GPT-4
About
Prior work has shown that finetuning large language models (LLMs) on machine-generated instruction-following data enables such models to achieve remarkable zero-shot capabilities on new tasks, with no human-written instructions needed. In this paper, we present the first attempt to use GPT-4 to generate instruction-following data for LLM finetuning. Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following examples generated by GPT-4 lead to better zero-shot performance on new tasks than the instruction-following data generated by previous state-of-the-art models. We also collect feedback and comparison data from GPT-4 to enable comprehensive evaluation and reward model training. We make the data generated using GPT-4, as well as our codebase, publicly available.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | SVAMP (test) | Accuracy | 39 | 262 |
| Dialogue Alignment Evaluation | AlignBench | Reasoning | 5.64 | 90 |
| Multi-turn Dialogue Evaluation | MT-Bench-zh | Score | 5.44 | 90 |
| Reasoning | BBH (test) | Accuracy | 39.94 | 67 |
| Red-Teaming (Attack Success Rate) | DANGEROUSQA | ASR | 3.5 | 50 |
| Mathematical Reasoning | GSM (test) | Accuracy | 14.63 | 42 |
| Red Teaming | ADVERSARIALQA | ASR | 5.8 | 20 |
| Red Teaming | CATQA | ASR | 16.18 | 20 |
| Instruction Tuning | IT Evaluation Suite MMLU, BBH, GSM, TydiQA, CodeX, AE | MMLU | 55.7 | 18 |
| Instruction Following | AlpacaEval v1 (test) | AlpacaEval Score | 61.8 | 14 |