
Instruction Tuning with GPT-4

About

Prior work has shown that finetuning large language models (LLMs) on machine-generated instruction-following data enables such models to achieve remarkable zero-shot capabilities on new tasks, without the need for human-written instructions. In this paper, we present the first attempt to use GPT-4 to generate instruction-following data for LLM finetuning. Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following examples generated by GPT-4 lead to superior zero-shot performance on new tasks compared to instruction-following data generated by previous state-of-the-art models. We also collect feedback and comparison data from GPT-4 to enable comprehensive evaluation and reward model training. We make the data generated using GPT-4, as well as our codebase, publicly available.
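As a rough illustration of the data format involved (a sketch, not the paper's exact pipeline), machine-generated instruction-following data of this kind is typically stored as (instruction, input, output) records and rendered into training prompts with an Alpaca-style template. The template strings and helper below are assumptions for illustration only:

```python
# Hypothetical sketch: rendering (instruction, input, output) records into
# Alpaca-style training prompts. The template wording is an assumption.

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def format_example(example: dict) -> str:
    """Render one instruction-following record into a single training string."""
    if example.get("input"):
        prompt = PROMPT_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=example["instruction"])
    # The target response is appended after the prompt for supervised finetuning.
    return prompt + example["output"]

sample = {
    "instruction": "Translate the sentence to French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}
text = format_example(sample)
```

During finetuning, the loss would typically be computed only on the tokens after "### Response:", so the model learns to produce the output conditioned on the instruction and input.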

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, Jianfeng Gao • 2023

Related benchmarks

Task                                 Dataset                                                   Metric             Result  Rank
Mathematical Reasoning               SVAMP (test)                                              Accuracy           39      262
Dialogue Alignment Evaluation        AlignBench                                                Reasoning          5.64    90
Multi-turn Dialogue Evaluation       MT-Bench-zh                                               Score              5.44    90
Reasoning                            BBH (test)                                                Accuracy           39.94   67
Red Teaming (Attack Success Rate)    DANGEROUSQA                                               ASR                3.5     50
Mathematical Reasoning               GSM (test)                                                Accuracy           14.63   42
Red Teaming                          ADVERSARIALQA                                             ASR                5.8     20
Red Teaming                          CATQA                                                     ASR                16.18   20
Instruction Tuning                   IT Evaluation Suite (MMLU, BBH, GSM, TydiQA, CodeX, AE)   MMLU               55.7    18
Instruction Following                AlpacaEval v1 (test)                                      AlpacaEval Score   61.8    14

Showing 10 of 19 rows
