
MiniPLM: Knowledge Distillation for Pre-Training Language Models

About

Knowledge distillation (KD) is widely used to train small, high-performing student language models (LMs) using large teacher LMs. While effective in fine-tuning, KD during pre-training faces efficiency, flexibility, and effectiveness issues. Existing methods either incur high computational costs due to online teacher inference, require tokenization matching between teacher and student LMs, or risk losing the difficulty and diversity of the teacher-generated training data. In this work, we propose MiniPLM, a KD framework for pre-training LMs by refining the training data distribution with the teacher LM's knowledge. For efficiency, MiniPLM performs offline teacher inference, allowing KD for multiple student LMs without adding training costs. For flexibility, MiniPLM operates solely on the training corpus, enabling KD across model families. For effectiveness, MiniPLM leverages the differences between large and small LMs to enhance the training data difficulty and diversity, helping student LMs acquire versatile and sophisticated knowledge. Extensive experiments demonstrate that MiniPLM boosts the student LMs' performance on 9 common downstream tasks, improves language modeling capabilities, and reduces pre-training computation. The benefit of MiniPLM extends to larger training scales, evidenced by the scaling curve extrapolation. Further analysis reveals that MiniPLM supports KD across model families and enhances the pre-training data utilization. Our code, data, and models can be found at https://github.com/thu-coai/MiniPLM.
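The refinement step the abstract describes (using the differences between large and small LMs to enhance the difficulty and diversity of the training data) can be sketched as a log-likelihood-ratio filter over the corpus. This is a minimal illustration, not the paper's implementation: the function name `difference_sample` and the toy per-example scores are hypothetical, and a real pipeline would compute sequence log-likelihoods offline with actual teacher and small reference LMs.

```python
def difference_sample(corpus, teacher_logp, ref_logp, keep_ratio=0.5):
    """Rank corpus examples by the teacher-vs-reference log-likelihood
    ratio and keep the top fraction. Examples the large teacher LM
    prefers relative to a small reference LM are treated as harder and
    more diverse, and are retained for student pre-training."""
    # log p_teacher(x) - log p_ref(x) == log of the likelihood ratio
    scored = [(teacher_logp[x] - ref_logp[x], x) for x in corpus]
    scored.sort(reverse=True)
    k = max(1, int(len(scored) * keep_ratio))
    return [x for _, x in scored[:k]]

# Toy per-example sequence log-likelihoods (hypothetical numbers),
# as if computed once, offline, by the teacher and reference LMs.
corpus = ["a", "b", "c", "d"]
teacher = {"a": -10.0, "b": -25.0, "c": -12.0, "d": -30.0}
ref = {"a": -11.0, "b": -20.0, "c": -18.0, "d": -29.0}

refined = difference_sample(corpus, teacher, ref, keep_ratio=0.5)
print(refined)  # → ['c', 'a']
```

Because the teacher scores are produced once offline, the same refined corpus can then pre-train any number of student LMs, of any tokenizer or model family, at no extra teacher cost.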

Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Code Generation | HumanEval | – | – | 850 |
| Mathematical Reasoning | MATH | Accuracy | 30.84 | 535 |
| Instruction Following | IFEval | – | – | 292 |
| Science Question Answering | ARC Challenge | Accuracy | 53.75 | 234 |
| Graduate-level Question Answering | GPQA | Accuracy | 26.34 | 114 |
| Code Generation | MBPP | Accuracy | 48 | 90 |
| General Knowledge | MMLU-Pro | MMLU-Pro General Knowledge Score | 29.52 | 38 |
| Common Sense Reasoning | BBH | Accuracy | 57.1 | 27 |
| Aggregated Performance | Average 10 Tasks | Average Accuracy | 40.65 | 19 |
| Factuality | TruthfulQA | Accuracy | 29.13 | 18 |
