
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

About

In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that pruning based on the perplexity of a larger model can yield high-quality data, we investigate whether smaller models can be used for perplexity-based pruning and how pruning is affected by the domain composition of the data being pruned. We demonstrate that for multiple dataset compositions, perplexity-based pruning of pretraining data can significantly improve downstream task performance: pruning based on perplexities computed with a 125 million parameter model improves the average performance on downstream tasks of a 3 billion parameter model by up to 2.04 and achieves up to a 1.45× reduction in pretraining steps to reach commensurate baseline performance. Furthermore, we demonstrate that such perplexity-based data pruning also yields downstream performance gains in the over-trained and data-constrained regimes.
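
As a rough illustration of the idea in the abstract, the sketch below scores each sample by its perplexity under a reference model and keeps a fixed fraction of the data. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `keep_fraction` parameter, and the three selection criteria ("low", "medium", "high") are hypothetical, and the per-token log-probabilities are assumed to have already been computed by a small reference model. The abstract does not state which selection rule or pruning rate was used.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def prune_by_perplexity(
    ppls: Sequence[float],
    keep_fraction: float,
    criterion: str = "low",
) -> List[int]:
    """Return the sorted indices of the samples to keep.

    Assumption (not from the source): `criterion` picks which contiguous
    band of the perplexity ranking survives:
    "low" = lowest perplexity, "high" = highest, "medium" = centered band.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n = len(ppls)
    if n == 0:
        return []
    k = max(1, round(keep_fraction * n))              # number of samples kept
    order = sorted(range(n), key=lambda i: ppls[i])   # ascending perplexity
    if criterion == "low":
        start = 0
    elif criterion == "high":
        start = n - k
    elif criterion == "medium":
        start = (n - k) // 2
    else:
        raise ValueError(f"unknown criterion: {criterion!r}")
    return sorted(order[start:start + k])


# Toy per-token log-probs for four samples (hypothetical values; in practice
# these would come from a forward pass of the small reference model).
docs = [[-0.1, -0.2], [-2.0, -3.0], [-1.0, -1.0], [-0.5, -0.5]]
ppls = [perplexity(d) for d in docs]
kept_low = prune_by_perplexity(ppls, keep_fraction=0.5, criterion="low")
kept_medium = prune_by_perplexity(ppls, keep_fraction=0.5, criterion="medium")
kept_high = prune_by_perplexity(ppls, keep_fraction=0.5, criterion="high")
```

The larger model would then be pretrained only on the kept indices. Which band of the perplexity distribution works best may depend on the domain composition of the data, which is one of the questions the abstract raises.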

Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | MATH500 (test) | -- | 895 |
| Reasoning | BBH | Accuracy: 9.88 | 726 |
| Mathematical Reasoning | AIME 2024 (test) | -- | 209 |
| Mathematical Reasoning | AMC23 | Average@16: 74.6 | 63 |
| Mathematical Reasoning | Math Benchmarks Aggregate | -- | 62 |
| Math Reasoning | AMC 2023 (test) | Pass@1: 27.3 | 57 |
| Mathematical Reasoning | Olympiad | Avg@16 Accuracy: 62.9 | 47 |
| Math Reasoning | Olympiad | Average Rate @16: 60.9 | 38 |
| Mathematical Reasoning | GSM8K (test) | Pass@1 Accuracy: 75 | 34 |
| Commonsense Reasoning | StoryCloze | Accuracy: 67.34 | 34 |
Showing 10 of 26 rows
