Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
About
In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that pruning based on the perplexity of a larger model can yield high-quality data, we investigate whether smaller models can be used for perplexity-based pruning and how pruning is affected by the domain composition of the data being pruned. We demonstrate that for multiple dataset compositions, perplexity-based pruning of pretraining data can *significantly* improve downstream task performance: pruning based on perplexities computed with a 125-million-parameter model improves the average performance on downstream tasks of a 3-billion-parameter model by up to 2.04 and achieves up to a $1.45\times$ reduction in pretraining steps to reach commensurate baseline performance. Furthermore, we demonstrate that such perplexity-based data pruning also yields downstream performance gains in the over-trained and data-constrained regimes.
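To make the idea of perplexity-based pruning concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that per-token natural-log probabilities for each sample have already been computed by a small reference model. The function names, the `keep_fraction` parameter, and the `band` options (`"low"`, `"middle"`, `"high"`) are hypothetical and chosen for illustration; the abstract does not state which part of the perplexity distribution is kept or at what rate.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one sample: exp of the negative mean per-token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def prune_by_perplexity(samples, logprobs_per_sample, keep_fraction=0.5, band="middle"):
    """Keep a fraction of samples from one band of the perplexity ranking.

    samples:             list of documents (any type).
    logprobs_per_sample: per-token log-probabilities for each document, assumed
                         to come from a small reference model (not computed here).
    keep_fraction:       hypothetical selection rate in (0, 1].
    band:                hypothetical criterion: "low", "middle", or "high".
    """
    ppl = [perplexity(lp) for lp in logprobs_per_sample]
    # Indices of samples sorted from lowest to highest perplexity.
    order = sorted(range(len(samples)), key=lambda i: ppl[i])
    n_keep = max(1, round(keep_fraction * len(samples)))

    if band == "low":
        start = 0
    elif band == "high":
        start = len(order) - n_keep
    elif band == "middle":
        start = (len(order) - n_keep) // 2
    else:
        raise ValueError(f"unknown band: {band!r}")

    # Restore the original dataset order among the kept samples.
    kept = sorted(order[start:start + n_keep])
    return [samples[i] for i in kept]


# Toy usage with made-up log-probabilities (illustrative values only).
docs = ["a", "b", "c", "d"]
logprobs = [[-0.1, -0.1], [-1.0, -1.0], [-2.0, -2.0], [-3.0, -3.0]]
subset = prune_by_perplexity(docs, logprobs, keep_fraction=0.5, band="middle")
```

In practice the log-probabilities would come from scoring each document with the reference model, and the pruned subset would then be used to pretrain the larger model.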
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MATH500 (test) | -- | 895 |
| Reasoning | BBH | Accuracy 9.88 | 726 |
| Mathematical Reasoning | AIME 2024 (test) | -- | 209 |
| Mathematical Reasoning | AMC23 | Average@16 74.6 | 63 |
| Mathematical Reasoning | Math Benchmarks Aggregate | -- | 62 |
| Math Reasoning | AMC 2023 (test) | Pass@1 27.3 | 57 |
| Mathematical Reasoning | Olympiad | Avg@16 Accuracy 62.9 | 47 |
| Math Reasoning | Olympiad | Average Rate @16 60.9 | 38 |
| Mathematical Reasoning | GSM8K (test) | Pass@1 Accuracy 75 | 34 |
| Commonsense Reasoning | StoryCloze | Accuracy 67.34 | 34 |