How to Train Data-Efficient LLMs
About
The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories. Coverage sampling can recover the performance of the full data, while models trained on Ask-LLM data consistently outperform full-data training, even when we reject 90% of the original dataset, and converge up to 70% faster.
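The two samplers lend themselves to short sketches. Below is a minimal, illustrative Ask-LLM-style quality scorer: an instruction-tuned proxy model is prompted with a training example and the probability it assigns to "yes" is used as the example's quality score. The proxy model (`google/flan-t5-small`) and the prompt wording are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative Ask-LLM-style scorer: ask an instruction-tuned proxy model whether
# an example is worth pre-training on, and score it by P("yes").
# Model choice and prompt text below are assumptions, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-small"  # assumption: any instruction-tuned proxy LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

PROMPT = (
    "###\n{example}\n###\n"
    "Does the previous paragraph contain informative signal for pre-training a "
    "large language model? Answer yes or no."
)

def ask_llm_score(example: str) -> float:
    """Return P('yes') from the proxy model as a quality score in [0, 1]."""
    inputs = tokenizer(PROMPT.format(example=example),
                       return_tensors="pt", truncation=True)
    # Score only the first decoder step and compare the "yes" vs "no" logits.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    yes_id = tokenizer("yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("no", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

# In a full pipeline, every candidate document would be scored this way and only
# the top-scoring fraction (e.g. the top 10%) retained for pre-training.
```

The coverage-oriented sampler can be sketched in a similarly hedged way: embed each example, estimate its local density in the feature space, and down-weight dense regions so the selected subset spreads over the distribution. The Gaussian kernel, bandwidth, anchor-based approximation, and inverse-propensity selection below are illustrative choices, not the paper's exact Density procedure.

```python
# Illustrative Density-style coverage sampler: approximate each example's density
# with a kernel sum over random anchor points, then sample inversely to density
# so that sparse regions of the feature space are covered.
import numpy as np

def kernel_density_scores(embeddings: np.ndarray, bandwidth: float = 1.0,
                          num_anchors: int = 1024, seed: int = 0) -> np.ndarray:
    """Approximate per-example density as a mean Gaussian kernel to random anchors."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(embeddings),
                     size=min(num_anchors, len(embeddings)), replace=False)
    anchors = embeddings[idx]
    # Squared Euclidean distance from every example to every anchor.
    d2 = ((embeddings[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2)).mean(axis=1)

def density_sample(embeddings: np.ndarray, keep_fraction: float = 0.1,
                   seed: int = 0) -> np.ndarray:
    """Sample indices with probability inversely proportional to estimated density."""
    density = kernel_density_scores(embeddings, seed=seed)
    weights = 1.0 / (density + 1e-12)
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    k = int(keep_fraction * len(embeddings))
    return rng.choice(len(embeddings), size=k, replace=False, p=probs)
```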
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH (test) | Pass@1 | 69.8 | 151 |
| Multiple-choice Question Answering | ARC Easy (test) | Accuracy | 66.88 | 50 |
| Multiple-choice Question Answering | ARC Challenge (test) | Accuracy | 35.15 | 26 |
| Instruction Tuning | Natural Instructions Meta Non-IID | Rouge-L | 30.49 | 22 |
| Instruction Tuning | Dolly-15K (alpha=0.5) | Rouge-L | 33.71 | 22 |
| Instruction Tuning | Dolly-15K (alpha=5.0) | Rouge-L | 33.88 | 22 |
| Multiple-choice Question Answering | MMLU (test) | Accuracy | 25.55 | 12 |
| Federated Learning | Dolly-15K | Speedup | 9.42 | 10 |
| Federated Learning | Natural Instructions (NI) | Speedup | 10.5 | 10 |