How to Train Data-Efficient LLMs
About
The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories. Coverage sampling can recover the performance of the full data, while models trained on Ask-LLM data consistently outperform full-data training, even when we reject 90% of the original dataset, and converge up to 70% faster.
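The two samplers lend themselves to short sketches. Below is a minimal, illustrative Ask-LLM-style quality scorer: an instruction-tuned proxy model is prompted with a training example and the probability it assigns to "yes" is used as the example's quality score. The proxy model (`google/flan-t5-small`) and the prompt wording are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative Ask-LLM-style scorer: ask an instruction-tuned proxy model whether
# an example is worth pre-training on, and score it by P("yes").
# Model choice and prompt text below are assumptions, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-small"  # assumption: any instruction-tuned proxy LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

PROMPT = (
    "###\n{example}\n###\n"
    "Does the previous paragraph contain informative signal for pre-training a "
    "large language model? Answer yes or no."
)

def ask_llm_score(example: str) -> float:
    """Return P('yes') from the proxy model as a quality score in [0, 1]."""
    inputs = tokenizer(PROMPT.format(example=example),
                       return_tensors="pt", truncation=True)
    # Score only the first decoder step and compare the "yes" vs "no" logits.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    yes_id = tokenizer("yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("no", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

# In a full pipeline, every candidate document would be scored this way and only
# the top-scoring fraction (e.g. the top 10%) retained for pre-training.
```

The coverage-oriented sampler can be sketched in a similarly hedged way: embed each example, estimate its local density in the feature space, and down-weight dense regions so the selected subset spreads over the distribution. The Gaussian kernel, bandwidth, anchor-based approximation, and inverse-propensity selection below are illustrative choices, not the paper's exact Density procedure.

```python
# Illustrative Density-style coverage sampler: approximate each example's density
# with a kernel sum over random anchor points, then sample inversely to density
# so that sparse regions of the feature space are covered.
import numpy as np

def kernel_density_scores(embeddings: np.ndarray, bandwidth: float = 1.0,
                          num_anchors: int = 1024, seed: int = 0) -> np.ndarray:
    """Approximate per-example density as a mean Gaussian kernel to random anchors."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(embeddings),
                     size=min(num_anchors, len(embeddings)), replace=False)
    anchors = embeddings[idx]
    # Squared Euclidean distance from every example to every anchor.
    d2 = ((embeddings[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2)).mean(axis=1)

def density_sample(embeddings: np.ndarray, keep_fraction: float = 0.1,
                   seed: int = 0) -> np.ndarray:
    """Sample indices with probability inversely proportional to estimated density."""
    density = kernel_density_scores(embeddings, seed=seed)
    weights = 1.0 / (density + 1e-12)
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    k = int(keep_fraction * len(embeddings))
    return rng.choice(len(embeddings), size=k, replace=False, p=probs)
```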
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH (test) | Pass@1 | 69.8 | 151 |
| Multiple-choice Question Answering | ARC Easy (test) | Accuracy | 66.88 | 50 |
| Multiple-choice Question Answering | ARC Challenge (test) | Accuracy | 35.15 | 26 |
| Instruction Tuning | Natural Instructions Meta Non-IID | Rouge-L | 30.49 | 22 |
| Instruction Tuning | Dolly-15K (alpha=0.5) | Rouge-L | 33.71 | 22 |
| Instruction Tuning | Dolly-15K (alpha=5.0) | Rouge-L | 33.88 | 22 |
| Multiple-choice Question Answering | MMLU (test) | Accuracy | 25.55 | 12 |
| Federated Learning | Dolly-15K | Speedup | 9.42 | 10 |
| Federated Learning | Natural Instructions (NI) | Speedup | 10.5 | 10 |