Improving Data Efficiency via Curating LLM-Driven Rating Systems
About
Instruction tuning is critical for adapting large language models (LLMs) to downstream tasks, and recent studies have demonstrated that small amounts of human-curated data can outperform larger datasets, challenging traditional data scaling laws. While LLM-based data quality rating systems offer a cost-effective alternative to human annotation, they often suffer from inaccuracies and biases, even in powerful models like GPT-4. In this work, we introduce DS2, a Diversity-aware Score curation method for Data Selection. By systematically modeling error patterns through a score transition matrix, DS2 corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that "more can be less."
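The two ideas above — correcting noisy LLM ratings via a score transition matrix and then selecting for diversity — can be sketched as follows. This is an illustrative stand-in, not the authors' implementation: the uniform prior over true scores, the 6-level rating scale, and the additive quality-plus-distance selection objective are all assumptions made for the example.

```python
import numpy as np

def correct_scores(raw_scores, T):
    """Correct LLM-assigned quality scores using an estimated score
    transition matrix T, where T[i, j] approximates the probability that a
    sample whose true score is i gets rated j by the LLM.
    Returns, for each sample, the most likely true score under a uniform
    prior over true scores (an assumption for this sketch)."""
    raw_scores = np.asarray(raw_scores)
    # With a uniform prior, the posterior over true scores given an observed
    # rating j is proportional to column T[:, j]; take its argmax.
    return np.array([int(np.argmax(T[:, j])) for j in raw_scores])

def diverse_select(embeddings, scores, k):
    """Greedy diversity-aware selection: start from the highest-scoring
    sample, then repeatedly add the sample maximizing a simple
    quality + average-distance-to-selected objective (hypothetical
    trade-off, chosen for illustration)."""
    n = len(scores)
    chosen = [int(np.argmax(scores))]
    while len(chosen) < k:
        remaining = [i for i in range(n) if i not in chosen]
        def gain(i):
            # Mean Euclidean distance to already-selected samples.
            dist = np.mean([np.linalg.norm(embeddings[i] - embeddings[c])
                            for c in chosen])
            return scores[i] + dist
        chosen.append(max(remaining, key=gain))
    return chosen
```

For example, with an identity transition matrix (i.e., the rater makes no systematic errors) the corrected scores equal the raw scores; a non-identity matrix estimated from error patterns would shift unreliable ratings toward their most plausible true level.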
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Commonsense Reasoning | HellaSwag | -- | 1891 |
| Reading Comprehension | BoolQ | Accuracy: 83.45 | 279 |
| Logical Reasoning | LogiQA | Accuracy: 27.44 | 181 |
| Multilingual Question Answering | TyDiQA | Accuracy: 55.7 | 65 |
| Question Answering | TruthfulQA | Score: 49.57 | 61 |
| Language Understanding | MMLU | Score (×100): 65.77 | 21 |
| Science Question Answering | ARC-C | Score (×100): 53.49 | 21 |