OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration
About
As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, as shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves strong results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. When combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Finally, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS using only 0.5B tokens outperforms full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
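The selection loop described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the diagonal preconditioner `precond` is an assumed stand-in for the optimizer-induced update, the target direction is taken to be the raw proxy gradient, the Ghost technique is omitted, and all function names (`count_sketch`, `projected_utility`, `boltzmann_select`) are hypothetical.

```python
import numpy as np

def count_sketch(vectors, buckets, signs, width):
    """Compress each row of `vectors` (n, d) to `width` dims with a CountSketch."""
    out_t = np.zeros((width, vectors.shape[0]))
    # Coordinate j is multiplied by signs[j] and accumulated into bucket buckets[j].
    np.add.at(out_t, buckets, (vectors * signs).T)
    return out_t.T

def projected_utility(cand_grads, proxy_grad, precond, width=64, seed=0):
    """Score candidates by projecting their (assumed) effective updates onto a target."""
    rng = np.random.default_rng(seed)
    d = cand_grads.shape[1]
    buckets = rng.integers(0, width, size=d)
    signs = rng.choice([-1.0, 1.0], size=d)
    # Assumption: the optimizer acts as a diagonal preconditioner on the gradient.
    updates = cand_grads * precond
    sk_updates = count_sketch(updates, buckets, signs, width)
    sk_target = count_sketch(proxy_grad[None, :], buckets, signs, width)[0]
    # Inner products in sketch space approximate inner products in the full space.
    return sk_updates @ sk_target

def boltzmann_select(scores, k, temperature=1.0, rng=None):
    """Sample k distinct indices with probability proportional to exp(score / T)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = (scores - scores.max()) / temperature  # shift for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(scores), size=k, replace=False, p=probs)

# Toy usage with random stand-ins for per-example and proxy gradients.
rng = np.random.default_rng(0)
cand_grads = rng.normal(size=(32, 256))
proxy_grad = rng.normal(size=256)
precond = 1.0 / (np.abs(rng.normal(size=256)) + 1.0)
scores = projected_utility(cand_grads, proxy_grad, precond)
selected = boltzmann_select(scores, k=8, temperature=5.0, rng=rng)
```

Sampling from a Boltzmann distribution rather than taking the top-k scores keeps some lower-scoring candidates in each batch, which is one way to preserve the data diversity the abstract mentions; the temperature controls how close the sampling is to greedy selection.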
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Reasoning | BBH | Accuracy 11.02 | 507 |
| Commonsense Reasoning | StoryCloze | Accuracy 67.13 | 34 |
| Reading Comprehension | RACE-m | Accuracy 0.2577 | 28 |
| Zero-shot Language Understanding and Reasoning | BENCH-PROXY (MMLU, ANLI, HellaSwag, PIQA, SIQA, W.G., ARC-E, ARC-C, C.QA, WSC) (test) | MMLU 33.83 | 24 |
| Reading Comprehension | RACE | -- | 12 |
| Natural Language Inference | AX-b | Accuracy 58.42 | 9 |
| Natural Language Inference | AX-g | Accuracy 50.56 | 9 |