
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

About

Data curation is a critical yet under-explored area in large language model (LLM) training. Existing methods, such as data selection and mixing, operate in an offline paradigm, detaching themselves from training. This separation introduces engineering overhead and makes the curation brittle: the entire pipeline must be re-run under model/task shifts. Moreover, offline methods alter data size through hard filtering or resampling, often sacrificing data diversity and harming generalization. We propose to rethink data curation as an online reweighting problem, where sample importance is dynamically adjusted during training via loss weighting rather than static pre-processing. Specifically, we introduce ADAPT (Adaptive Data reweighting for Pretraining and FineTuning), a dynamic online framework that reweights training samples with adaptive per-sample learning rates guided by similarity-based quality signals, without changing the number of training samples. Unlike offline methods that enforce a static data distribution, ADAPT acts as an implicit curriculum learner, progressively shifting focus from coarse-grained patterns to fine-grained semantic distinctions as the model evolves. Experiments on both instruction tuning and large-scale pretraining show that ADAPT consistently outperforms offline selection/mixing and prior online methods, achieving stronger cross-benchmark generalization under equal FLOPs.

Wanru Zhao, Yihong Chen, Yuzhi Tang, Wentao Ma, Shengchao Hu, Shell Xu Hu, Alex Iacob, Abhinav Mehrotra, Nicholas D. Lane • 2026
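
The abstract describes reweighting samples through the loss instead of filtering them, so the number of training samples stays the same. The sketch below illustrates only that general idea and is not the ADAPT algorithm: the softmax over scores, the `temperature` parameter, the function names, and the example numbers are all assumptions made for illustration, since the abstract does not specify how the similarity-based quality signal is computed or mapped to weights.

```python
import math


def similarity_weights(scores, temperature=1.0):
    """Map per-sample quality scores to loss weights (hypothetical scheme).

    Uses a softmax over the scores, rescaled so the weights average to 1.
    Every sample keeps a nonzero weight, so no data is dropped.
    """
    if not scores:
        return []
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    n = len(scores)
    return [n * e / total for e in exps]


def reweighted_loss(per_sample_losses, scores, temperature=1.0):
    """Weighted mean of per-sample losses for one batch."""
    if len(per_sample_losses) != len(scores):
        raise ValueError("losses and scores must have the same length")
    if not per_sample_losses:
        raise ValueError("batch must not be empty")
    weights = similarity_weights(scores, temperature)
    weighted = sum(w * l for w, l in zip(weights, per_sample_losses))
    return weighted / len(per_sample_losses)


# Illustrative batch: made-up losses and made-up quality scores.
losses = [2.0, 1.0, 0.5, 3.0]
scores = [0.9, 0.1, 0.4, 0.7]
weights = similarity_weights(scores, temperature=0.5)
batch_loss = reweighted_loss(losses, scores, temperature=0.5)
```

With equal scores the weights are all 1 and the result reduces to the ordinary mean loss. In an online setting the scores would be recomputed as the model changes, which is what distinguishes this from a one-time offline filter.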

Related benchmarks

Task                     Dataset       Result                    Rank
Commonsense Reasoning    WinoGrande    Accuracy 50.99            1442
Question Answering       ARC-E         Accuracy 39.44            523
Question Answering       OpenBookQA    Accuracy 15.4             305
Question Answering       ARC-C         Accuracy 19.11            258
Common Sense Reasoning   COPA          Accuracy 64               256
Logical reasoning        LogiQA        LogiQA Accuracy 21.66     251
Commonsense Reasoning    PIQA          Accuracy 61.48            213
Commonsense Reasoning    SocialIQA     Accuracy 37.05            158
Reading Comprehension    MultiRC       MultiRC Accuracy 56.53    25
Reading Comprehension    RACE          Score 27.39               15
Showing 10 of 13 rows
