Coresets from Trajectories: Selecting Data via Correlation of Loss Differences

About

Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences (CLD), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. CLD is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for CLD-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, CLD-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1% of more computationally expensive baselines even when not leading. CLD transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with <1% degradation. Moreover, CLD is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, CLD exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make CLD a principled, efficient, stable, and transferable tool for scalable dataset optimization.
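To make the idea in the abstract concrete, here is a minimal NumPy sketch of a CLD-style score. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formula, so this sketch assumes Pearson correlation between each training sample's checkpoint-to-checkpoint loss differences and the mean loss differences of validation samples from the same class, followed by top-k selection per class. The function names `cld_scores` and `select_coreset` and the `(checkpoints, samples)` array layout are hypothetical.

```python
import numpy as np

def cld_scores(train_losses, val_losses, train_labels, val_labels):
    """Correlation of loss differences (illustrative sketch).

    train_losses: (T, N) per-sample training loss at T checkpoints.
    val_losses:   (T, M) per-sample validation loss at the same checkpoints.
    Returns one score per training sample (length N).
    """
    # Loss differences between consecutive checkpoints.
    d_train = np.diff(train_losses, axis=0)  # (T-1, N)
    d_val = np.diff(val_losses, axis=0)      # (T-1, M)
    scores = np.zeros(train_losses.shape[1])
    for c in np.unique(train_labels):
        tr_idx = np.where(train_labels == c)[0]
        va_idx = np.where(val_labels == c)[0]
        if va_idx.size == 0:
            continue  # no validation reference for this class; score stays 0
        # Assumption: the class reference is the mean validation trajectory.
        ref = d_val[:, va_idx].mean(axis=1)
        ref_c = ref - ref.mean()
        x = d_train[:, tr_idx]
        x_c = x - x.mean(axis=0, keepdims=True)
        # Pearson correlation of each training column with the reference.
        denom = np.linalg.norm(x_c, axis=0) * np.linalg.norm(ref_c)
        scores[tr_idx] = (x_c * ref_c[:, None]).sum(axis=0) / np.maximum(denom, 1e-12)
    return scores

def select_coreset(scores, labels, per_class):
    """Keep the `per_class` highest-scoring samples of every class."""
    chosen = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        top = idx[np.argsort(scores[idx])[::-1][:per_class]]
        chosen.extend(top.tolist())
    return np.array(sorted(chosen))
```

Only per-sample loss values saved at checkpoints are needed as input, which is the efficiency property the abstract emphasizes; no gradients or curvature terms appear anywhere in the sketch.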

Manish Nagaraj, Deepak Ravikumar, Kaushik Roy • 2025

Related benchmarks

Task                              Dataset        Result          Rank
Reasoning                         BBH            Accuracy 36.12  726
Social Commonsense Reasoning      SocialIQA      Accuracy 43.25  143
Commonsense Inference             HellaSwag      Accuracy 48.35  123
Commonsense Reasoning             CommonsenseQA  Accuracy 33.1   19
Multitask Language Understanding  MMLU           Accuracy 46.13  11
Question Answering                TyDiQA         Accuracy 45.58  11