Coresets from Trajectories: Selecting Data via Correlation of Loss Differences

About

Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences (CLD), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. CLD is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for CLD-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, CLD-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1% of more computationally expensive baselines even when not leading. CLD transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with <1% degradation. Moreover, CLD is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, CLD exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make CLD a principled, efficient, stable, and transferable tool for scalable dataset optimization.
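The abstract only says that CLD uses per-sample loss values recorded at training checkpoints and measures alignment with validation loss trajectories, per class. A minimal sketch of that idea, under assumptions not stated above, might look like the following. It assumes the "loss difference" is the change in loss between consecutive checkpoints, that alignment is the Pearson correlation with the class-wise mean validation loss difference, and that the highest-scoring samples are kept. The function names (`loss_differences`, `cld_scores`, `select_coreset`) are hypothetical, and the paper's exact definition may differ.

```python
import math


def loss_differences(trajectory):
    """Change in loss between consecutive checkpoints (assumed definition)."""
    return [later - earlier for earlier, later in zip(trajectory, trajectory[1:])]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cld_scores(train_losses, val_losses):
    """Score each training sample against the mean validation loss difference.

    train_losses[i] and val_losses[j] are lists of per-checkpoint losses
    for one sample, all of the same length.
    """
    val_diffs = [loss_differences(t) for t in val_losses]
    steps = len(val_diffs[0])
    mean_val_diff = [
        sum(d[s] for d in val_diffs) / len(val_diffs) for s in range(steps)
    ]
    return [pearson(loss_differences(t), mean_val_diff) for t in train_losses]


def select_coreset(train_losses, train_labels, val_losses, val_labels, per_class):
    """Keep the `per_class` highest-scoring training samples of each class.

    Assumes every class in train_labels also appears in val_labels.
    """
    selected = []
    for c in sorted(set(train_labels)):
        train_idx = [i for i, y in enumerate(train_labels) if y == c]
        val_c = [val_losses[j] for j, y in enumerate(val_labels) if y == c]
        scores = cld_scores([train_losses[i] for i in train_idx], val_c)
        ranked = sorted(zip(scores, train_idx), key=lambda pair: -pair[0])
        selected.extend(i for _, i in ranked[:per_class])
    return selected


# Toy example: three training samples, two validation samples, four checkpoints.
train_losses = [
    [1.0, 0.5, 0.4, 0.1],
    [1.0, 0.9, 0.5, 0.4],
    [1.0, 1.2, 0.9, 1.1],
]
train_labels = [0, 0, 1]
val_losses = [[1.0, 0.5, 0.4, 0.1], [1.0, 1.2, 0.9, 1.1]]
val_labels = [0, 1]
coreset = select_coreset(train_losses, train_labels, val_losses, val_labels, per_class=1)
```

Because the scores depend only on stored loss values, a sketch like this needs no gradient or curvature computation, which is the efficiency property the abstract highlights. Ranking within each class mirrors the per-class validation alignment mentioned above.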

Manish Nagaraj, Deepak Ravikumar, Kaushik Roy • 2025

Related benchmarks

Task	Dataset	Result
Reasoning	BBH	Accuracy 36.12	770
Commonsense Inference	HellaSwag	Accuracy 48.35	171
Social Commonsense Reasoning	SocialIQA	Accuracy 43.25	150
Commonsense Reasoning	CommonsenseQA	Accuracy 33.1	19
Multitask Language Understanding	MMLU	Accuracy 46.13	11
Question Answering	TyDiQA	Accuracy 45.58	11
