
Constraint-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments

About

Attributing model behavior to training data is an evolving research field. A common benchmark is data removal, which involves eliminating data instances with either low or high values, then assessing a model's performance trained on the modified dataset. Many existing studies leverage Shapley-based data values for this task. In this paper, we demonstrate that these data values are not optimally suited for pruning low-value data when only a limited amount of data remains. To address this limitation, we introduce the Constraint-Data-Value-Maximization (CDVM) approach, which effectively utilizes data attributions for pruning in low-data scenarios. By casting pruning as a constrained optimization that both maximizes total influence and penalizes excessive per-test contributions, CDVM delivers robust performance when only a small fraction of the data is retained. On the OpenDataVal benchmark, CDVM shows strong performance and competitive runtime.
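To make the idea of "maximizing total influence while penalizing excessive per-test contributions" concrete, here is a minimal sketch. It is not the authors' formulation, which the abstract does not spell out. It assumes a hypothetical attribution matrix whose entry [i, j] is the contribution of training point i to test point j, a hypothetical per-test cap `cap`, and a hypothetical weight `penalty`. It also swaps in a simple greedy search in place of whatever constrained solver the paper uses.

```python
import numpy as np


def penalized_objective(attributions, keep_mask, cap, penalty):
    """Total influence of the kept points minus a penalty on per-test excess.

    attributions: (n_train, n_test) array; entry [i, j] is the assumed
    contribution of training point i to test point j (hypothetical input).
    """
    per_test = attributions[keep_mask].sum(axis=0)  # influence received by each test point
    excess = np.maximum(per_test - cap, 0.0)        # amount above the assumed per-test cap
    return per_test.sum() - penalty * excess.sum()


def greedy_prune(attributions, n_keep, cap, penalty=1.0):
    """Greedily pick n_keep training points under the penalized objective.

    Illustrative stand-in only: a greedy heuristic, not an exact solver.
    """
    keep_mask = np.zeros(attributions.shape[0], dtype=bool)
    for _ in range(n_keep):
        best_idx, best_val = -1, -np.inf
        for i in np.flatnonzero(~keep_mask):
            keep_mask[i] = True  # tentatively add candidate i
            val = penalized_objective(attributions, keep_mask, cap, penalty)
            keep_mask[i] = False
            if val > best_val:
                best_idx, best_val = i, val
        keep_mask[best_idx] = True
    return np.flatnonzero(keep_mask)


# Toy data: three training points all support test point 0,
# and only one supports test point 1.
toy = np.array([
    [1.0, 0.0],
    [0.9, 0.0],
    [0.0, 0.5],
    [0.8, 0.0],
])

top_by_value = np.sort(np.argsort(-toy.sum(axis=1))[:2])  # plain top-k by summed value
kept = greedy_prune(toy, n_keep=2, cap=1.0, penalty=2.0)
print("top-k by value:", top_by_value.tolist())
print("penalized greedy:", kept.tolist())
```

In this toy case, ranking by summed value keeps two points that both serve the same test point, while the penalized selection keeps one point per test point. This is the kind of coverage effect that matters when only a small fraction of the data is retained.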

Danilo Brajovic, David A. Kreplin, Marco F. Huber • 2026

Related benchmarks

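| Task                 | Dataset | Result        | Rank |
|----------------------|---------|---------------|------|
| Classification       | Adult   | Accuracy 73.5 | 86   |
| Image Classification | CIFAR10 | Accuracy 60.5 | 70   |
| Classification       | BBC     | Accuracy 95   | 61   |
| Classification       | IMDB    | Accuracy 81.3 | 56   |
| Classification       | nomao   | Accuracy 87.2 | 46   |
| Classification       | POL     | Accuracy 79.6 | 36   |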
