Constraint-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments

About

Attributing model behavior to training data is an evolving research field. A common benchmark is data removal, which involves eliminating data instances with either low or high values, then assessing a model's performance trained on the modified dataset. Many existing studies leverage Shapley-based data values for this task. In this paper, we demonstrate that these data values are not optimally suited for pruning low-value data when only a limited amount of data remains. To address this limitation, we introduce the Constraint-Data-Value-Maximization (CDVM) approach, which effectively utilizes data attributions for pruning in low-data scenarios. By casting pruning as a constrained optimization that both maximizes total influence and penalizes excessive per-test contributions, CDVM delivers robust performance when only a small fraction of the data is retained. On the OpenDataVal benchmark, CDVM shows strong performance and competitive runtime.

Danilo Brajovic, David A. Kreplin, Marco F. Huber • 2026
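The abstract describes pruning as a constrained optimization that rewards total influence while penalizing excessive per-test contributions. The sketch below illustrates that general idea only; it is not the authors' CDVM formulation or solver. It assumes a hypothetical attribution matrix in which entry `[i, j]` is the influence of training point `i` on test point `j`, and it swaps in a simple greedy heuristic for whatever optimization the paper actually uses. The function name and the `cap` and `lam` parameters are illustrative assumptions.

```python
import numpy as np


def greedy_penalized_selection(attributions, k, cap, lam=1.0):
    """Greedily keep k training points (illustrative heuristic, not CDVM itself).

    attributions: (n_train, n_test) array; [i, j] is the assumed influence
        of training point i on test point j.
    cap: per-test contribution above this level is penalized (assumed knob).
    lam: penalty weight for contribution above the cap (assumed knob).

    Objective for a kept set S, with t_j = sum of attributions[i, j] over i in S:
        sum_j t_j  -  lam * sum_j max(0, t_j - cap)
    """
    attributions = np.asarray(attributions, dtype=float)
    n_train, n_test = attributions.shape
    if k > n_train:
        raise ValueError("k cannot exceed the number of training points")

    def objective(totals):
        # totals has shape (..., n_test); reduce over the test axis.
        excess = np.maximum(totals - cap, 0.0)
        return totals.sum(axis=-1) - lam * excess.sum(axis=-1)

    selected = []
    remaining = np.ones(n_train, dtype=bool)
    totals = np.zeros(n_test)
    for _ in range(k):
        # Objective gain from adding each candidate row to the current totals.
        gains = objective(totals + attributions) - objective(totals)
        gains[~remaining] = -np.inf  # never pick the same point twice
        best = int(np.argmax(gains))
        selected.append(best)
        remaining[best] = False
        totals += attributions[best]
    return selected


# Toy data: points 0 and 1 both support test point 0; point 2 supports test point 1.
toy = np.array([[5.0, 0.0],
                [4.0, 0.0],
                [0.0, 3.0]])
picked = greedy_penalized_selection(toy, k=2, cap=5.0, lam=1.0)
print(picked)  # [0, 2]
```

In this toy case, ranking by total value alone would keep points 0 and 1 and leave test point 1 with no supporting data. The penalty on contribution above the cap makes point 2 the better second pick, which is the kind of behavior the abstract attributes to constraining per-test contributions when little data is retained.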

Related benchmarks

Task	Dataset	Result
Classification	Adult	Accuracy 73.5	86
Image Classification	CIFAR10	Accuracy 60.5	70
Classification	IMDB	Accuracy 81.3	62
Classification	BBC	Accuracy 95	61
Classification	nomao	Accuracy 87.2	46
Classification	POL	Accuracy 79.6	36
