Towards Efficient Data Valuation Based on the Shapley Value
About
"How much is my data worth?" is an increasingly common question posed by organizations and individuals alike. An answer to this question could allow, for instance, fairly distributing profits among multiple data contributors and determining prospective compensation when data breaches happen. In this paper, we study the problem of data valuation by utilizing the Shapley value, a popular notion of value which originated in cooperative game theory. The Shapley value defines a unique payoff scheme that satisfies many desiderata for the notion of data value. However, the Shapley value often requires exponential time to compute. To meet this challenge, we propose a repertoire of efficient algorithms for approximating the Shapley value. We also demonstrate the value of each training instance for various benchmark datasets.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Image Classification | Fashion MNIST | Accuracy86.8 | 16 | |
| Data Valuation | UCI ADULT (train) | p10.22 | 5 | |
| Data Valuation | Synthetic n=3,000 | Wall-clock Time (s)1.07e+3 | 5 | |
| Data Valuation | UCI Adult | Wall-clock Time (min)48 | 5 | |
| Data Valuation | Fashion MNIST | Wall-clock Time (hr)3.6 | 5 | |
| Data Valuation | Criteo-1B | Wall-clock Time (hr)29.1 | 5 | |
| Binary Classification | Synthetic | Val. Stability0.072 | 4 | |
| Binary income prediction | UCI Adult | Balanced Accuracy81.9 | 4 | |
| Click-Through Rate Prediction | Criteo-1B | Test AUC0.6142 | 4 |