
Data Shapley: Equitable Valuation of Data for Machine Learning

About

As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what is an equitable valuation for individual data. In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on $n$ data points to produce a predictor, we propose data Shapley as a metric to quantify the value of each training datum to the predictor performance. Data Shapley value uniquely satisfies several natural properties of equitable data valuation. We develop Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. In addition to being equitable, extensive experiments across biomedical, image and synthetic data demonstrate that data Shapley has several other benefits: 1) it is more powerful than the popular leave-one-out or leverage score in providing insight on what data is more valuable for a given learning task; 2) low Shapley value data effectively capture outliers and corruptions; 3) high Shapley value data inform what type of new data to acquire to improve the predictor.
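The Monte Carlo estimation mentioned above can be illustrated with a minimal sketch based on permutation sampling: the Shapley value of a training point is its average marginal contribution to the performance score over random orderings of the data. This is an illustration under stated assumptions, not the authors' implementation. The names `monte_carlo_shapley`, `utility`, and `toy_utility` are hypothetical, no truncation or gradient-based shortcut is included, and the toy utility is a simple additive score chosen so the estimates can be checked by hand. In practice `utility` would train the learning algorithm on the given subset and return the predictor's performance on held-out data.

```python
import random
from typing import Callable, List, Sequence


def monte_carlo_shapley(
    n_points: int,
    utility: Callable[[Sequence[int]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate per-point Shapley values by averaging marginal
    contributions over random permutations of the training indices."""
    rng = random.Random(seed)
    values = [0.0] * n_points
    indices = list(range(n_points))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        prefix: List[int] = []
        prev_score = utility(prefix)  # score of the empty training set
        for idx in indices:
            prefix.append(idx)
            score = utility(prefix)
            # Marginal contribution of idx given the points before it.
            values[idx] += score - prev_score
            prev_score = score
    return [v / n_permutations for v in values]


# Hypothetical toy utility: every point adds a fixed amount to the score,
# so the exact Shapley value of point i is simply weights[i].
weights = [0.5, 0.1, -0.2, 0.3]


def toy_utility(subset: Sequence[int]) -> float:
    return sum(weights[i] for i in subset)


estimates = monte_carlo_shapley(len(weights), toy_utility, n_permutations=50)
```

In this toy setting the point with the negative weight receives a negative value, which mirrors the abstract's observation that low-value points tend to be the harmful ones. With a real model each call to `utility` is a full training run, which is why efficient approximations matter.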

Amirata Ghorbani, James Zou • 2019

Related benchmarks

Task                        Dataset             Result          Rank
Image Classification        Fashion MNIST       Accuracy 87.9   16
Label Noise Identification  MNIST (train)       AUC 0.933       15
Dataset Removal             CIFAR10 embedding   DR 61           12
Dataset Removal             bbc-embedding       DR 0.89         12
Noisy label detection       CIFAR10 embedding   NLD 13          12
Noisy label detection       bbc-embedding       NLD Score 0.12  12
Dataset Addition            CIFAR10 embedding   DA Score 0.18   12
Noisy label detection       IMDB embedding      NLD 28          12
Dataset Addition            bbc-embedding       DA Score 12     12
Dataset Removal             IMDB embedding      DR 75           12
Showing 10 of 59 rows
