Missing Data Imputation using Optimal Transport

About

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches extracted randomly from the same dataset should share the same distribution, we leverage optimal transport distances to quantify that criterion and turn it into a loss function to impute missing data values. We propose practical methods to minimize these losses using end-to-end learning, which may or may not exploit parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository, in MCAR (missing completely at random), MAR (missing at random) and MNAR (missing not at random) settings. These experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even for high percentages of missing values.

Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi • 2020
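
The criterion described in the abstract (two random batches from the same dataset should look alike under an optimal transport distance) can be illustrated with a small sketch. It assumes the discrepancy is an entropy-regularised Sinkhorn divergence with squared Euclidean cost; the abstract only says "optimal transport distances", so that choice, the function names (`entropic_ot`, `sinkhorn_divergence`) and the parameters (`eps`, `n_iter`) are illustrative assumptions rather than the authors' implementation. The sketch only evaluates the loss between two complete batches; an imputation method along the lines of the abstract would additionally update the filled-in entries to reduce this value, which is not shown here.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def _sq_dist(x, y):
    """Squared Euclidean distance between two points given as lists."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def entropic_ot(X, Y, eps=1.0, n_iter=200):
    """Entropy-regularised OT value between uniform empirical measures on X and Y.

    Runs log-domain Sinkhorn iterations on the dual potentials f and g and
    returns the dual value <f, a> + <g, b> (assumed formulation, see lead-in).
    """
    n, m = len(X), len(Y)
    log_a, log_b = -math.log(n), -math.log(m)  # uniform weights 1/n and 1/m
    C = [[_sq_dist(x, y) for y in Y] for x in X]
    f, g = [0.0] * n, [0.0] * m
    for _ in range(n_iter):
        f = [-eps * _logsumexp([log_b + (g[j] - C[i][j]) / eps for j in range(m)])
             for i in range(n)]
        g = [-eps * _logsumexp([log_a + (f[i] - C[i][j]) / eps for i in range(n)])
             for j in range(m)]
    return sum(f) / n + sum(g) / m


def sinkhorn_divergence(X, Y, eps=1.0, n_iter=200):
    """Debiased divergence: OT(X, Y) - 0.5 * (OT(X, X) + OT(Y, Y))."""
    return entropic_ot(X, Y, eps, n_iter) - 0.5 * (
        entropic_ot(X, X, eps, n_iter) + entropic_ot(Y, Y, eps, n_iter)
    )


if __name__ == "__main__":
    # Two toy "batches" (hypothetical data): the second is a shifted copy.
    batch_1 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    batch_2 = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]
    print("same batch:   ", sinkhorn_divergence(batch_1, batch_1))
    print("shifted batch:", sinkhorn_divergence(batch_1, batch_2))
```

The subtraction of the two self-terms is what makes the value vanish when both batches coincide, which is the property that lets it serve as a "do these batches share a distribution" score. Pure-Python loops keep the sketch dependency-free; a practical version would likely use an autodiff framework so that gradients with respect to imputed values are available.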

Related benchmarks

Task	Dataset	Result
Time Series Imputation	ETTh1	MSE 0.936	162
Time Series Imputation	ETTm1	MSE 0.965	159
Time Series Imputation	ETTm2	MSE 0.927	125
Classification	33 datasets missing rate <= 10% (test)	AUC 86.46	65
Time Series Imputation	Exchange	MSE 0.783	54
Classification	10 Datasets Missing rate > 10% (test)	AUC 81.11	50
Classification	Musk2 downstream	Balanced Accuracy 93.6	45
Regression	Energy 0% non-corrupted features	RMSE 0.107	15
Regression	Energy 50% non-corrupted features	RMSE 0.09	15
Regression	Energy 100% non-corrupted features	RMSE 0.085	15

Showing 10 of 41 rows
