Missing Data Imputation using Optimal Transport

About

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches extracted randomly from the same dataset should share the same distribution, we leverage optimal transport (OT) distances to quantify that criterion and turn it into a loss function to impute missing data values. We propose practical methods to minimize these losses using end-to-end learning, which can be applied with or without parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository, in missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) settings. These experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even for high percentages of missing values.
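
To make the batch criterion concrete, here is a minimal sketch of an OT-based loss between two batches. It assumes an entropy-regularized Sinkhorn divergence with a squared Euclidean cost as the OT distance; the abstract does not say which distance is used. The helper names (`entropic_ot`, `sinkhorn_divergence`, `fill`), the toy batches, and the values of `eps` and `n_iter` are all hypothetical choices made for illustration.

```python
import math

def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def softmin(values, eps):
    """Numerically stable -eps * log(sum(exp(-v / eps)))."""
    top = max(-v / eps for v in values)
    return -eps * (top + math.log(sum(math.exp(-v / eps - top) for v in values)))

def entropic_ot(xs, ys, eps=0.5, n_iter=200):
    """Entropic OT cost between two uniformly weighted point clouds,
    computed with log-domain Sinkhorn updates on the dual potentials."""
    n, m = len(xs), len(ys)
    cost = [[sq_dist(x, y) for y in ys] for x in xs]
    log_a, log_b = -math.log(n), -math.log(m)
    f, g = [0.0] * n, [0.0] * m
    for _ in range(n_iter):
        f = [softmin([cost[i][j] - g[j] - eps * log_b for j in range(m)], eps)
             for i in range(n)]
        g = [softmin([cost[i][j] - f[i] - eps * log_a for i in range(n)], eps)
             for j in range(m)]
    # After the g update the dual objective reduces to <a, f> + <b, g>.
    return sum(f) / n + sum(g) / m

def sinkhorn_divergence(xs, ys, eps=0.5):
    """Debiased entropic OT: zero when both batches are identical."""
    return (entropic_ot(xs, ys, eps)
            - 0.5 * entropic_ot(xs, xs, eps)
            - 0.5 * entropic_ot(ys, ys, eps))

def fill(batch, value):
    """Replace every missing entry (None) with a candidate value."""
    return [[value if v is None else v for v in row] for row in batch]

# Toy batches (hypothetical data): batch_b has one missing entry.
batch_a = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
batch_b = [[0.5, None], [1.5, 1.5], [2.5, 2.5]]

loss_plausible = sinkhorn_divergence(batch_a, fill(batch_b, 0.5))
loss_implausible = sinkhorn_divergence(batch_a, fill(batch_b, 10.0))
print(loss_plausible < loss_implausible)
```

The sketch only evaluates the loss for two candidate fills. An imputation method built on this idea would instead treat the missing entries as free variables and lower the loss over many randomly drawn batch pairs, for example by gradient descent with an autodiff library.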

Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi • 2020

Related benchmarks

Task                    Dataset                                 Result                  Rank
Time Series Imputation  ETTm1                                   MSE 0.965               110
Time Series Imputation  ETTh1                                   MSE 0.936               86
Time Series Imputation  ETTm2                                   MSE 0.927               83
Classification          33 datasets missing rate <= 10% (test)  AUC 86.46               65
Time Series Imputation  Exchange                                MSE 0.783               54
Classification          10 Datasets Missing rate > 10% (test)   AUC 81.11               50
Classification          Musk2 downstream                        Balanced Accuracy 93.6  45
Regression              Energy 0% non-corrupted features        RMSE 0.107              15
Regression              Energy 50% non-corrupted features       RMSE 0.09               15
Regression              Energy 100% non-corrupted features      RMSE 0.085              15
Showing 10 of 35 rows
