
ReMasker: Imputing Tabular Data with Masked Autoencoding

About

We present ReMasker, a new method for imputing missing values in tabular data that extends the masked autoencoding framework. Compared with prior work, ReMasker is both simple and effective. It is simple in that, besides the missing values (which are naturally masked), we randomly "re-mask" another set of values, optimize the autoencoder by reconstructing this re-masked set, and apply the trained model to predict the missing values. It is effective in that, in extensive evaluation on benchmark datasets, ReMasker performs on par with or outperforms state-of-the-art methods in terms of both imputation fidelity and utility under various missingness settings, and its performance advantage often increases with the ratio of missing data. We further explore a theoretical justification for its effectiveness, showing that ReMasker tends to learn missingness-invariant representations of tabular data. Our findings indicate that masked modeling is a promising direction for further research on tabular data imputation. The code is publicly available.
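
The three steps described in the abstract (re-mask, reconstruct the re-masked set, predict the missing values) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: a tiny NumPy MLP stands in for the autoencoder, and the names `remask` and `fit_and_impute`, the re-masking ratio, the layer size, and the learning rate are all assumptions made for the example.

```python
import numpy as np

def remask(observed, ratio, rng):
    """Hide a random fraction `ratio` of the observed entries (True = observed)."""
    hide = (rng.random(observed.shape) < ratio) & observed
    # visible: what the model may see; hide: entries it must reconstruct
    return observed & ~hide, hide

def fit_and_impute(X, ratio=0.5, hidden=16, epochs=300, lr=0.01, seed=0):
    """Train on re-masked entries, then fill the NaNs of X (assumed stand-in MLP)."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(X)
    X0 = np.where(observed, X, 0.0)          # zero-fill so arithmetic is NaN-free
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (2 * d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d));     b2 = np.zeros(d)

    for _ in range(epochs):
        visible, target = remask(observed, ratio, rng)
        # Input = visible values plus the visibility mask itself
        inp = np.concatenate([X0 * visible, visible.astype(float)], axis=1)
        h = np.tanh(inp @ W1 + b1)
        out = h @ W2 + b2
        # Squared error on the re-masked entries only
        err = (out - X0) * target
        g_out = 2.0 * err / max(int(target.sum()), 1)
        g_h = (g_out @ W2.T) * (1.0 - h ** 2)
        W2 -= lr * (h.T @ g_out);  b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (inp.T @ g_h);  b1 -= lr * g_h.sum(axis=0)

    # Inference: show every observed value, predict the truly missing ones
    inp = np.concatenate([X0, observed.astype(float)], axis=1)
    pred = np.tanh(inp @ W1 + b1) @ W2 + b2
    return np.where(observed, X, pred)
```

Calling `fit_and_impute(X)` on a float array with `np.nan` marking missing cells returns an array of the same shape in which observed cells are unchanged and missing cells hold model predictions. Features are assumed to be roughly standardized beforehand; the sketch does no scaling of its own.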

Tianyu Du, Luca Melis, Ting Wang • 2023

Related benchmarks

Task                     Dataset                                 Result            Rank
Classification           33 datasets missing rate <= 10% (test)  AUC 86.56         65
Classification           10 Datasets Missing rate > 10% (test)   AUC 80.14         50
Data Imputation          NPHA                                    Accuracy 65.38    30
Data Imputation          Gliomas                                 Accuracy 83.45    30
Data Imputation          Cancer                                  Accuracy 42.52    28
Data Imputation          Diabetes (1/3 omitted)                  Accuracy 57.16    16
Tabular Data Imputation  MissBench (overall)                     MCAR Score 80.5   15
Tabular Imputation       MissBench (test)                        MCAR Score 0.327  15
Data Imputation          Housing                                 MAE 0.0948        14
Data Imputation          Wine                                    MAE 0.078         14
Showing 10 of 15 rows
