GAIN: Missing Data Imputation using Generative Adversarial Nets
About
We propose a novel method for imputing missing data by adapting the well-known Generative Adversarial Nets (GAN) framework. Accordingly, we call our method Generative Adversarial Imputation Nets (GAIN). The generator (G) observes some components of a real data vector, imputes the missing components conditioned on what is actually observed, and outputs a completed vector. The discriminator (D) then takes a completed vector and attempts to determine which components were actually observed and which were imputed. To ensure that D forces G to learn the desired distribution, we provide D with some additional information in the form of a hint vector. The hint reveals to D partial information about the missingness of the original sample, which is used by D to focus its attention on the imputation quality of particular components. This hint ensures that G does in fact learn to generate according to the true data distribution. We tested our method on various datasets and found that GAIN significantly outperforms state-of-the-art imputation methods.
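The data flow described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_generator` is a hypothetical stand-in for a trained G (it simply predicts the mean of the observed components), and the hint encoding used here (reveal each mask entry with probability `hint_rate`, otherwise emit 0.5) is one plausible choice that the text above does not specify. In a real setup, D would receive `x_hat` and `hint` and try to predict `mask` component by component.

```python
import random

def make_hint(mask, hint_rate, rng):
    """Reveal each mask entry with probability hint_rate; otherwise emit 0.5 ("unknown").

    Assumption: this encoding is illustrative, not taken from the text above.
    """
    return [m if rng.random() < hint_rate else 0.5 for m in mask]

def complete(x, mask, g_out):
    """Keep observed components of x; use the generator's output for missing ones."""
    return [xi if m == 1 else gi for xi, m, gi in zip(x, mask, g_out)]

def toy_generator(x_in, mask):
    """Hypothetical stand-in for G (not a trained network).

    Predicts the mean of the observed components for every position.
    """
    observed = [xi for xi, m in zip(x_in, mask) if m == 1]
    mean = sum(observed) / len(observed) if observed else 0.0
    return [mean] * len(x_in)

rng = random.Random(0)
x = [0.2, 0.8, 0.5, 0.1]   # a real data vector
mask = [1, 0, 1, 0]        # 1 = observed, 0 = missing

# G sees the data where it is observed and random noise where it is missing.
x_in = [xi if m == 1 else rng.uniform(0.0, 0.01) for xi, m in zip(x, mask)]

g_out = toy_generator(x_in, mask)
x_hat = complete(x, mask, g_out)   # completed vector passed to D
hint = make_hint(mask, 0.9, rng)   # partial missingness information for D
```

Observed components pass through `complete` unchanged, so only the imputed positions can give G away; the hint tells D which of the remaining positions it still has to judge.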
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Classification | Musk2 downstream | Balanced Accuracy | 93.9 | 45 |
| Missing Imputation | MIMIC-III Laboratory Data subset (n=5000, p=24) under MAR | RMSE | 0.061 | 40 |
| Data Imputation | Gliomas | Accuracy | 84.13 | 30 |
| Data Imputation | NPHA | Accuracy | 60.68 | 30 |
| Data Imputation | Cancer | Accuracy | 42.52 | 28 |
| Missing Data Imputation | eICU Collaborative Research Database Simulation of Blockwise Missing Data n=5000, p=40 | RMSE | 0.076 | 24 |
| Time Series Imputation | PEMS-BAY Block missing (test) | MAE | 2.18 | 21 |
| Time Series Imputation | PEMS-BAY Point missing (test) | MAE | 1.88 | 21 |
| Time Series Imputation | METR-LA Point missing (test) | MAE | 2.83 | 21 |
| Missing Data Imputation | eICU Collaborative Research Database Simulation of Blockwise Missing Data n=5000, p=40 | RMSE | 0.079 | 16 |