Learning Over Dirty Data with Minimal Repairs
About
Missing data often exists in real-world datasets, requiring significant time and effort for data repair to learn accurate models. In this paper, we show that imputing all missing values is not always necessary to achieve an accurate ML model. We introduce concepts of minimal and almost minimal repair, which are subsets of missing data items in training data whose imputation delivers accurate and reasonably accurate models, respectively. Imputing these subsets can significantly reduce the time, computational resources, and manual effort required for learning. We show that finding these subsets is NP-hard for some popular models and propose efficient approximation algorithms for wide range of models. Our extensive experiments indicate that our proposed algorithms can substantially reduce the time and effort required to learn on incomplete datasets.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Linear regression | Gas 40% MNAR | Time (s)0.274 | 32 | |
| Linear regression | Concrete 40% MNAR | Time (s)0.0217 | 32 | |
| Linear regression | Concrete 20% MNAR | Time (s)0.0179 | 32 | |
| Classification | Tuadromd 20% MCAR | Inference Time (s)0.404 | 23 | |
| Classification | Tuadromd 40% MCAR | Inference Time (s)0.306 | 23 | |
| Classification | Tuadromd 60% MCAR | Inference Time (s)0.034 | 23 | |
| Classification | Credit Default 20% MCAR | Execution Time (s)0.624 | 23 | |
| Classification | Credit Default 40% MCAR | Inference Time (s)0.481 | 23 | |
| Classification | Credit Default 60% MCAR | Time (s)0.249 | 23 | |
| Linear regression | Superconductivity 20 MCAR injected (test) | Time (s)0.12 | 19 |