Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Learning Over Dirty Data with Minimal Repairs

About

Missing data often exists in real-world datasets, requiring significant time and effort for data repair to learn accurate models. In this paper, we show that imputing all missing values is not always necessary to achieve an accurate ML model. We introduce concepts of minimal and almost minimal repair, which are subsets of missing data items in training data whose imputation delivers accurate and reasonably accurate models, respectively. Imputing these subsets can significantly reduce the time, computational resources, and manual effort required for learning. We show that finding these subsets is NP-hard for some popular models and propose efficient approximation algorithms for wide range of models. Our extensive experiments indicate that our proposed algorithms can substantially reduce the time and effort required to learn on incomplete datasets.

Cheng Zhen, Prayoga, Nischal Aryal, Arash Termehchy, Garrett Biwer, Lubna Alzamil• 2025

Related benchmarks

TaskDatasetResultRank
Linear regressionGas 40% MNAR
Time (s)0.274
32
Linear regressionConcrete 40% MNAR
Time (s)0.0217
32
Linear regressionConcrete 20% MNAR
Time (s)0.0179
32
ClassificationTuadromd 20% MCAR
Inference Time (s)0.404
23
ClassificationTuadromd 40% MCAR
Inference Time (s)0.306
23
ClassificationTuadromd 60% MCAR
Inference Time (s)0.034
23
ClassificationCredit Default 20% MCAR
Execution Time (s)0.624
23
ClassificationCredit Default 40% MCAR
Inference Time (s)0.481
23
ClassificationCredit Default 60% MCAR
Time (s)0.249
23
Linear regressionSuperconductivity 20 MCAR injected (test)
Time (s)0.12
19
Showing 10 of 76 rows
...

Other info

Follow for update