Learning Over Dirty Data with Minimal Repairs

About

Missing data often exists in real-world datasets, requiring significant time and effort for data repair to learn accurate models. In this paper, we show that imputing all missing values is not always necessary to achieve an accurate ML model. We introduce concepts of minimal and almost minimal repair, which are subsets of missing data items in training data whose imputation delivers accurate and reasonably accurate models, respectively. Imputing these subsets can significantly reduce the time, computational resources, and manual effort required for learning. We show that finding these subsets is NP-hard for some popular models and propose efficient approximation algorithms for wide range of models. Our extensive experiments indicate that our proposed algorithms can substantially reduce the time and effort required to learn on incomplete datasets.

Cheng Zhen, Prayoga, Nischal Aryal, Arash Termehchy, Garrett Biwer, Lubna Alzamil• 2025

Related benchmarks

Task	Dataset	Result
Linear regression	Gas 40% MNAR	Time (s)0.274	32
Linear regression	Concrete 40% MNAR	Time (s)0.0217	32
Linear regression	Concrete 20% MNAR	Time (s)0.0179	32
Classification	Tuadromd 20% MCAR	Inference Time (s)0.404	23
Classification	Tuadromd 40% MCAR	Inference Time (s)0.306	23
Classification	Tuadromd 60% MCAR	Inference Time (s)0.034	23
Classification	Credit Default 20% MCAR	Execution Time (s)0.624	23
Classification	Credit Default 40% MCAR	Inference Time (s)0.481	23
Classification	Credit Default 60% MCAR	Time (s)0.249	23
Linear regression	Superconductivity 20 MCAR injected (test)	Time (s)0.12	19

Showing 10 of 76 rows

...

Other info

Follow for update

@wizwand_team Discord