CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation

About

We present CACTI, a masked autoencoding approach for imputing tabular data that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness while incorporating semantic relationships between features, captured by column names and text descriptions, to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods across a diverse range of datasets and missingness conditions, with an average $R^2$ gain of 7.8% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, missing at random, and missing completely at random, respectively). Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.
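
To make the idea of learning from empirical missingness patterns concrete, here is a minimal sketch of plain copy masking. It is an illustration under stated assumptions, not the authors' implementation. It assumes that copy masking means borrowing the missingness pattern of another row in the dataset and applying it to the observed cells of the current row. The median truncation step is not described in this summary and is left out. The function name `copy_mask` and the fallback rule are hypothetical.

```python
import random

def copy_mask(observed, rng):
    """Build training masks by copying missingness patterns between rows.

    observed: list of rows, each a list of bools (True = value is observed).
    Returns one mask per row: True = cell stays visible to the model,
    False = cell is hidden (either truly missing or masked for training).
    """
    n_rows = len(observed)
    train_masks = []
    for row in observed:
        # Assumption: the donor pattern is drawn uniformly from all rows.
        donor = observed[rng.randrange(n_rows)]
        # A cell stays visible only if it is observed in this row
        # and also observed in the donor row.
        visible = [own and other for own, other in zip(row, donor)]
        # Hypothetical fallback: if the donor would hide every observed
        # cell, keep the row's own pattern so the model still has input.
        if not any(visible):
            visible = list(row)
        train_masks.append(visible)
    return train_masks

rng = random.Random(0)
observed = [
    [True, True, False],
    [True, False, True],
    [False, True, True],
]
masks = copy_mask(observed, rng)
```

Cells that are observed in a row but hidden by its training mask have known values, so they can serve as reconstruction targets. Because the hidden cells follow patterns that actually occur in the data, the training signal reflects the dataset's own missingness structure rather than uniformly random masking.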

Aditya Gorla, Ryan Wang, Zhengtong Liu, Ulzee An, Sriram Sankararaman • 2025

Related benchmarks

Task	Dataset	Result
Tabular Imputation	MissBench (test)	MCAR Score 0.337	15
Tabular Data Imputation	MissBench (overall)	MCAR Score 65.9	15
Imputation	OpenML MCAR, Missing Probability 0.4 (test)	MAD 0.1	13
