CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation

About

We present CACTI, a masked autoencoding approach for imputing tabular data that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness while incorporating semantic relationships between features (captured by column names and text descriptions) to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods across a diverse range of datasets and missingness conditions, with an average $R^2$ gain of 7.8% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, missing at random, and missing completely at random, respectively). Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.
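
The abstract names the masking strategy but does not spell out its steps, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that "copy masking" means each row borrows the missingness pattern of another row in the batch, and that "median truncated" means capping the number of visible cells per row at the batch median. The function name `copy_mask`, the `visible`/`target` masks, the random donor choice, and the truncation rule are all illustrative assumptions.

```python
import random
from statistics import median


def copy_mask(observed, rng):
    """Split each row's observed cells into visible inputs and hidden targets.

    observed: list of rows, each a list of bools (True = value is present).
    Returns (visible, target), two boolean masks with the same shape.
    """
    n = len(observed)
    # Each row borrows the missingness pattern of a randomly assigned donor row.
    donors = rng.sample(range(n), n)
    # A cell stays visible only if it is present in the row AND in the donor.
    visible = [
        [o and p for o, p in zip(observed[i], observed[donors[i]])]
        for i in range(n)
    ]
    # ASSUMPTION: cap every row at the batch-median number of visible cells.
    counts = [sum(row) for row in visible]
    cap = int(median(counts))
    for row, count in zip(visible, counts):
        if count > cap:
            shown = [j for j, v in enumerate(row) if v]
            for j in rng.sample(shown, count - cap):
                row[j] = False
    # Targets are cells that truly exist but were hidden from the model.
    target = [
        [o and not v for o, v in zip(obs_row, vis_row)]
        for obs_row, vis_row in zip(observed, visible)
    ]
    return visible, target


# Toy batch: four rows, four features, one missing value per row.
observed = [
    [True, True, False, True],
    [True, False, True, True],
    [False, True, True, True],
    [True, True, True, False],
]
visible, target = copy_mask(observed, random.Random(0))
```

In this sketch the reconstruction loss would be computed only on `target` cells, which are guaranteed to have ground-truth values, while the hidden positions follow missingness patterns that actually occur in the data rather than uniformly random ones.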

Aditya Gorla, Ryan Wang, Zhengtong Liu, Ulzee An, Sriram Sankararaman • 2025

Related benchmarks

Task | Dataset | Result | Rank
Tabular Imputation | MissBench (test) | MCAR Score: 0.337 | 15
Tabular Data Imputation | MissBench (overall) | MCAR Score: 65.9 | 15
Imputation | OpenML MCAR, Missing Probability 0.4 (test) | MAD: 0.1 | 13
