
TabSCM: A Practical Framework for Generating Realistic Tabular Data

About

Most tabular-data generators match marginal statistics yet ignore causal structure, leading downstream models to learn spurious or unfair patterns. We present TabSCM, a mixed-type generator that preserves these causal dependencies. Starting from a Completed Partially Directed Acyclic Graph (CPDAG) found by any causal structure discovery algorithm, TabSCM (i) orients edges to obtain a DAG, (ii) fits root-node marginals with KDE or categorical frequencies, and (iii) learns structural assignments in topological order. These assignments use conditional diffusion models for continuous child nodes and gradient-boosted trees for categorical ones. Ancestral sampling yields semantically valid records and enables exact counterfactual queries. On seven public datasets spanning healthcare, finance, housing, and environment, TabSCM matches or surpasses state-of-the-art GAN, diffusion, and LLM baselines in statistical fidelity, downstream utility, and privacy risk, while also cutting rule-violation rates and supporting causally meaningful, robust conditional interventions. Because generation is decomposed into explicit equations, it runs up to 583× faster than diffusion-only models and exposes interpretable knobs for fairness auditing and policy simulation, making TabSCM a practical choice for realism, explainability, and causal soundness.
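
The generation loop described above (root marginals, then child mechanisms in topological order, with optional interventions) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy DAG, the column values, and every function name below are hypothetical, and the child mechanisms are simple hand-written stand-ins for the conditional diffusion models and gradient-boosted trees mentioned in the abstract. It uses only the Python standard library.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical toy DAG: each node maps to the list of its parents.
PARENTS = {
    "age": [],
    "smoker": [],
    "bmi": ["age"],
    "risk": ["age", "smoker", "bmi"],
}

# Made-up "training" columns for the two root nodes.
AGE_DATA = [23.0, 35.0, 41.0, 52.0, 60.0, 67.0]
SMOKER_DATA = ["no", "no", "yes", "no", "yes", "no"]


def sample_kde(data, bandwidth, rng):
    """Draw from a Gaussian KDE: pick a data point, then add Gaussian noise."""
    return rng.choice(data) + rng.gauss(0.0, bandwidth)


def sample_categorical(data, rng):
    """Draw a category with probability equal to its empirical frequency."""
    return rng.choice(data)


def bmi_mechanism(parents, rng):
    # Stand-in for a learned conditional model of a continuous child node.
    return 20.0 + 0.1 * parents["age"] + rng.gauss(0.0, 1.5)


def risk_mechanism(parents, rng):
    # Stand-in for a learned classifier of a categorical child node.
    score = 0.02 * parents["age"] + 0.05 * parents["bmi"]
    score += 1.0 if parents["smoker"] == "yes" else 0.0
    p_high = min(1.0, max(0.0, (score - 1.5) / 3.0))
    return "high" if rng.random() < p_high else "low"


MECHANISMS = {
    "age": lambda parents, rng: sample_kde(AGE_DATA, 2.0, rng),
    "smoker": lambda parents, rng: sample_categorical(SMOKER_DATA, rng),
    "bmi": bmi_mechanism,
    "risk": risk_mechanism,
}


def sample_record(rng, interventions=None):
    """Ancestral sampling: visit nodes parents-first; do(X=x) overrides X."""
    interventions = interventions or {}
    record = {}
    # static_order() yields every node after all of its parents.
    for node in TopologicalSorter(PARENTS).static_order():
        if node in interventions:
            record[node] = interventions[node]
        else:
            parent_values = {p: record[p] for p in PARENTS[node]}
            record[node] = MECHANISMS[node](parent_values, rng)
    return record
```

Calling `sample_record(random.Random(0))` draws one observational record, while `sample_record(random.Random(0), {"smoker": "yes"})` draws under the intervention do(smoker = "yes"): the intervened node ignores its own mechanism and its descendants are resampled from the forced value. Exact counterfactual queries would additionally require recovering each record's noise terms, which this sketch does not attempt.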

Sven Jacob, Bardh Prenkaj, Weijia Shao, Gjergji Kasneci • 2026

Related benchmarks

Task                            Dataset                          Metric               Result  Rank
Classification                  Diabetes real (test)             AUC                  0.945   10
Regression                      Housing real (test)              RMSE                 0.253   10
Semantic Consistency            Adult, Housing, Loan, and HELOC  Violation Rate (S1)  6.95    10
Regression                      Beijing real (test)              RMSE                 0.594   10
Classification                  Adult (test)                     AUC                  0.842   10
Column-wise density estimation  Diabetes                         Error Rate           1.73    9
Correlation Estimation          Housing                          Error Rate           1.86    9
Detectability                   Housing                          C2ST Score           99.5    9
Classification                  Loan (test)                      AUC                  59.1    9
Column-wise density estimation  HELOC                            Error Rate (%)       2.09    9
Showing 10 of 41 rows
