TabSCM: A Practical Framework for Generating Realistic Tabular Data
About
Most tabular-data generators match marginal statistics yet ignore causal structure, leading downstream models to learn spurious or unfair patterns. We present TabSCM, a mixed-type generator that preserves causal dependencies among features. Starting from a Completed Partially Directed Acyclic Graph (CPDAG) produced by any causal structure discovery algorithm, TabSCM (i) orients the edges to obtain a DAG, (ii) fits root-node marginals with KDE or categorical frequencies, and (iii) learns structural assignments in topological order. The assignments use conditional diffusion models for continuous child nodes and gradient-boosted trees for categorical ones. Ancestral sampling yields semantically valid records and enables exact counterfactual queries. On seven public datasets spanning healthcare, finance, housing, and environment, TabSCM matches or surpasses state-of-the-art GAN, diffusion, and LLM baselines in statistical fidelity, downstream utility, and privacy risk, while also cutting rule-violation rates and providing causally meaningful and robust conditional interventions. Because generation is decomposed into explicit equations, it runs up to 583$\times$ faster than diffusion-only models and exposes interpretable knobs for fairness auditing and policy simulation, making TabSCM a practical choice for realism, explainability, and causal soundness.
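The ancestral-sampling step described in the abstract can be illustrated with a minimal sketch. This is not the TabSCM implementation: the three-node DAG, the variable names (`age`, `income`, `approved`), and the hand-written mechanisms are hypothetical stand-ins, with a Gaussian draw in place of a KDE-fitted root marginal and simple formulas in place of the conditional diffusion models and gradient-boosted trees. The `interventions` argument shows a do-style override of a node; it does not perform the noise abduction that a full counterfactual query would require.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical toy DAG: each node maps to the list of its parents.
PARENTS = {
    "age": [],
    "income": ["age"],
    "approved": ["age", "income"],
}

def sample_age(rng, row):
    # Root node: stand-in for a fitted marginal (e.g. KDE).
    return rng.gauss(40.0, 10.0)

def sample_income(rng, row):
    # Continuous child: stand-in for a learned conditional sampler.
    return 1000.0 * row["age"] + rng.gauss(0.0, 5000.0)

def sample_approved(rng, row):
    # Categorical child: stand-in for a learned classifier.
    return "yes" if row["income"] > 35000.0 else "no"

MECHANISMS = {
    "age": sample_age,
    "income": sample_income,
    "approved": sample_approved,
}

def topological_order(parents):
    # TopologicalSorter expects node -> predecessors, so parents come first.
    return list(TopologicalSorter(parents).static_order())

def ancestral_sample(rng, interventions=None):
    """Draw one record by visiting nodes parents-first.

    Nodes listed in `interventions` are clamped to the given value
    instead of being sampled from their mechanism.
    """
    interventions = interventions or {}
    row = {}
    for node in topological_order(PARENTS):
        if node in interventions:
            row[node] = interventions[node]
        else:
            row[node] = MECHANISMS[node](rng, row)
    return row

if __name__ == "__main__":
    rng = random.Random(0)
    print(ancestral_sample(rng))
    print(ancestral_sample(rng, interventions={"income": 50000.0}))
```

Because every node is generated from its parents only, clamping `income` propagates to `approved` but leaves `age` untouched, which is the behaviour one would expect from an intervention on a structural causal model.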
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Classification | Diabetes real (test) | AUC | 0.945 | 10 |
| Regression | Housing real (test) | RMSE | 0.253 | 10 |
| Semantic Consistency | Adult, Housing, Loan, and HELOC | Violation Rate (S1) | 6.95 | 10 |
| Regression | Beijing real (test) | RMSE | 0.594 | 10 |
| Classification | Adult (test) | AUC | 0.842 | 10 |
| Column-wise density estimation | Diabetes | Error Rate | 1.73 | 9 |
| Correlation Estimation | Housing | Error Rate | 1.86 | 9 |
| Detectability | Housing | C2ST Score | 99.5 | 9 |
| Classification | Loan (test) | AUC | 59.1 | 9 |
| Column-wise density estimation | HELOC | Error Rate (%) | 2.09 | 9 |