REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers
About
Tabular data is a common form of organizing data. Multiple models are available to generate synthetic tabular datasets where observations are independent, but few have the ability to produce relational datasets. Modeling relational data is challenging as it requires modeling both a "parent" table and its relationships across tables. We introduce REaLTabFormer (Realistic Relational and Tabular Transformer), a tabular and relational synthetic data generation model. It first creates a parent table using an autoregressive GPT-2 model, then generates the relational dataset conditioned on the parent table using a sequence-to-sequence (Seq2Seq) model. We implement target masking to prevent data copying and propose the $Q_{\delta}$ statistic and statistical bootstrapping to detect overfitting. Experiments using real-world datasets show that REaLTabFormer captures the relational structure better than a baseline model. REaLTabFormer also achieves state-of-the-art results on prediction tasks, "out-of-the-box", for large non-relational datasets without needing fine-tuning.
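The two-stage setup described above (a parent table modeled on its own, then child rows generated conditioned on each parent row) can be illustrated with a minimal data-preparation sketch. This is not the authors' implementation: the `column=value` token format, the `<eor>` end-of-row marker, the helper names `encode_row` and `build_training_pairs`, and the toy store/sales tables are all assumptions made for illustration. The sketch only shows how parent rows could serve as source sequences and their grouped child rows as target sequences for a Seq2Seq model.

```python
from collections import defaultdict


def encode_row(row, columns):
    """Serialize one row as a list of 'column=value' tokens (assumed format)."""
    return [f"{col}={row[col]}" for col in columns]


def build_training_pairs(parents, children, key):
    """Pair each encoded parent row with the encoded rows of its children.

    The join key is dropped from both sides so that a model trained on
    these pairs would see attribute values only, not identifiers.
    """
    parent_cols = [c for c in parents[0] if c != key]
    child_cols = [c for c in children[0] if c != key]

    # Group child rows by the parent they reference.
    children_by_key = defaultdict(list)
    for child in children:
        children_by_key[child[key]].append(child)

    pairs = []
    for parent in parents:
        source = encode_row(parent, parent_cols)
        target = []
        for child in children_by_key.get(parent[key], []):
            # "<eor>" is a hypothetical end-of-row marker.
            target.extend(encode_row(child, child_cols) + ["<eor>"])
        pairs.append((source, target))
    return pairs


# Toy relational dataset: one parent table and one child table.
parents = [
    {"store_id": 1, "city": "Lima", "size": "large"},
    {"store_id": 2, "city": "Oslo", "size": "small"},
]
children = [
    {"store_id": 1, "day": "Mon", "sales": 120},
    {"store_id": 1, "day": "Tue", "sales": 95},
    {"store_id": 2, "day": "Mon", "sales": 40},
]

pairs = build_training_pairs(parents, children, key="store_id")
for source, target in pairs:
    print(source, "->", target)
```

In this arrangement the source sequences alone would be the training data for the autoregressive parent model, while the (source, target) pairs would train the conditional child model. A variable number of child rows per parent is handled naturally because the target sequence simply grows with each additional row.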
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Tabular Classification | diabetes 37 (test) | Test Error | 73.2 | 15 |
| Tabular Data Utility | Adult (test) | AUC | 0.925 | 14 |
| Tabular Data Utility | Default (test) | AUC | 0.764 | 14 |
| Tabular Data Utility | Magic (test) | AUC | 0.931 | 14 |
| Tabular Data Utility | California (test) | AUC | 0.948 | 14 |
| Tabular Data Synthesis | Aggregate of five tabular datasets (full train vs original train) | Marginal Error | 5.31 | 13 |
| BGP Classification | BGP | Mean ΔG | -15.35 | 10 |
| Tabular Classification | BU (test) | MLE Score | 0.928 | 6 |
| Tabular Classification | AB Adult (test) | MLE Loss | 0.504 | 6 |
| Tabular Regression | CA (California Housing) (test) | MLE Loss | 0.808 | 6 |