
REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers

About

Tabular data is a common form of organizing data. Multiple models are available to generate synthetic tabular datasets where observations are independent, but few have the ability to produce relational datasets. Modeling relational data is challenging as it requires modeling both a "parent" table and its relationships across tables. We introduce REaLTabFormer (Realistic Relational and Tabular Transformer), a tabular and relational synthetic data generation model. It first creates a parent table using an autoregressive GPT-2 model, then generates the relational dataset conditioned on the parent table using a sequence-to-sequence (Seq2Seq) model. We implement target masking to prevent data copying and propose the $Q_{\delta}$ statistic and statistical bootstrapping to detect overfitting. Experiments using real-world datasets show that REaLTabFormer captures the relational structure better than a baseline model. REaLTabFormer also achieves state-of-the-art results on prediction tasks, "out-of-the-box", for large non-relational datasets without needing fine-tuning.

Aivin V. Solatorio, Olivier Dupriez • 2023
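The two-stage flow described in the abstract (generate the parent table first, then generate child rows conditioned on each parent row) can be illustrated with a toy sketch. This is not the REaLTabFormer implementation: the GPT-2 and Seq2Seq models are replaced by simple random samplers, and every name here (`sample_parent_rows`, `sample_child_rows`, `customer_id`, `segment`, `amount`) is a hypothetical placeholder chosen for illustration only.

```python
import random


def sample_parent_rows(n, rng):
    """Hypothetical stand-in for the autoregressive parent-table model."""
    return [
        {"customer_id": i, "segment": rng.choice(["retail", "business"])}
        for i in range(n)
    ]


def sample_child_rows(parent, rng):
    """Hypothetical stand-in for the Seq2Seq child model.

    The output depends on the parent row, which is the point of the
    conditioning step: child rows are not sampled independently.
    """
    is_business = parent["segment"] == "business"
    max_orders = 5 if is_business else 2
    max_amount = 500.0 if is_business else 50.0
    n_orders = rng.randint(0, max_orders)
    return [
        {
            "customer_id": parent["customer_id"],  # foreign key to the parent
            "amount": round(rng.uniform(1.0, max_amount), 2),
        }
        for _ in range(n_orders)
    ]


def generate_relational_dataset(n_parents, seed=0):
    """Generate parents first, then children conditioned on each parent."""
    rng = random.Random(seed)
    parents = sample_parent_rows(n_parents, rng)
    children = [row for p in parents for row in sample_child_rows(p, rng)]
    return parents, children


parents, children = generate_relational_dataset(n_parents=4)
```

Because each child row is produced from a specific parent row, every foreign key in the child table refers to an existing parent by construction, which is the relational property the abstract highlights.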

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Tabular Classification | diabetes 37 (test) | Test Error 73.2 | 15 |
| Tabular Data Utility | Adult (test) | AUC 0.925 | 14 |
| Tabular Data Utility | Default (test) | AUC 0.764 | 14 |
| Tabular Data Utility | Magic (test) | AUC 0.931 | 14 |
| Tabular Data Utility | California (test) | AUC 0.948 | 14 |
| Tabular Data Synthesis | Aggregate of five tabular datasets (full train vs original train) | Marginal Error 5.31 | 13 |
| BGP Classification | BGP | Mean ΔG -15.35 | 10 |
| Tabular Classification | BU (test) | MLE Score 0.928 | 6 |
| Tabular Classification | AB Adult (test) | MLE Loss 0.504 | 6 |
| Tabular Regression | CA (California Housing) (test) | MLE Loss 0.808 | 6 |
Showing 10 of 12 rows
