Modeling Tabular data using Conditional GAN

About

Modeling the probability distribution of rows in tabular data and generating realistic synthetic data is a non-trivial task. Tabular data usually contains a mix of discrete and continuous columns. Continuous columns may have multiple modes, whereas discrete columns are sometimes imbalanced, making the modeling difficult. Existing statistical and deep neural network models fail to properly model this type of data. We design TGAN, which uses a conditional generative adversarial network to address these challenges. To aid in a fair and thorough comparison, we design a benchmark with 7 simulated and 8 real datasets and several Bayesian network baselines. TGAN outperforms Bayesian methods on most of the real datasets, whereas other deep learning methods could not.

Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni • 2019

Related benchmarks

Task	Dataset	Result
Tabular Data Synthesis Fidelity	biodeg	KS Statistic (Mean): 0.49	90
Tabular Data Synthesis Fidelity	steel	KS Statistic (Mean): 0.61	90
Tabular Data Synthesis Fidelity	fourier	KS Fidelity: 0.67	88
Tabular Data Synthesis Fidelity	PROTEIN	Mean KS Statistic: 0.69	88
Tabular Data Synthesis Fidelity	Texture	KS Statistic (Mean): 0.82	64
Classification	Credit	ROCAUC: 63.7	63
Cardiac risk prediction	Clinical cardiac rehabilitation dataset	F1 Score (Risk): 65.65	60
Classification	Electricity (test)	Accuracy: 76.45	55
Regression	California Housing (CH) (test)	MSE: 0.35	52
Classification	UCI Mice Protein (test)	Accuracy: 93.75	50

Showing 10 of 589 rows
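
Several of the rows above report a mean KS statistic. The listing does not say how that score is computed, so what follows is only a minimal Python sketch of one common reading: take the two-sample Kolmogorov-Smirnov statistic (the largest gap between the empirical CDFs of a real and a synthetic column) for each numeric column, then average over columns. The function names `ks_statistic` and `mean_ks` and the toy columns are hypothetical, and the benchmark may well use a complement (1 minus the statistic) or another variant.

```python
from bisect import bisect_right


def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synth)
    gap = 0.0
    # The maximum gap is always attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of real values <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of synthetic values <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def mean_ks(real_cols, synth_cols):
    """Average the per-column KS statistic (assumes both dicts share keys)."""
    scores = [ks_statistic(real_cols[c], synth_cols[c]) for c in real_cols]
    return sum(scores) / len(scores)


# Toy data (made up): one column matches exactly, one does not overlap at all.
real = {"age": [23, 35, 41, 58], "income": [1.0, 2.0, 3.0, 4.0]}
synth = {"age": [23, 35, 41, 58], "income": [5.0, 6.0, 7.0, 8.0]}
print(mean_ks(real, synth))  # 0.5: the matching column scores 0, the disjoint one 1
```

Under this reading a lower statistic means closer marginals. If the benchmark reports the complement instead, higher is better, so check the direction before comparing scores.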
