Language Models are Realistic Tabular Data Generators

About

Tabular data is among the oldest and most ubiquitous forms of data. However, generating synthetic samples that preserve the original data's characteristics remains a significant challenge. While many generative models from the computer vision domain, such as variational autoencoders and generative adversarial networks, have been adapted for tabular data generation, less research has been directed towards recent transformer-based large language models (LLMs), which are also generative in nature. To this end, we propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative LLM to sample synthetic yet highly realistic tabular data. Furthermore, GReaT can model tabular data distributions by conditioning on any subset of features; the remaining features are sampled without additional overhead. We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles. We find that GReaT maintains state-of-the-art performance across numerous real-world and synthetic data sets of various sizes with heterogeneous feature types.
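To make the idea of sampling rows with an auto-regressive LLM and conditioning on a subset of features more concrete, here is a minimal sketch of one way a table row could be turned into text and back. It is not the authors' implementation. The "name is value" sentence format, the random feature order, and the helper names `encode_row`, `conditioning_prompt`, and `decode_row` are assumptions made for illustration, and no language model is called: the sketch covers only the serialization around it.

```python
import random


def encode_row(row, rng):
    """Serialize one table row as text, visiting features in random order.

    Assumption: each feature becomes a "name is value" clause. Shuffling the
    order means a model trained on such text sees every feature in every
    position, which is what would allow conditioning on any subset.
    """
    items = list(row.items())
    rng.shuffle(items)
    return ", ".join(f"{name} is {value}" for name, value in items)


def conditioning_prompt(known):
    """Build a text prefix from the known features.

    A trained auto-regressive model would be asked to continue this prefix,
    filling in the remaining features.
    """
    return ", ".join(f"{name} is {value}" for name, value in known.items()) + ", "


def decode_row(text, columns):
    """Parse "name is value" clauses back into a dict of strings.

    Clauses that are malformed or name an unknown column are dropped.
    Limitation of this sketch: values containing commas are not handled.
    """
    row = {}
    for clause in text.split(","):
        name, sep, value = clause.strip().partition(" is ")
        if sep and name in columns:
            row[name] = value.strip()
    return row
```

In this sketch, a prefix such as the one returned by `conditioning_prompt({"age": 39})` would be handed to the model, and `decode_row` would turn the completed text back into a row; values come back as strings and would still need casting to their column types.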

Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, Gjergji Kasneci • 2022

Related benchmarks

Task	Dataset	Result
Anomaly Detection	WBC	ROC AUC 0.908	151
Regression	California Housing (CH) (test)	MSE 0.28	52
Classification	Diabetes (test)	Accuracy 58.34	49
Tabular Synthetic Data Generation	DEFAULT	C2ST 11.31	43
Classification	magic	F1 Score 76	36
Regression	Insurance	R^2 0.72	36
Classification	Adult	F1 76	34
Tabular Classification	Diabetes (test)	Accuracy 82.94	32
Regression	ERICH	R^2 0.38	32
Classification	HELOC (test)	Accuracy 82.14	31
