Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees

About

Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at https://github.com/SamsungSAILMontreal/ForestDiffusion.

Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman • 2023
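The core idea in the abstract, regressing a flow-matching vector field with boosted trees instead of a neural network, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes one-dimensional continuous data, the convention x_t = (1 - t) * x0 + t * x1 with x0 Gaussian noise and x1 a data point, one model per noise level, and a small hand-rolled booster of depth-1 trees standing in for XGBoost so the example has no dependencies. All function names are hypothetical.

```python
import random

def fit_boosted_stumps(xs, ys, rounds=20, lr=0.3, max_cuts=16):
    """Least-squares gradient boosting with depth-1 trees (stumps) on 1-D inputs.
    A stand-in for XGBoost, used here only to keep the sketch self-contained."""
    uniq = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct inputs, subsampled.
    cuts = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    cuts = cuts[::max(1, len(cuts) // max_cuts)]
    preds = [0.0] * len(xs)
    model = []  # list of (threshold, left_value, right_value)
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, preds)]
        best = None
        for c in cuts:
            left = [r for x, r in zip(xs, resid) if x <= c]
            right = [r for x, r in zip(xs, resid) if x > c]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, c, lr * lm, lr * rm)
        if best is None:  # no valid split (all inputs identical)
            break
        _, c, lv, rv = best
        model.append((c, lv, rv))
        preds = [p + (lv if x <= c else rv) for x, p in zip(xs, preds)]
    return model

def predict(model, x):
    """Sum the contributions of all stumps at input x."""
    return sum(lv if x <= c else rv for c, lv, rv in model)

def train_flow(data, n_levels=5, n_noise=10, seed=0):
    """Fit one boosted model per noise level t = 0, 1/n, ..., (n-1)/n.
    Each data point is paired with n_noise fresh noise draws."""
    rng = random.Random(seed)
    models = []
    for i in range(n_levels):
        t = i / n_levels
        xs, ys = [], []
        for x1 in data:
            for _ in range(n_noise):
                x0 = rng.gauss(0.0, 1.0)
                xs.append((1 - t) * x0 + t * x1)  # point on the straight path
                ys.append(x1 - x0)                # conditional velocity target
        models.append(fit_boosted_stumps(xs, ys))
    return models

def generate(models, n, seed=1):
    """Euler-integrate the learned vector field from noise (t=0) to data (t=1)."""
    rng = random.Random(seed)
    h = 1.0 / len(models)
    samples = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        for m in models:
            x += h * predict(m, x)
        samples.append(x)
    return samples

if __name__ == "__main__":
    # Toy bimodal data with clusters near -2 and +2.
    toy = [-2.0 + 0.01 * i for i in range(15)] + [2.0 + 0.01 * i for i in range(15)]
    print(generate(train_flow(toy), 5))
```

The sketch omits categorical features, multiple columns, and imputation, all of which the paper addresses. The released code at the repository linked above is the reference for the actual method.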

Related benchmarks

Task	Dataset	Result
Binary Classification	MagicTel 20%	PR AUC 75.6	36
Binary Classification	Haberman 10%	PR AUC 0.306	36
Binary Classification	CreditCard 0.2%	PR AUC 77	36
Binary Classification	Abalone 1%	PR AUC 5.6	24
Binary Classification	Phoneme 1%	PR AUC 0.249	24
Binary Classification	California 1%	PR AUC 29.4	24
Classification	ionosphere	PR AUC 96.9	24
Multi-class classification	Yeast	--	20
Tabular Data Imputation	MissBench (overall)	MCAR Score 81.9	15
Tabular Imputation	MissBench (test)	MCAR Score 0.22	15

Showing 10 of 38 rows
