Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees
About
Tabular data is hard to acquire and is subject to missing values. This paper introduces a novel approach for generating and imputing mixed-type (continuous and categorical) tabular data utilizing score-based diffusion and conditional flow matching. In contrast to prior methods that rely on neural networks to learn the score function or the vector field, we adopt XGBoost, a widely used Gradient-Boosted Tree (GBT) technique. To test our method, we build one of the most extensive benchmarks for tabular data generation and imputation, containing 27 diverse datasets and 9 metrics. Through empirical evaluation across the benchmark, we demonstrate that our approach outperforms deep-learning generation methods in data generation tasks and remains competitive in data imputation. Notably, it can be trained in parallel using CPUs without requiring a GPU. Our Python and R code is available at https://github.com/SamsungSAILMontreal/ForestDiffusion.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Tabular Data Imputation | MissBench (overall) | MCAR Score81.9 | 15 | |
| Tabular Imputation | MissBench (test) | MCAR Score0.22 | 15 | |
| Imputation | OpenML MCAR, Missing Probability 0.4 (test) | MAD0.181 | 13 |