Doubling Your Data in Minutes: Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs
About
Tabular data is critical across diverse domains, yet high-quality datasets remain scarce due to privacy concerns and the cost of collection. Contemporary approaches adopt large language models (LLMs) for tabular augmentation, but they exhibit two major limitations: (1) dense dependency modeling among tabular features, which can introduce bias, and (2) high computational overhead in sampling. To address these issues, we propose SPADA (SPArse Dependency-driven Augmentation), a lightweight generative framework that explicitly captures sparse dependencies via an LLM-induced graph. We treat each feature as a node and synthesize values by traversing the graph, conditioning each feature solely on its parent nodes. We explore two synthesis strategies: a non-parametric method based on Gaussian kernel density estimation, and a conditional normalizing flow model that learns invertible mappings for conditional density estimation. Experiments on four datasets show that SPADA reduces constraint violations by 4% compared to diffusion-based methods and accelerates generation by nearly 9,500 times over LLM-based baselines.
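The graph-traversal idea with the non-parametric strategy can be illustrated with a minimal sketch. This is not the authors' implementation: the toy rows, the `parents` graph (which an LLM would supply in the described framework), the fixed `bandwidth` values, and the helper names `kernel_weight` and `sample_row` are all assumptions made for illustration. Each feature is visited in topological order; training rows are weighted by a Gaussian kernel on the already-sampled parent values, one row is drawn by those weights, and Gaussian noise is added to its value, which amounts to sampling from a conditional Gaussian KDE.

```python
import math
import random
from graphlib import TopologicalSorter

def kernel_weight(row, cond, bandwidth):
    """Gaussian kernel weight of a training row given sampled parent values."""
    sq = sum(((row[p] - v) / bandwidth[p]) ** 2 for p, v in cond.items())
    return math.exp(-0.5 * sq)  # equals 1.0 when the feature has no parents

def sample_row(train, parents, bandwidth, rng):
    """Sample one synthetic row, visiting parents before their children."""
    order = TopologicalSorter(parents).static_order()
    out = {}
    for feat in order:
        # Condition only on the feature's parents (sparse dependencies).
        cond = {p: out[p] for p in parents[feat]}
        weights = [kernel_weight(r, cond, bandwidth) for r in train]
        if sum(weights) == 0.0:  # all weights underflowed: fall back to uniform
            weights = [1.0] * len(train)
        centre = rng.choices(train, weights=weights, k=1)[0][feat]
        out[feat] = rng.gauss(centre, bandwidth[feat])
    return out

# Hypothetical toy table and dependency graph (feature -> list of parents).
train = [
    {"age": 25.0, "income": 30.0, "spend": 8.0},
    {"age": 40.0, "income": 55.0, "spend": 15.0},
    {"age": 60.0, "income": 70.0, "spend": 18.0},
]
parents = {"age": [], "income": ["age"], "spend": ["income"]}
bandwidth = {"age": 3.0, "income": 4.0, "spend": 1.0}  # assumed, not tuned

rng = random.Random(0)
synthetic = [sample_row(train, parents, bandwidth, rng) for _ in range(5)]
```

Because each feature looks only at its parents, the per-row cost grows with the number of edges in the graph rather than with all feature pairs. The conditional normalizing flow variant would replace the kernel-weighted draw with a learned conditional density.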
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Classification | Electricity (test) | Accuracy | 76.37 | 55 |
| Classification | UCI Mice Protein (test) | Accuracy | 96.13 | 50 |
| Classification | Fourier (test) | Accuracy | 77.58 | 45 |
| Classification | Steel (test) | Accuracy | 98.67 | 45 |
| Classification | Credit-g (test) | Accuracy | 69.53 | 45 |
| Regression | Insurance (test) | RMSE | 0.714 | 45 |
| Tabular Classification | Diabetes (test) | Accuracy | 81.17 | 32 |
| Classification | Income (test) | F1 Score | 75 | 27 |
| Tabular Classification | MIC (test) | Accuracy | 97.06 | 18 |
| Tabular Regression | Housing (test) | MAPE | 25 | 18 |