Doubling Your Data in Minutes: Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs

About

Tabular data is critical across diverse domains, yet high-quality datasets remain scarce due to privacy concerns and the cost of collection. Contemporary approaches adopt large language models (LLMs) for tabular augmentation, but they exhibit two major limitations: (1) dense dependency modeling among tabular features that can introduce bias, and (2) high computational overhead in sampling. To address these issues, we propose SPADA (SPArse Dependency-driven Augmentation), a lightweight generative framework that explicitly captures sparse dependencies via an LLM-induced graph. We treat each feature as a node and synthesize values by traversing the graph, conditioning each feature solely on its parent nodes. We explore two synthesis strategies: a non-parametric method using Gaussian kernel density estimation, and a conditional normalizing flow model that learns invertible mappings for conditional density estimation. Experiments on four datasets show that SPADA reduces constraint violations by 4% compared to diffusion-based methods and accelerates generation by nearly 9,500 times over LLM-based baselines.
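The graph traversal described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`bandwidth`, `sample_rows`), the toy data, the hand-written dependency graph (standing in for an LLM-induced one), and the normal-reference bandwidth rule are all assumptions made for illustration. It handles numeric features only and samples each feature from a Gaussian product-kernel density estimate conditioned on its already-sampled parents.

```python
import math
import random
from graphlib import TopologicalSorter  # Python 3.9+


def bandwidth(values):
    """Normal-reference rule-of-thumb bandwidth for one numeric column (assumed choice)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / max(n - 1, 1))
    # Fall back to a small constant if the column has zero variance.
    return 1.06 * std * n ** (-1 / 5) or 1e-3


def sample_rows(rows, parents, n_samples, rng):
    """Sample synthetic rows by visiting features in topological order.

    rows:    list of dicts mapping feature name -> float (the real data)
    parents: dict mapping feature name -> list of parent feature names (a DAG)
    """
    # TopologicalSorter takes node -> predecessors, so parents come first.
    order = list(TopologicalSorter(parents).static_order())
    h = {f: bandwidth([r[f] for r in rows]) for f in order}
    synthetic = []
    for _ in range(n_samples):
        new = {}
        for f in order:
            # Log kernel weight of each real row given the sampled parent values.
            log_w = [
                -0.5 * sum(((new[p] - r[p]) / h[p]) ** 2 for p in parents.get(f, ()))
                for r in rows
            ]
            top = max(log_w)  # subtract the max to avoid underflow to all zeros
            weights = [math.exp(lw - top) for lw in log_w]
            # The conditional KDE is a Gaussian mixture: pick a component, add noise.
            center = rng.choices(rows, weights=weights, k=1)[0]
            new[f] = rng.gauss(center[f], h[f])
        synthetic.append(new)
    return synthetic


# Toy data and a hypothetical sparse graph: income depends only on age.
toy_rows = [
    {"age": 25.0, "income": 30.0},
    {"age": 32.0, "income": 42.0},
    {"age": 47.0, "income": 61.0},
    {"age": 58.0, "income": 70.0},
]
toy_parents = {"age": [], "income": ["age"]}
synthetic = sample_rows(toy_rows, toy_parents, n_samples=5, rng=random.Random(0))
```

Root features have no parents, so every real row gets equal weight and the draw reduces to ordinary KDE sampling. For a child feature, rows whose parent values lie close to the sampled ones dominate the mixture, which is how the conditioning restricts each feature to its parents only.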

Shuo Yang, Zheyu Zhang, Bardh Prenkaj, Gjergji Kasneci • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Classification | Electricity (test) | Accuracy 76.37 | 55 |
| Classification | UCI Mice Protein (test) | Accuracy 96.13 | 50 |
| Classification | Fourier (test) | Accuracy 77.58 | 45 |
| Classification | Steel (test) | Accuracy 98.67 | 45 |
| Classification | Credit-g (test) | Accuracy 69.53 | 45 |
| Regression | Insurance (test) | RMSE 0.714 | 45 |
| Tabular Classification | Diabetes (test) | Accuracy 81.17 | 32 |
| Classification | Income (test) | F1 Score 75 | 27 |
| Tabular Classification | MIC (test) | Accuracy 97.06 | 18 |
| Tabular Regression | Housing (test) | MAPE 25 | 18 |
Showing 10 of 21 rows
