Doubling Your Data in Minutes: Ultra-fast Tabular Data Generation via LLM-Induced Dependency Graphs

About

Tabular data is critical across diverse domains, yet high-quality datasets remain scarce due to privacy concerns and the cost of collection. Contemporary approaches adopt large language models (LLMs) for tabular augmentation, but exhibit two major limitations: (1) dense dependency modeling among tabular features that can introduce bias, and (2) high computational overhead in sampling. To address these issues, we propose SPADA for SPArse Dependency-driven Augmentation, a lightweight generative framework that explicitly captures sparse dependencies via an LLM-induced graph. We treat each feature as a node and synthesize values by traversing the graph, conditioning each feature solely on its parent nodes. We explore two synthesis strategies: a non-parametric method using Gaussian kernel density estimation, and a conditional normalizing flow model that learns invertible mappings for conditional density estimation. Experiments on four datasets show that SPADA reduces constraint violations by 4% compared to diffusion-based methods and accelerates generation by nearly 9,500 times over LLM-based baselines.

Shuo Yang, Zheyu Zhang, Bardh Prenkaj, Gjergji Kasneci • 2025
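
The graph traversal described in the abstract can be illustrated with a minimal sketch of the non-parametric variant. This is not the authors' implementation. It assumes the dependency graph is already available as a plain dictionary (written by hand here rather than induced by an LLM), that all features are numeric, and that a Gaussian product kernel with per-feature bandwidths from Scott's rule is an acceptable stand-in for the paper's kernel density estimator. The names `sample_rows` and `scott_bandwidth` are hypothetical.

```python
from graphlib import TopologicalSorter

import numpy as np


def scott_bandwidth(column, n_dims):
    """Per-feature bandwidth via Scott's rule: sigma * n^(-1 / (d + 4))."""
    n = len(column)
    return column.std(ddof=1) * n ** (-1.0 / (n_dims + 4))


def sample_rows(data, parents, n_samples, rng):
    """Generate synthetic rows by visiting features in topological order.

    data:    dict mapping feature name -> 1-D float array of observed values
    parents: dict mapping feature name -> list of parent feature names
             (assumed given; in the paper this graph comes from an LLM)
    """
    order = list(TopologicalSorter(parents).static_order())
    n = len(next(iter(data.values())))
    out = {}

    for name in order:
        pa = parents.get(name, [])
        dims = len(pa) + 1
        # Log kernel weight of every training row for every synthetic row,
        # computed from the already-sampled parent values only.
        log_w = np.zeros((n_samples, n))
        for p in pa:
            h_p = scott_bandwidth(data[p], dims)
            diff = (out[p][:, None] - data[p][None, :]) / h_p
            log_w -= 0.5 * diff ** 2
        # Normalise the weights row by row (uniform when there are no parents).
        log_w -= log_w.max(axis=1, keepdims=True)
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)
        # Pick one training row per synthetic row, then add kernel noise
        # to the child value: a draw from the conditional KDE.
        idx = np.array([rng.choice(n, p=row) for row in w])
        h_child = scott_bandwidth(data[name], dims)
        out[name] = data[name][idx] + h_child * rng.standard_normal(n_samples)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    age = rng.normal(40, 10, 300)
    income = 1000 * age + rng.normal(0, 2000, 300)
    synthetic = sample_rows(
        {"age": age, "income": income},
        {"age": [], "income": ["age"]},  # income depends only on age
        n_samples=5,
        rng=rng,
    )
    print(synthetic)
```

Because each feature is conditioned only on its parents, the cost per feature grows with the number of parents rather than with the full column count, which is the sparsity the abstract refers to. The conditional normalizing flow variant would replace the kernel-weighted draw with a learned invertible model and is not sketched here.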

Related benchmarks

Task	Dataset	Metric	Result
Classification	Electricity (test)	Accuracy	76.37	55
Classification	UCI Mice Protein (test)	Accuracy	96.13	50
Classification	Fourier (test)	Accuracy	77.58	45
Classification	Steel (test)	Accuracy	98.67	45
Classification	Credit-g (test)	Accuracy	69.53	45
Regression	Insurance (test)	RMSE	0.714	45
Tabular Classification	Diabetes (test)	Accuracy	81.17	32
Classification	Income (test)	F1 Score	75	27
Classification	Iris	Accuracy	67.5	19
Tabular Classification	MIC (test)	Accuracy	97.06	18

Showing 10 of 21 rows
