TabGen-ICL: Residual-Aware In-Context Example Selection for Tabular Data Generation
About
Large Language Models (LLMs) have achieved encouraging results in tabular data generation. However, existing approaches require fine-tuning, which is computationally expensive. This paper explores an alternative: prompting a fixed LLM with in-context examples. We observe that using randomly selected in-context examples hampers the LLM's performance, resulting in sub-optimal generation quality. To address this, we propose a novel in-context learning framework, TabGen-ICL, to enhance the in-context learning ability of LLMs for tabular data generation. TabGen-ICL operates iteratively, retrieving a subset of real samples that represents the residual between the currently generated samples and the true data distribution. This serves two purposes: locally, it provides more effective in-context examples for the LLM in each iteration; globally, it progressively narrows the gap between generated and real data. Extensive experiments on five real-world tabular datasets demonstrate that TabGen-ICL significantly outperforms the random selection strategy, reducing the error rate by 3.5%-42.2% on fidelity metrics. We demonstrate for the first time that prompting a fixed LLM can yield high-quality synthetic tabular data. The code is available at [https://github.com/fangliancheng/TabGEN-ICL](https://github.com/fangliancheng/TabGEN-ICL).
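The loop below is a minimal, runnable sketch of the residual-aware selection idea, not the authors' implementation: the function names (`residual_weights`, `select_icl_examples`, `llm_generate_stub`) are hypothetical, the residual is approximated by a one-dimensional marginal shortfall, and a noise-injecting stub stands in for the actual LLM prompt.

```python
# Sketch of residual-aware in-context example selection (illustration only).
# Assumptions: single numeric column, histogram-based residual, stubbed LLM call.
import numpy as np

rng = np.random.default_rng(0)

def residual_weights(real, generated, bins=20):
    """Per-bin shortfall of the generated marginal vs. the real marginal."""
    edges = np.histogram_bin_edges(real, bins=bins)
    p_real, _ = np.histogram(real, bins=edges)
    p_gen, _ = np.histogram(generated, bins=edges)
    p_real = p_real / p_real.sum()
    p_gen = p_gen / max(p_gen.sum(), 1)
    shortfall = np.clip(p_real - p_gen, 0, None)  # regions the generation under-covers
    return shortfall, edges

def select_icl_examples(real, generated, k=8):
    """Pick k real samples from the regions the current generation misses most."""
    shortfall, edges = residual_weights(real, generated)
    bin_of = np.clip(np.digitize(real, edges) - 1, 0, len(shortfall) - 1)
    weights = shortfall[bin_of] + 1e-9  # residual-aware sampling weights
    weights = weights / weights.sum()
    idx = rng.choice(len(real), size=k, replace=False, p=weights)
    return real[idx]

def llm_generate_stub(examples, n=64):
    """Placeholder for the LLM call: imitate the in-context examples with noise."""
    return rng.choice(examples, size=n) + rng.normal(0, 0.05, size=n)

# Toy run: bimodal real data; random example selection would cover it unevenly.
real = np.concatenate([rng.normal(-2, 0.3, 500), rng.normal(2, 0.3, 500)])
generated = np.array([])
for step in range(5):
    examples = select_icl_examples(real, generated)  # residual-aware selection
    new_batch = llm_generate_stub(examples)          # would be an LLM prompt in practice
    generated = np.concatenate([generated, new_batch])
    gap = residual_weights(real, generated)[0].sum()
    print(f"step {step}: remaining marginal shortfall = {gap:.3f}")
```

In this toy setting the printed shortfall shrinks across iterations, mirroring the paper's claim that residual-aware selection progressively narrows the gap between generated and real data.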
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Tabular Data Utility | Adult (test) | AUC | 0.894 | 14 |
| Tabular Data Utility | Magic (test) | AUC | 0.903 | 14 |
| Tabular Data Utility | California (test) | AUC | 0.975 | 14 |
| Tabular Data Utility | Default (test) | AUC | 0.713 | 14 |
| Tabular Data Synthesis | Aggregate of five tabular datasets (full train vs. original train) | Marginal Error | 9.14 | 13 |
| Tabular Data Utility | Shoppers (test) | AUC | 0.879 | 13 |
| Tabular Data Generation | Covertype (test) | Marginal | 5.09 | 4 |