Breaking the Quality-Privacy Tradeoff in Tabular Data Generation via In-Context Learning
About
Tabular data synthesis aims to generate high-quality data while preserving privacy. However, we find that existing tabular generative models exhibit a clear tradeoff in the small-data regime: improving data quality typically comes at the cost of increased memorization of training samples, thereby weakening privacy protection. This tradeoff arises because small training sets make it difficult for dataset-specific generative models to distinguish generalizable structure from sample-specific patterns. To address this, we propose DiffICL, which formulates tabular data generation as an in-context learning problem. Instead of fitting each dataset from scratch, DiffICL leverages pretrained structural priors learned from a large collection of datasets, enabling it to infer data distributions from limited context rather than memorizing individual samples. We evaluate DiffICL on 14 real-world datasets. Results show that DiffICL improves both data quality and privacy, and generates synthetic data that provides effective data augmentation. Our findings suggest that the quality-privacy tradeoff can be improved through better training paradigms.
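The abstract does not say how memorization is measured. As an illustration only, the sketch below uses one common proxy: the fraction of synthetic rows that lie strictly closer to a training row than to any held-out row. The function names `nearest_distance` and `memorization_rate`, the Euclidean metric, and the toy data are all assumptions for this example, not DiffICL's evaluation protocol.

```python
import math


def nearest_distance(row, reference_rows):
    """Euclidean distance from `row` to its closest row in `reference_rows`."""
    return min(math.dist(row, ref) for ref in reference_rows)


def memorization_rate(synthetic, train, holdout):
    """Fraction of synthetic rows strictly closer to train than to holdout.

    Hypothetical proxy: values well above the share expected by chance
    hint that the generator is reproducing training samples.
    """
    closer = sum(
        1
        for row in synthetic
        if nearest_distance(row, train) < nearest_distance(row, holdout)
    )
    return closer / len(synthetic)


# Toy numeric data (assumed for illustration only).
train = [(0.0, 0.0), (1.0, 1.0)]
holdout = [(0.0, 1.0), (1.0, 0.0)]

copied = [(0.0, 0.0), (1.0, 1.0)]  # exact copies of training rows
fresh = [(0.5, 0.5), (0.0, 0.9)]   # rows that do not favor the training set

print(memorization_rate(copied, train, holdout))  # 1.0
print(memorization_rate(fresh, train, holdout))   # 0.0
```

When the training and held-out sets have the same size and come from the same distribution, a generator that does not memorize would be expected to score roughly 0.5 on this proxy. Real tabular data would also need categorical encoding and feature scaling before any distance is computed.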
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Tabular Data Augmentation | 9 Datasets | XGBoost Score | 74.6 | 9 |
| Tabular Data Augmentation | 14 Datasets | XGBoost Score | 79.8 | 8 |
| Synthetic Data Generation | Diabetes | Running Time | 18.17 | 8 |
| Synthetic Data Generation | cpu activity | Running Time | 41.37 | 8 |
| Synthetic Data Generation | kr-vs-kp | Running Time | 40.8 | 8 |
| Synthetic Data Generation | health insurance | Running Time | 69.16 | 8 |