Breaking the Quality-Privacy Tradeoff in Tabular Data Generation via In-Context Learning

About

Tabular data synthesis aims to generate high-quality data while preserving privacy. However, we find that existing tabular generative models exhibit a clear tradeoff in the small-data regime: improving data quality typically comes at the cost of increased memorization of training samples, thereby weakening privacy protection. This tradeoff arises because small training sets make it difficult for dataset-specific generative models to distinguish generalizable structure from sample-specific patterns. To address this, we propose DiffICL, which formulates tabular data generation as an in-context learning problem. Instead of fitting each dataset from scratch, DiffICL leverages pretrained structural priors learned from a large collection of datasets, enabling it to infer data distributions from limited context rather than memorizing individual samples. We evaluate DiffICL on 14 real-world datasets. Results show that DiffICL improves both data quality and privacy, and generates synthetic data that provides effective data augmentation. Our findings suggest that the quality-privacy tradeoff can be improved through better training paradigms.
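To make the notion of memorization concrete, here is a minimal sketch of one commonly used proxy, the distance to closest record (DCR): for each synthetic row, the distance to its nearest training row. This is an illustrative assumption and not necessarily the privacy metric used in the paper. The function name, the two toy "generators", and the purely numerical features are all hypothetical.

```python
import numpy as np

def distance_to_closest_record(synthetic, train):
    """Euclidean distance from each synthetic row to its nearest training row."""
    # Broadcast to pairwise differences: shape (n_synthetic, n_train, n_features).
    diffs = synthetic[:, None, :] - train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return dists.min(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(50, 4))  # a small training set (toy data)

# Hypothetical "memorizing" generator: training rows plus tiny noise.
memorized = train[rng.integers(0, 50, size=200)] + rng.normal(scale=1e-3, size=(200, 4))
# Hypothetical "generalizing" generator: fresh draws from the same distribution.
generalized = rng.normal(size=(200, 4))

dcr_mem = distance_to_closest_record(memorized, train)
dcr_gen = distance_to_closest_record(generalized, train)

print("median DCR, memorizing generator:  ", np.median(dcr_mem))
print("median DCR, generalizing generator:", np.median(dcr_gen))
```

Under these assumptions, a generator that copies training rows yields DCR values near zero, while one that samples from the underlying distribution keeps a larger distance. Real tabular data would additionally need feature scaling and a treatment of categorical columns.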

Xinyan Han, Yan Lu, Xiaoyu Lin, Yuanyuan Jiang, Yuanrui Wang, Xuanyue Li, Wenchao Zou, Xingxuan Zhang • 2026

Related benchmarks

Task	Dataset	Result
Tabular Data Augmentation	9 Datasets	XGBoost Score 74.6	9
Tabular Data Augmentation	14 Datasets	XGBoost Score 79.8	8
Synthetic Data Generation	Diabetes	Running Time 18.17	8
Synthetic Data Generation	cpu activity	Running Time 41.37	8
Synthetic Data Generation	kr-vs-kp	Running Time 40.8	8
Synthetic Data Generation	health insurance	Running Time 69.16	8

