Breaking the Quality-Privacy Tradeoff in Tabular Data Generation via In-Context Learning

About

Tabular data synthesis aims to generate high-quality data while preserving privacy. However, we find that existing tabular generative models exhibit a clear tradeoff in the small-data regime: improving data quality typically comes at the cost of increased memorization of training samples, thereby weakening privacy protection. This tradeoff arises because small training sets make it difficult for dataset-specific generative models to distinguish generalizable structure from sample-specific patterns. To address this, we propose DiffICL, which formulates tabular data generation as an in-context learning problem. Instead of fitting each dataset from scratch, DiffICL leverages pretrained structural priors learned from a large collection of datasets, enabling it to infer data distributions from limited context rather than memorizing individual samples. We evaluate DiffICL on 14 real-world datasets. Results show that DiffICL improves both data quality and privacy, and generates synthetic data that provides effective data augmentation. Our findings suggest that the quality-privacy tradeoff can be improved through better training paradigms.
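
To make "memorization of training samples" concrete, the sketch below shows one common way such memorization is probed: checking whether synthetic rows sit closer to the training set than to a held-out set drawn from the same distribution. The abstract does not say which privacy metric the paper uses, so this distance-based check, the function names `nearest_distances` and `closer_to_train_share`, and the toy Gaussian data are all assumptions made for illustration only.

```python
import numpy as np


def nearest_distances(queries, reference):
    """Euclidean distance from each query row to its closest reference row."""
    diffs = queries[:, None, :] - reference[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


def closer_to_train_share(synthetic, train, holdout):
    """Fraction of synthetic rows strictly closer to a training row than to
    any holdout row. A generator that copies training rows pushes this
    fraction towards 1 (hypothetical helper, not the paper's metric)."""
    d_train = nearest_distances(synthetic, train)
    d_holdout = nearest_distances(synthetic, holdout)
    return float(np.mean(d_train < d_holdout))


# Toy data: numeric features only, all drawn from the same distribution.
rng = np.random.default_rng(0)
train = rng.normal(size=(50, 4))
holdout = rng.normal(size=(50, 4))

# A "memorizing" generator: near-copies of training rows.
copied = train[:20] + 1e-6
# A "generalizing" generator: fresh draws from the same distribution.
fresh = rng.normal(size=(20, 4))

share_copied = closer_to_train_share(copied, train, holdout)
share_fresh = closer_to_train_share(fresh, train, holdout)
print(share_copied, share_fresh)
```

When the training and holdout sets are the same size, a share well above one half suggests the generator is reproducing sample-specific patterns, while a share near one half is what sampling from the underlying distribution would give. Real tabular data would additionally need categorical encoding and feature scaling before distances are meaningful.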

Xinyan Han, Yan Lu, Xiaoyu Lin, Yuanyuan Jiang, Yuanrui Wang, Xuanyue Li, Wenchao Zou, Xingxuan Zhang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Tabular Data Augmentation | 9 Datasets | XGBoost Score: 74.6 | 9
Tabular Data Augmentation | 14 Datasets | XGBoost Score: 79.8 | 8
Synthetic Data Generation | Diabetes | Running Time: 18.17 | 8
Synthetic Data Generation | cpu activity | Running Time: 41.37 | 8
Synthetic Data Generation | kr-vs-kp | Running Time: 40.8 | 8
Synthetic Data Generation | health insurance | Running Time: 69.16 | 8