Private Seeds, Public LLMs: Realistic and Privacy-Preserving Synthetic Data Generation
About
Large language models (LLMs) have emerged as a powerful tool for synthetic data generation. A particularly important use case is producing synthetic replicas of private text, which requires carefully balancing privacy and utility. We propose Realistic and Privacy-Preserving Synthetic Data Generation (RPSG), which uses private seeds and integrates privacy-preserving strategies, including a formal differential privacy (DP) mechanism in the candidate selection, to generate realistic synthetic data. Comprehensive experiments against state-of-the-art private synthetic data generation methods demonstrate that RPSG achieves high fidelity to private data while providing strong privacy protection.
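The abstract mentions a formal DP mechanism in candidate selection but does not spell it out. A standard way to privately select one candidate from a scored pool is the exponential mechanism; the sketch below is an illustrative assumption, not the paper's actual algorithm (the function name `dp_select`, the scoring interface, and the sensitivity parameter are all hypothetical):

```python
import math
import random

def dp_select(candidates, scores, epsilon, sensitivity=1.0):
    """Exponential mechanism: sample one candidate with probability
    proportional to exp(epsilon * score / (2 * sensitivity)).

    Higher-scoring candidates are more likely to be chosen, but every
    candidate retains nonzero probability, which yields epsilon-DP
    when `scores` has the given sensitivity w.r.t. the private data.
    """
    m = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(epsilon * (s - m) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for cand, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return cand
    return candidates[-1]  # fallback for floating-point edge cases
```

With a large epsilon the selection concentrates on the top-scoring candidate; as epsilon approaches zero it degrades to uniform sampling, which is the usual privacy/utility dial.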
Qian Ma, Sarah Rajtmajer • 2026
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Next-token prediction | PubMed | Next Token Accuracy | 36.1 | 40 |
| Next-word prediction | | Accuracy | 35.9 | 12 |
| Distributional and Semantic Similarity Evaluation | | FID | 0.07 | 12 |
| Membership Inference Attack | Reddit (test) | AUC (PPL) | 54.3 | 12 |
| Lexical Diversity | | Self-BLEU | 41 | 12 |
| Distributional and Semantic Similarity Evaluation | PubMed | FID | 0.03 | 8 |
| Lexical Diversity | PubMed | Self-BLEU | 33 | 8 |
| Membership Inference Attack | PubMed (test) | AUC (PPL) | 53.3 | 8 |
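The table's AUC (PPL) rows report a perplexity-based membership inference attack: lower perplexity under the target model is taken as evidence of membership, and AUC measures how well that signal separates members from non-members (50 is chance, so 53–54 indicates weak leakage). A minimal sketch of that AUC computation, assuming the perplexity scores have already been collected (the function name and inputs are illustrative, not from the paper):

```python
def mia_auc(member_ppl, nonmember_ppl):
    """AUC of a perplexity-threshold membership inference attack.

    Equals the probability that a randomly chosen member sample has
    *lower* perplexity than a randomly chosen non-member sample
    (lower PPL -> predicted member); ties count as 0.5. This is the
    Mann-Whitney U statistic normalized to [0, 1].
    """
    wins = 0.0
    for m in member_ppl:
        for n in nonmember_ppl:
            if m < n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_ppl) * len(nonmember_ppl))
```

An AUC near 0.5 (here reported as a percentage, i.e. ~50) means the attack cannot distinguish training members from held-out text, which is the desired outcome for a privacy-preserving generator.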