Winning the NIST Contest: A scalable and general approach to differentially private synthetic data
About
We propose a general approach for differentially private synthetic data generation that consists of three steps: (1) select a collection of low-dimensional marginals, (2) measure those marginals with a noise addition mechanism, and (3) generate synthetic data that preserves the measured marginals well. Central to this approach is Private-PGM, a post-processing method that estimates a high-dimensional data distribution from noisy measurements of its marginals. We present two mechanisms, NIST-MST and MST, that are instances of this general approach. NIST-MST was the winning mechanism in the 2018 NIST differential privacy synthetic data competition, and MST is a new mechanism that works in more general settings while still performing comparably to NIST-MST. We believe our general approach should be of broad interest and can be adopted in future mechanisms for synthetic data generation.
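The select, measure, generate pipeline described above can be illustrated with a toy sketch. This is not the authors' implementation and does not use Private-PGM: the function names `measure_marginal` and `sample_from_marginal`, the fixed choice of a single attribute pair, and the noise scale `sigma` are all assumptions made for illustration. In particular, step 3 here simply samples from one clipped and normalized noisy marginal, which is a naive stand-in for Private-PGM's estimation of a full joint distribution, and `sigma` is not calibrated to any privacy budget.

```python
import numpy as np


def measure_marginal(data, cols, domain, sigma, rng):
    """Step 2 (sketch): contingency table over `cols` plus Gaussian noise.

    `data` is an integer array of shape (n_records, n_attributes);
    `domain[c]` is the number of distinct values of attribute c.
    """
    shape = tuple(domain[c] for c in cols)
    counts = np.zeros(shape)
    # Count how many records fall into each cell of the marginal.
    np.add.at(counts, tuple(data[:, c] for c in cols), 1)
    # Assumption: sigma is chosen elsewhere to satisfy the privacy budget.
    return counts + rng.normal(0.0, sigma, size=shape)


def sample_from_marginal(noisy, n, rng):
    """Step 3 (naive stand-in for Private-PGM): sample records from one marginal."""
    probs = np.clip(noisy, 0.0, None).ravel()  # noisy counts can be negative
    if probs.sum() == 0:
        probs = np.ones_like(probs)  # fall back to uniform
    probs = probs / probs.sum()
    flat = rng.choice(probs.size, size=n, p=probs)
    return np.stack(np.unravel_index(flat, noisy.shape), axis=1)


rng = np.random.default_rng(0)
domain = {0: 3, 1: 4}  # hypothetical attribute domain sizes
data = np.stack([rng.integers(0, 3, 1000), rng.integers(0, 4, 1000)], axis=1)

cols = (0, 1)  # Step 1 (sketch): the "selected" marginal is fixed by hand
noisy = measure_marginal(data, cols, domain, sigma=5.0, rng=rng)
synthetic = sample_from_marginal(noisy, n=1000, rng=rng)
```

In a real mechanism the selection step would itself be data-dependent and privacy-aware, and many overlapping marginals would be reconciled into one consistent distribution, which is the role the abstract assigns to Private-PGM.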
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Classification | Br2000 (test) | Accuracy | 79.3 | 30 |
| Classification | Adult dataset | Accuracy | 79.93 | 30 |
| Classification | LPD | Accuracy | 72.03 | 27 |
| Classification | Smoking Dataset | Accuracy | 63.29 | 24 |
| Classification | DP Scaled Datasets 2x | Accuracy | 73.31 | 21 |
| Classification | DP Scaled Datasets 3x | Accuracy | 73.59 | 21 |
| Prediction | SCM marginal shift | ROC AUC | 1 | 9 |
| Binary Classification | SCM spurious shift (test) | ROC AUC | 0.519 | 9 |