Evaluating Privacy through Synthetic Data
Abstract:
Data synthesis has long been held out as a mechanism for enabling data sharing and independent analysis without privacy restrictions. Both empirical and theoretical work has found that one cannot simultaneously generate accurate data and maintain reasonable differential privacy guarantees. This prior work has focused on the synthetic data's accuracy, that is, how close the synthetic data is to the real data. In this paper, we consider the problem of analyzing privacy-accuracy tradeoffs using synthetic data. In particular, for an $\epsilon$-differentially private mechanism $\mathcal{M}$, is the reduction in accuracy when we apply $\mathcal{M}$ to real data similar to the reduction in accuracy when we apply $\mathcal{M}$ to synthetic data? Unlike in prior analyses of synthetic data, using synthetic data as a privacy benchmark does not necessarily require closeness to the real data. Through an empirical analysis of several differentially private data synthesis mechanisms across diverse population data sets, we find that these mechanisms do not produce data that can serve as a good benchmark for privacy-preserving mechanisms. Specifically, the synthesized data appears to contain too much randomness and fails to capture enough of the dependencies in the real data to present a rich attack surface.
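To make the benchmark question concrete, one reading (using notation introduced here only for illustration) is: for a utility measure $U$, real data $D$, and synthetic data $\widetilde{D}$, compare the accuracy reductions $\Delta(D) = U(D) - U(\mathcal{M}(D))$ and $\Delta(\widetilde{D}) = U(\widetilde{D}) - U(\mathcal{M}(\widetilde{D}))$; the synthetic data serves as a good benchmark for $\mathcal{M}$ when $\Delta(\widetilde{D}) \approx \Delta(D)$ across the privacy budgets $\epsilon$ of interest. The sketch below is a minimal, hypothetical illustration of that comparison, assuming a simple Laplace-noised histogram release and placeholder data; it is not the synthesis mechanisms or population data sets evaluated in the paper.

```python
# Minimal sketch: apply the same epsilon-DP mechanism to a "real" data set and
# to a synthetic stand-in, and compare how much accuracy each loses.
# The mechanism (a Laplace-noised histogram), the utility metric, and the data
# sets are illustrative assumptions, not the paper's actual evaluation.
import numpy as np

rng = np.random.default_rng(0)


def dp_histogram(values, bins, epsilon):
    """Release per-bin counts with Laplace noise calibrated to sensitivity 1."""
    counts, _ = np.histogram(values, bins=bins)
    return counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)


def accuracy_loss(values, bins, epsilon, trials=500):
    """Average per-bin relative error of the noisy histogram over nonempty bins."""
    true_counts, _ = np.histogram(values, bins=bins)
    nonempty = true_counts > 0
    errors = []
    for _ in range(trials):
        noisy = dp_histogram(values, bins, epsilon)
        rel = np.abs(noisy - true_counts)[nonempty] / true_counts[nonempty]
        errors.append(rel.mean())
    return float(np.mean(errors))


# Hypothetical stand-ins: a skewed "real" attribute and a flatter synthetic copy.
real = rng.zipf(a=2.0, size=5_000).clip(max=50)
synthetic = rng.integers(1, 51, size=5_000)
bins = np.arange(1, 52)

for eps in (0.1, 1.0, 10.0):
    print(f"eps={eps}: loss on real = {accuracy_loss(real, bins, eps):.3f}, "
          f"loss on synthetic = {accuracy_loss(synthetic, bins, eps):.3f}")
```

In this toy setup the skewed "real" attribute has many sparse bins, so the same noise costs it more relative accuracy than the flatter synthetic copy, which is the kind of divergence that would make the synthetic data a poor benchmark for the mechanism.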
Committee:
- Anil Vullikanti, Committee Chair (CS, Biocomplexity/SEAS/UVA)
- Madhav Marathe, Advisor (CS, Biocomplexity/SEAS/UVA)
- Tianhao Wang (CS/SEAS/UVA)
- David Evans (CS/SEAS/UVA)