Controlling Privacy Loss in Sampling Schemes: An Analysis of Stratified and Cluster Sampling

Authors: Mark Bun, Jörg Drechsler, Marco Gaboardi, Audra McMillan, and Jayshree Sarathy




File

LIPIcs.FORC.2022.1.pdf
  • Filesize: 0.82 MB
  • 24 pages

Document Identifiers
  • DOI: 10.4230/LIPIcs.FORC.2022.1

Author Details

Mark Bun
  • Department of Computer Science, Boston University, MA, USA
Jörg Drechsler
  • Institute for Employment Research, Nürnberg, Germany
  • Joint Program in Survey Methodology, University of Maryland, College Park, MD, USA
Marco Gaboardi
  • Department of Computer Science, Boston University, MA, USA
Audra McMillan
  • Apple, Cupertino, CA, USA
Jayshree Sarathy
  • Harvard John A. Paulson School of Engineering and Applied Sciences, Boston, MA, USA

Cite As

Mark Bun, Jörg Drechsler, Marco Gaboardi, Audra McMillan, and Jayshree Sarathy. Controlling Privacy Loss in Sampling Schemes: An Analysis of Stratified and Cluster Sampling. In 3rd Symposium on Foundations of Responsible Computing (FORC 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 218, pp. 1:1-1:24, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022). https://doi.org/10.4230/LIPIcs.FORC.2022.1

Abstract

Sampling schemes are fundamental tools in statistics, survey design, and algorithm design. A foundational result in differential privacy is that a differentially private mechanism run on a simple random sample of a population provides stronger privacy guarantees than the same algorithm run on the entire population. In practice, however, sampling designs are often more complex than the simple, data-independent sampling schemes addressed in prior work. In this work, we extend the study of privacy amplification to more complex, data-dependent sampling schemes. We find that these sampling schemes not only often fail to amplify privacy, but can actually degrade it. We analyze the privacy implications of the pervasive cluster sampling and stratified sampling paradigms, and we also provide some insight into the study of more general sampling designs.
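For context, the amplification result mentioned above is typically stated for simple, data-independent subsampling: if each record enters the sample independently with probability q (Poisson subsampling), an (ε, δ)-DP mechanism run on the sample satisfies roughly (ln(1 + q(e^ε − 1)), qδ)-DP. The short Python sketch below evaluates this standard bound; it illustrates only the baseline amplification setting, not the paper's analysis of cluster or stratified designs, and the function name and example parameters are ours.

    import math

    def amplified_epsilon(epsilon: float, q: float) -> float:
        # Standard amplification-by-subsampling bound for Poisson subsampling:
        # an (epsilon, delta)-DP mechanism applied to a sample in which each
        # record is included independently with probability q satisfies
        # (ln(1 + q*(e^epsilon - 1)), q*delta)-DP.
        return math.log(1.0 + q * (math.exp(epsilon) - 1.0))

    # Example: a base mechanism with epsilon = 1.0 on a 5% Poisson sample.
    print(amplified_epsilon(1.0, 0.05))  # ≈ 0.0824, much smaller than 1.0

By contrast, the paper's point is that data-dependent designs such as cluster and stratified sampling need not enjoy a bound of this kind and can even increase the effective privacy loss.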

Subject Classification

ACM Subject Classification
  • Security and privacy → Privacy protections
Keywords
  • privacy
  • differential privacy
  • survey design
  • survey sampling

