Differential Privacy for Coverage Analysis of Software Traces

Authors: Yu Hao, Sufian Latif, Hailong Zhang, Raef Bassily, and Atanas Rountev




File

LIPIcs.ECOOP.2021.8.pdf
  • Filesize: 1.06 MB
  • 25 pages

Author Details

Yu Hao
  • Ohio State University, Columbus, OH, USA
Sufian Latif
  • Ohio State University, Columbus, OH, USA
Hailong Zhang
  • Fordham University, New York, NY, USA
Raef Bassily
  • Ohio State University, Columbus, OH, USA
Atanas Rountev
  • Ohio State University, Columbus, OH, USA

Acknowledgements

We thank the ECOOP reviewers for their valuable feedback.

Cite As

Yu Hao, Sufian Latif, Hailong Zhang, Raef Bassily, and Atanas Rountev. Differential Privacy for Coverage Analysis of Software Traces. In 35th European Conference on Object-Oriented Programming (ECOOP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 194, pp. 8:1-8:25, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ECOOP.2021.8

Abstract

This work considers software execution traces, where a trace is a sequence of run-time events. Each user of a software system collects the set of traces covered by her execution of the software and reports this set to an analysis server. Our goal is to report the local data of each user in a privacy-preserving manner by employing local differential privacy, a powerful theoretical framework for designing privacy-preserving data analyses. A significant advantage of such analysis is that it offers principled "built-in" privacy with clearly-defined and quantifiable privacy protections. In local differential privacy, the data of an individual user is modified using a local randomizer before being sent to the untrusted analysis server. Based on the randomized information from all users, the analysis server computes, for each trace, an estimate of how many users have covered it.

Such analysis requires that the domain of possible traces be defined ahead of time. Unlike in prior related work, here the domain is either infinite or, at best, restricted to many billions of elements. Further, the traces in this domain typically have structure defined by the static properties of the software. To capture these novel aspects, we define the trace domain with the help of context-free grammars. We illustrate this approach with two exemplars: a call chain analysis in which traces are described by a regular language, and an enter/exit trace analysis in which traces are described by a balanced-parentheses context-free language.

Randomization over such domains is challenging due to their large size, which makes it impossible to use prior randomization techniques. To solve this problem, we propose to use count sketch, a fixed-size hashing data structure for summarizing frequent items. We develop a version of count sketch for trace analysis and demonstrate its suitability for software execution data.
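To make the count-sketch structure concrete, the following is a minimal Python sketch of the basic data structure (the row/column sizes and the SHA-256-based hashing are illustrative choices of ours, not the paper's implementation):

```python
import hashlib

class CountSketch:
    """Fixed-size sketch for frequency estimation over a huge trace domain.

    A minimal illustration of the count-sketch idea referenced in the
    abstract; parameters and hashing scheme are chosen for clarity.
    """

    def __init__(self, rows=5, cols=256):
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]

    def _hash(self, trace, r):
        # Derive a per-row bucket and sign from a deterministic hash;
        # this stands in for independent pairwise-independent hash functions.
        digest = hashlib.sha256(f"{r}|{trace}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % self.cols
        sign = 1 if digest[8] & 1 else -1
        return bucket, sign

    def update(self, trace, count=1):
        # Add a signed contribution to one counter per row.
        for r in range(self.rows):
            bucket, sign = self._hash(trace, r)
            self.table[r][bucket] += sign * count

    def estimate(self, trace):
        # Median of the per-row signed counters; robust to collisions.
        ests = []
        for r in range(self.rows):
            bucket, sign = self._hash(trace, r)
            ests.append(sign * self.table[r][bucket])
        ests.sort()
        return ests[len(ests) // 2]
```

The random signs make collision noise cancel in expectation, and taking the median across rows keeps a single unlucky collision from dominating the estimate.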
In addition, instead of separately randomizing each contribution to the sketch, we develop a much faster one-shot randomization of the accumulated sketch data. One important client of the collected information is the identification of high-frequency ("hot") traces. We develop a novel approach to identify hot traces from the collected randomized sketches. A key insight is that the very large domain of possible traces can be explored efficiently for hot traces by using the frequency estimates of a visited trace and of its prefixes and suffixes. Our experimental study of both call chain analysis and enter/exit trace analysis indicates that the frequency estimates, as well as the identification of hot traces, achieve both high accuracy and strong privacy.
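The prefix-guided exploration of the trace domain can be sketched as a breadth-first search that extends a candidate trace only while its estimated frequency stays above a threshold. Here `estimate` is a hypothetical frequency oracle (for example, a query against the aggregated randomized sketches), and the pruning relies on the observation that a covered trace's prefixes are also covered, as in call chains; the paper's actual algorithm additionally uses suffix estimates and differs in detail.

```python
def find_hot_traces(estimate, events, max_len, threshold):
    """Breadth-first search over the trace domain.

    Extends a candidate trace by one event at a time, but only while its
    estimated frequency is at least `threshold`, pruning the enormous
    space of possible traces. `estimate(trace)` is a hypothetical
    frequency oracle, e.g. backed by the collected noisy sketches.
    """
    hot = []
    frontier = [(e,) for e in events]
    while frontier:
        next_frontier = []
        for trace in frontier:
            if estimate(trace) >= threshold:
                hot.append(trace)
                if len(trace) < max_len:
                    # Extend the hot trace by one event in every possible way.
                    next_frontier.extend(trace + (e,) for e in events)
        frontier = next_frontier
    return hot
```

Because a cold prefix is never extended, the number of oracle queries grows with the number of hot traces rather than with the size of the full trace domain.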

Subject Classification

ACM Subject Classification
  • Software and its engineering → Dynamic analysis
  • Security and privacy → Privacy-preserving protocols
Keywords
  • Trace Profiling
  • Differential Privacy
  • Program Analysis
