String Sanitization Under Edit Distance: Improved and Generalized

Authors: Takuya Mieno, Solon P. Pissis, Leen Stougie, Michelle Sweering




Author Details

Takuya Mieno
  • Kyushu University, Fukuoka, Japan
  • Japan Society for the Promotion of Science, Tokyo, Japan
Solon P. Pissis
  • CWI, Amsterdam, The Netherlands
  • Vrije Universiteit, Amsterdam, The Netherlands
Leen Stougie
  • CWI, Amsterdam, The Netherlands
  • Vrije Universiteit, Amsterdam, The Netherlands
Michelle Sweering
  • CWI, Amsterdam, The Netherlands

Acknowledgements

We wish to thank Grigorios Loukides (King’s College London) for useful discussions about improving the presentation of this manuscript.

Cite As

Takuya Mieno, Solon P. Pissis, Leen Stougie, and Michelle Sweering. String Sanitization Under Edit Distance: Improved and Generalized. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 19:1-19:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/LIPIcs.CPM.2021.19

Abstract

Let W be a string of length n over an alphabet Σ, k be a positive integer, and 𝒮 be a set of length-k substrings of W. The ETFS problem (Edit distance, Total order, Frequency, Sanitization) asks us to construct a string X_ED such that: (i) no string of 𝒮 occurs in X_ED; (ii) the order of all other length-k substrings over Σ (and thus the frequency) is the same in W and in X_ED; and (iii) X_ED has minimal edit distance to W. When W represents an individual’s data and 𝒮 represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019].
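Conditions (i) and (ii) can be checked mechanically for a candidate string. The following is a minimal sketch, not the authors' code: it assumes that the output may contain a separator letter `#` outside Σ and that "length-k substrings over Σ" means the windows of the output that do not contain this separator. The function names and the toy input are hypothetical. Condition (iii), minimality of the edit distance, is not checked here.

```python
def length_k_windows(s, k):
    """Return all length-k substrings of s, in left-to-right order."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def satisfies_i_and_ii(w, x, k, sensitive, sep="#"):
    """Check conditions (i) and (ii) for a candidate output x.

    Assumption: sep is a letter outside the alphabet of w, and the
    length-k substrings "over the alphabet" are the windows without sep.
    """
    sensitive = set(sensitive)
    x_windows = length_k_windows(x, k)

    # (i) no sensitive pattern occurs in x
    if any(win in sensitive for win in x_windows):
        return False

    # (ii) the non-sensitive windows appear in the same order (hence with
    # the same frequencies) in w and in x
    kept_w = [win for win in length_k_windows(w, k) if win not in sensitive]
    kept_x = [win for win in x_windows if sep not in win]
    return kept_w == kept_x


if __name__ == "__main__":
    w, k, s = "abcab", 2, {"bc"}
    print(satisfies_i_and_ii(w, "ab#cab", k, s))  # candidate with one inserted separator
    print(satisfies_i_and_ii(w, "abcab", k, s))   # w itself still contains "bc"
```

In this toy instance the candidate `ab#cab` keeps the windows `ab`, `ca`, `ab` in order while breaking the occurrence of `bc`, whereas `abcab` itself violates condition (i).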
ETFS can be solved in 𝒪(n²k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in 𝒪(n^{2-δ}) time, for any δ > 0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows:  
- An 𝒪(n²log²k)-time algorithm to solve ETFS.
- An 𝒪(n²log²n)-time algorithm to solve AETFS (Arbitrary lengths, Edit distance, Total order, Frequency, Sanitization), a generalization of ETFS in which the elements of 𝒮 can have arbitrary lengths.
Our algorithms are thus optimal up to subpolynomial factors, unless SETH fails.
In order to arrive at these results, we develop new techniques for computing a variant of the standard dynamic programming (DP) table for edit distance. In particular, we simulate the DP table computation using a directed acyclic graph in which every node is assigned to a smaller DP table. We then focus on redundancy in these DP tables and exploit a tabulation technique according to dyadic intervals to obtain an optimal alignment in 𝒪̃(n²) total time. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.
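For orientation, the two generic ingredients named above can be sketched as follows. This is not the paper's algorithm: `edit_distance` is the textbook quadratic DP (the paper computes a variant of it), and `dyadic_cover` only shows the standard way of splitting an index range into dyadic intervals, which is the kind of decomposition that tabulation by dyadic intervals relies on. Both function names are hypothetical.

```python
def edit_distance(a, b):
    """Textbook DP for edit distance (insert, delete, substitute; unit cost).

    Keeps only two rows of the (len(a)+1) x (len(b)+1) table.
    """
    prev = list(range(len(b) + 1))  # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]  # aligning a[:i] with the empty prefix of b costs i deletions
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def dyadic_cover(lo, hi):
    """Split the half-open range [lo, hi) into maximal dyadic intervals.

    A dyadic interval has the form [j * 2**t, (j + 1) * 2**t).
    """
    out = []
    while lo < hi:
        # Largest power of two that divides lo (any power of two divides 0).
        size = (lo & -lo) if lo else (1 << hi.bit_length())
        while lo + size > hi:
            size >>= 1
        out.append((lo, lo + size))
        lo += size
    return out


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(dyadic_cover(3, 11))
```

Any range of length m is covered by 𝒪(log m) dyadic intervals, which is why precomputing answers per dyadic interval typically adds only logarithmic factors to the running time.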

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • string algorithms
  • data sanitization
  • edit distance
  • dynamic programming


References

  1. Osman Abul, Francesco Bonchi, and Fosca Giannotti. Hiding sequential and spatiotemporal patterns. IEEE Trans. Knowl. Data Eng., 22(12):1709-1723, 2010. URL: https://doi.org/10.1109/TKDE.2009.213.
  2. Osman Abul and Harun Gökçe. Knowledge hiding from tree and graph databases. Data Knowl. Eng., 72:148-171, 2012. URL: https://doi.org/10.1016/j.datak.2011.10.002.
  3. Alok Aggarwal, Maria M. Klawe, Shlomo Moran, Peter Shor, and Robert Wilber. Geometric applications of a matrix-searching algorithm. Algorithmica, 2(1-4):195-208, 1987.
  4. Charu C. Aggarwal. Applications of frequent pattern mining. In Charu C. Aggarwal and Jiawei Han, editors, Frequent Pattern Mining, pages 443-467. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-07821-2_18.
  5. Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Philip S. Yu and Arbee L. P. Chen, editors, Proceedings of the Eleventh International Conference on Data Engineering, March 6-10, 1995, Taipei, Taiwan, pages 3-14. IEEE Computer Society, 1995. URL: https://doi.org/10.1109/ICDE.1995.380415.
  6. Vladimir L. Arlazarov, Yefim A. Dinitz, M. A. Kronrod, and Igor A. Faradzhev. On economical construction of the transitive closure of an oriented graph. Doklady Akademii Nauk, 194(3):487-488, 1970.
  7. Giulia Bernardini, Huiping Chen, Alessio Conte, Roberto Grossi, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. String sanitization: A combinatorial approach. In Ulf Brefeld, Élisa Fromont, Andreas Hotho, Arno J. Knobbe, Marloes H. Maathuis, and Céline Robardet, editors, Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2019, Würzburg, Germany, September 16-20, 2019, Proceedings, Part I, volume 11906 of Lecture Notes in Computer Science, pages 627-644. Springer, 2019. URL: https://doi.org/10.1007/978-3-030-46150-8_37.
  8. Giulia Bernardini, Huiping Chen, Alessio Conte, Roberto Grossi, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone, and Michelle Sweering. Combinatorial algorithms for string sanitization. ACM Trans. Knowl. Discov. Data, 15(1), 2020. URL: https://doi.org/10.1145/3418683.
  9. Giulia Bernardini, Huiping Chen, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Leen Stougie, and Michelle Sweering. String Sanitization Under Edit Distance. In Inge Li Gørtz and Oren Weimann, editors, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020), volume 161 of Leibniz International Proceedings in Informatics (LIPIcs), pages 7:1-7:14, Dagstuhl, Germany, 2020. Schloss Dagstuhl-Leibniz-Zentrum für Informatik. URL: https://doi.org/10.4230/LIPIcs.CPM.2020.7.
  10. Giulia Bernardini, Alessio Conte, Garance Gourdel, Roberto Grossi, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Giulia Punzi, Leen Stougie, and Michelle Sweering. Hide and mine in strings: Hardness and algorithms. In Claudia Plant, Haixun Wang, Alfredo Cuzzocrea, Carlo Zaniolo, and Xindong Wu, editors, 20th IEEE International Conference on Data Mining, ICDM 2020, Sorrento, Italy, November 17-20, 2020, pages 924-929. IEEE, 2020. URL: https://doi.org/10.1109/ICDM50108.2020.00103.
  11. Luca Bonomi, Liyue Fan, and Hongxia Jin. An information-theoretic approach to individual sequential data sanitization. In Paul N. Bennett, Vanja Josifovski, Jennifer Neville, and Filip Radlinski, editors, Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, San Francisco, CA, USA, February 22-25, 2016, pages 337-346. ACM, 2016. URL: https://doi.org/10.1145/2835776.2835828.
  12. Karl Bringmann and Marvin Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In Venkatesan Guruswami, editor, IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 79-97. IEEE Computer Society, 2015. URL: https://doi.org/10.1109/FOCS.2015.15.
  13. Brian Brubach and Jay Ghurye. A succinct four Russians speedup for edit distance computation and one-against-many banded alignment. In Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
  14. Meng Chen, Xiaohui Yu, and Yang Liu. Mining moving patterns for predicting next location. Inf. Syst., 54:156-168, 2015. URL: https://doi.org/10.1016/j.is.2015.07.001.
  15. Chris Clifton and Don Marks. Security and privacy implications of data mining. In ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 15-19, 1996.
  16. Maxime Crochemore, Gad M. Landau, and Michal Ziv-Ukelson. A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM Journal on Computing, 32(6):1654-1673, 2003.
  17. Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538-544, 1984. URL: https://doi.org/10.1145/828.1884.
  18. Aris Gkoulalas-Divanis and Grigorios Loukides. Revisiting sequential pattern hiding to enhance utility. In Chid Apté, Joydeep Ghosh, and Padhraic Smyth, editors, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, pages 1316-1324. ACM, 2011. URL: https://doi.org/10.1145/2020408.2020605.
  19. Aris Gkoulalas-Divanis and Vassilios S. Verykios. An integer programming approach for frequent itemset hiding. In Philip S. Yu, Vassilis J. Tsotras, Edward A. Fox, and Bing Liu, editors, Proceedings of the 2006 ACM CIKM International Conference on Information and Knowledge Management, Arlington, Virginia, USA, November 6-11, 2006, pages 748-757. ACM, 2006. URL: https://doi.org/10.1145/1183614.1183721.
  20. Aris Gkoulalas-Divanis and Vassilios S. Verykios. Exact knowledge hiding through database extension. IEEE Trans. Knowl. Data Eng., 21(5):699-713, 2009. URL: https://doi.org/10.1109/TKDE.2008.199.
  21. Robert Gwadera, Aris Gkoulalas-Divanis, and Grigorios Loukides. Permutation-based sequential pattern hiding. In Hui Xiong, George Karypis, Bhavani M. Thuraisingham, Diane J. Cook, and Xindong Wu, editors, 2013 IEEE 13th International Conference on Data Mining, Dallas, TX, USA, December 7-10, 2013, pages 241-250. IEEE Computer Society, 2013. URL: https://doi.org/10.1109/ICDM.2013.57.
  22. Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. J. Comput. Syst. Sci., 62(2):367-375, 2001. URL: https://doi.org/10.1006/jcss.2000.1727.
  23. Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? J. Comput. Syst. Sci., 63(4):512-530, 2001. URL: https://doi.org/10.1006/jcss.2001.1774.
  24. Philip N. Klein. Multiple-source shortest paths in planar graphs. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 146-155. Society for Industrial and Applied Mathematics, 2005.
  25. Daniel C. Koboldt, Karyn M. Steinberg, David E. Larson, Richard K. Wilson, and Elaine R. Mardis. The next-generation sequencing revolution and its impact on genomics. Cell, 155(1):27-38, 2013.
  26. Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, 1966.
  27. Grigorios Loukides and Robert Gwadera. Optimal event sequence sanitization. In Suresh Venkatasubramanian and Jieping Ye, editors, Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, BC, Canada, April 30 - May 2, 2015, pages 775-783. SIAM, 2015. URL: https://doi.org/10.1137/1.9781611974010.87.
  28. Eugene W. Myers and Webb Miller. Approximate matching of regular expressions. Bulletin of Mathematical Biology, 51(1):5-37, 1989.
  29. Stanley R. M. Oliveira and Osmar R. Zaïane. Protecting sensitive knowledge by data sanitization. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), 19-22 December 2003, Melbourne, Florida, USA, pages 613-616. IEEE Computer Society, 2003. URL: https://doi.org/10.1109/ICDM.2003.1250990.
  30. Jeanette P. Schmidt. All highest scoring paths in weighted grid graphs and their application to finding all approximate repeats in strings. SIAM Journal on Computing, 27(4):972-992, 1998.
  31. Vassilios S. Verykios, Ahmed K. Elmagarmid, Elisa Bertino, Yücel Saygin, and Elena Dasseni. Association rule hiding. IEEE Trans. Knowl. Data Eng., 16(4):434-447, 2004. URL: https://doi.org/10.1109/TKDE.2004.1269668.
  32. Yi-Hung Wu, Chia-Ming Chiang, and Arbee L. P. Chen. Hiding sensitive association rules with limited side effects. IEEE Trans. Knowl. Data Eng., 19(1):29-42, 2007. URL: https://doi.org/10.1109/TKDE.2007.250583.
  33. Josh Jia-Ching Ying, Wang-Chien Lee, Tz-Chiao Weng, and Vincent S. Tseng. Semantic trajectory mining for location prediction. In Isabel F. Cruz, Divyakant Agrawal, Christian S. Jensen, Eyal Ofek, and Egemen Tanin, editors, 19th ACM SIGSPATIAL International Symposium on Advances in Geographic Information Systems, ACM-GIS 2011, November 1-4, 2011, Chicago, IL, USA, Proceedings, pages 34-43. ACM, 2011. URL: https://doi.org/10.1145/2093973.2093980.