String Sanitization Under Edit Distance

Authors: Giulia Bernardini, Huiping Chen, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Leen Stougie, Michelle Sweering

Author Details

Giulia Bernardini
  • University of Milano - Bicocca, Milan, Italy
Huiping Chen
  • King’s College London, UK
Grigorios Loukides
  • King’s College London, UK
Nadia Pisanti
  • University of Pisa, Italy
  • ERABLE Team, Lyon, France
Solon P. Pissis
  • CWI, Amsterdam, The Netherlands
  • Vrije Universiteit, Amsterdam, The Netherlands
  • ERABLE Team, Lyon, France
Leen Stougie
  • CWI, Amsterdam, The Netherlands
  • Vrije Universiteit, Amsterdam, The Netherlands
  • ERABLE Team, Lyon, France
Michelle Sweering
  • CWI, Amsterdam, The Netherlands


The authors would like to thank Takuya Mieno (Kyushu University) for proofreading the manuscript.

Cite As

Giulia Bernardini, Huiping Chen, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Leen Stougie, and Michelle Sweering. String Sanitization Under Edit Distance. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


Abstract

Let W be a string of length n over an alphabet Σ, k be a positive integer, and 𝒮 be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of 𝒮 occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual’s data and 𝒮 represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in 𝒪(kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in 𝒪(n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Künnemann, FOCS 2015], to ETFS.
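
To make the problem statement concrete, the following is a minimal Python sketch of two building blocks the abstract mentions. It is not the authors' 𝒪(kn²) algorithm. The function edit_distance is the classic quadratic dynamic program for unit-cost edit distance, and is_feasible checks conditions (i) and (ii) for a candidate string under one reading of the abstract. The function names, the sigma parameter, and the use of '#' as a letter outside Σ in the example are assumptions made for illustration only.

```python
def edit_distance(a, b):
    """Classic dynamic program with unit-cost insert, delete, substitute."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # match or substitute
            ))
        prev = curr
    return prev[-1]


def kmers(s, k):
    """All length-k substrings of s, in order of occurrence."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def is_feasible(w, x, k, sensitive, sigma):
    """Check conditions (i) and (ii) for candidate x (assumed reading).

    (i)  no sensitive length-k string occurs in x;
    (ii) the sequence of length-k substrings of x made only of letters
         in sigma equals the sequence of non-sensitive length-k
         substrings of w.
    """
    x_kmers = kmers(x, k)
    if any(p in sensitive for p in x_kmers):
        return False
    over_sigma = [p for p in x_kmers if set(p) <= sigma]
    kept = [p for p in kmers(w, k) if p not in sensitive]
    return over_sigma == kept


if __name__ == "__main__":
    w, k, sensitive, sigma = "abab", 2, {"ba"}, set("ab")
    x = "ab#ab"  # '#' is a hypothetical separator letter outside sigma
    print(is_feasible(w, x, k, sensitive, sigma), edit_distance(w, x))
```

In this toy instance the candidate "ab#ab" satisfies (i) and (ii) and lies at edit distance 1 from W = "abab". Since W itself contains the sensitive substring "ba", no feasible string lies at distance 0, so this candidate also satisfies (iii) here. Finding such a string in general is the optimization problem the paper addresses.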

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching

Keywords
  • String algorithms
  • data sanitization
  • edit distance
  • dynamic programming
  • conditional lower bound




References

  1. O. Abul, F. Bonchi, and F. Giannotti. Hiding sequential and spatiotemporal patterns. IEEE Transactions on Knowledge and Data Engineering, 22(12):1709-1723, 2010.
  2. A. Backurs and P. Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In 47th ACM Annual Symposium on Theory of Computing (STOC), pages 51-58, 2015.
  3. G. Bernardini, H. Chen, A. Conte, R. Grossi, G. Loukides, N. Pisanti, S. Pissis, and G. Rosone. String sanitization: A combinatorial approach. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pages 627-644, 2019.
  4. G. Bernardini, H. Chen, A. Conte, R. Grossi, G. Loukides, N. Pisanti, S. Pissis, G. Rosone, and M. Sweering. Combinatorial algorithms for string sanitization. arXiv, 2019.
  5. G. Bernardini, H. Chen, G. Fici, G. Loukides, and S. P. Pissis. Reverse-safe data structures for text indexing. In Symposium on Algorithm Engineering and Experiments (ALENEX), pages 199-213, 2020.
  6. L. Bonomi, L. Fan, and H. Jin. An information-theoretic approach to individual sequential data sanitization. In 9th ACM International Conference on Web Search and Data Mining (WSDM), pages 337-346, 2016.
  7. K. Bringmann and M. Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In 56th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 79-97, 2015.
  8. U.S. Department of Health & Human Services. Health Insurance Portability and Accountability Act, 1996. URL:
  9. R. Gwadera, A. Gkoulalas-Divanis, and G. Loukides. Permutation-based sequential pattern hiding. In 13th IEEE International Conference on Data Mining (ICDM), pages 241-250, 2013.
  10. R. Impagliazzo and R. Paturi. On the complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367-375, 2001.
  11. R. Impagliazzo, R. Paturi, and F. Zane. Which problems have strongly exponential complexity? Journal of Computer and System Sciences, 63(4):512-530, 2001.
  12. L. Jin, C. Li, and R. Vernica. SEPIA: estimating selectivities of approximate string predicates in large databases. The VLDB Journal, 17(5):1213-1229, 2008.
  13. V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, 1966.
  14. A. Liu, K. Zheng, L. Li, G. Liu, L. Zhao, and X. Zhou. Efficient secure similarity computation on encrypted trajectory data. In 31st IEEE International Conference on Data Engineering (ICDE), pages 66-77, 2015.
  15. G. Loukides and R. Gwadera. Optimal event sequence sanitization. In SIAM International Conference on Data Mining (SDM), pages 775-783, 2015.
  16. W. Lu, X. Du, M. Hadjieleftheriou, and B. C. Ooi. Efficiently supporting edit distance based string similarity search using B^+-trees. IEEE Transactions on Knowledge and Data Engineering, 26(12):2983-2996, 2014.
  17. B. Malin and L. Sweeney. Determining the identifiability of DNA database entries. In American Medical Informatics Association Annual Symposium (AMIA), pages 537-541, 2000.
  18. E. W. Myers and W. Miller. Approximate matching of regular expressions. Bulletin of Mathematical Biology, 51(1):5-37, 1989.
  19. European Parliament. General Data Protection Regulation. URL:
  20. G. Poulis, S. Skiadopoulos, G. Loukides, and A. Gkoulalas-Divanis. Apriori-based algorithms for k^m-anonymizing trajectory data. Transactions on Data Privacy, 7:165-194, 2014.
  21. J. Shang, J. Peng, and J. Han. MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under edit distance. In SIAM International Conference on Data Mining (SDM), pages 558-566, 2016.
  22. H. J. Smith, T. Dinev, and H. Xu. Information privacy research: An interdisciplinary review. MIS Quarterly, 35(4):989-1015, 2011.
  23. M. Terrovitis, G. Poulis, N. Mamoulis, and S. Skiadopoulos. Local suppression and splitting techniques for privacy preserving publication of trajectories. IEEE Transactions on Knowledge and Data Engineering, 29(7):1466-1479, 2017.
  24. Z. Wen, D. Deng, R. Zhang, and R. Kotagiri. 2ED: An Efficient Entity Extraction Algorithm using two-level Edit-Distance. In 35th IEEE International Conference on Data Engineering (ICDE), pages 998-1009, 2019.