String Sanitization Under Edit Distance

Bernardini, Giulia; Chen, Huiping; Loukides, Grigorios; Pisanti, Nadia; Pissis, Solon P.; Stougie, Leen; Sweering, Michelle

doi:10.4230/LIPIcs.CPM.2020.7

File

LIPIcs.CPM.2020.7.pdf

Filesize: 0.83 MB
14 pages

Document Identifiers

DOI: 10.4230/LIPIcs.CPM.2020.7
URN: urn:nbn:de:0030-drops-121324

Author Details

Giulia Bernardini

University of Milano - Bicocca, Milan, Italy

Huiping Chen

King’s College London, UK

Grigorios Loukides

King’s College London, UK

Nadia Pisanti

University of Pisa, Italy
ERABLE Team, Lyon, France

Solon P. Pissis

CWI, Amsterdam, The Netherlands
Vrije Universiteit, Amsterdam, The Netherlands
ERABLE Team, Lyon, France

Leen Stougie

CWI, Amsterdam, The Netherlands
Vrije Universiteit, Amsterdam, The Netherlands
ERABLE Team, Lyon, France

Michelle Sweering

CWI, Amsterdam, The Netherlands

Acknowledgements

The authors would like to thank Takuya Mieno (Kyushu University) for proofreading the manuscript.

Cite As

Giulia Bernardini, Huiping Chen, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Leen Stougie, and Michelle Sweering. String Sanitization Under Edit Distance. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.CPM.2020.7

Abstract

Let W be a string of length n over an alphabet Σ, k be a positive integer, and 𝒮 be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of 𝒮 occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual’s data and 𝒮 represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in 𝒪(kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in 𝒪(n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Künnemann, FOCS 2015], to ETFS.
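As a rough illustration of the ingredients named in the abstract, the sketch below gives the classic quadratic dynamic program for edit distance (the baseline that the paper's algorithm modifies, not the ETFS algorithm itself) together with a checker for conditions (i) and (ii). The function names, the choice of Σ as the set of letters occurring in W, and the use of a separator letter '#' outside Σ in the example are assumptions made for illustration; they are not taken from the abstract.

```python
from typing import List, Set


def edit_distance(a: str, b: str) -> int:
    """Classic O(|a|*|b|) dynamic program with unit-cost edits."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def kmers_over_sigma(s: str, k: int, sigma: Set[str]) -> List[str]:
    """Length-k substrings of s, left to right, using only letters of sigma."""
    return [s[i:i + k] for i in range(len(s) - k + 1)
            if set(s[i:i + k]) <= sigma]


def is_feasible(w: str, x: str, k: int, sensitive: Set[str]) -> bool:
    """Check conditions (i) and (ii) for a candidate x (hypothetical helper).

    Assumption: sigma is taken to be the set of letters occurring in w.
    """
    sigma = set(w)
    wanted = [p for p in kmers_over_sigma(w, k, sigma) if p not in sensitive]
    found = kmers_over_sigma(x, k, sigma)
    no_sensitive = all(p not in sensitive for p in found)  # condition (i)
    same_order = found == wanted                           # condition (ii)
    return no_sensitive and same_order


if __name__ == "__main__":
    w, k, sensitive = "aabaa", 2, {"ab"}
    x = "aa#baa"  # candidate output; '#' is assumed not to belong to sigma
    print(is_feasible(w, x, k, sensitive), edit_distance(w, x))
```

The checker only tests feasibility. Condition (iii), minimality of the edit distance over all feasible strings, is the part that requires the paper's modified dynamic program and is not attempted here.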

Subject Classification

ACM Subject Classification

Theory of computation → Pattern matching

Keywords

String algorithms
data sanitization
edit distance
dynamic programming
conditional lower bound

References

O. Abul, F. Bonchi, and F. Giannotti. Hiding sequential and spatiotemporal patterns. IEEE Transactions on Knowledge and Data Engineering, 22(12):1709-1723, 2010.
A. Backurs and P. Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In 47th ACM Annual Symposium on Theory of Computing (STOC), pages 51-58, 2015.
G. Bernardini, H. Chen, A. Conte, R. Grossi, G. Loukides, N. Pisanti, S. Pissis, and G. Rosone. String sanitization: A combinatorial approach. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pages 627-644, 2019.
G. Bernardini, H. Chen, A. Conte, R. Grossi, G. Loukides, N. Pisanti, S. Pissis, G. Rosone, and M. Sweering. Combinatorial algorithms for string sanitization. arXiv, 2019.
G. Bernardini, H. Chen, G. Fici, G. Loukides, and S. P. Pissis. Reverse-safe data structures for text indexing. In Symposium on Algorithm Engineering and Experiments (ALENEX), pages 199-213, 2020.
L. Bonomi, L. Fan, and H. Jin. An information-theoretic approach to individual sequential data sanitization. In 9th ACM International Conference on Web Search and Data Mining (WSDM), pages 337-346, 2016.
K. Bringmann and M. Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In 56th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 79-97, 2015.
U.S. Department of Health & Human Services. Health Insurance Portability and Accountability Act, 1996. URL: https://aspe.hhs.gov/report/health-insurance-portability-and-accountability-act-1996.
R. Gwadera, A. Gkoulalas-Divanis, and G. Loukides. Permutation-based sequential pattern hiding. In 13th IEEE International Conference on Data Mining (ICDM), pages 241-250, 2013.
R. Impagliazzo and R. Paturi. On the complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367-375, 2001.
R. Impagliazzo, R. Paturi, and F. Zane. Which problems have strongly exponential complexity? Journal of Computer and System Sciences, 63(4):512-530, 2001.
L. Jin, C. Li, and R. Vernica. SEPIA: estimating selectivities of approximate string predicates in large databases. The VLDB Journal, 17(5):1213-1229, 2008.
V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, 1966.
A. Liu, K. Zheng, L. Li, G. Liu, L. Zhao, and X. Zhou. Efficient secure similarity computation on encrypted trajectory data. In 31st IEEE International Conference on Data Engineering (ICDE), pages 66-77, 2015.
G. Loukides and R. Gwadera. Optimal event sequence sanitization. In SIAM International Conference on Data Mining (SDM), pages 775-783, 2015.
W. Lu, X. Du, M. Hadjieleftheriou, and B. C. Ooi. Efficiently supporting edit distance based string similarity search using B^+-trees. IEEE Transactions on Knowledge and Data Engineering, 26(12):2983-2996, 2014.
B. Malin and L. Sweeney. Determining the identifiability of DNA database entries. In American Medical Informatics Association Annual Symposium (AMIA), pages 537-541, 2000.
E. W. Myers and W. Miller. Approximate matching of regular expressions. Bulletin of Mathematical Biology, 51(1):5-37, 1989.
European Parliament. General Data Protection Regulation. URL: http://data.consilium.europa.eu/doc/document/ST-9565-2015-INIT/en/pdf.
G. Poulis, S. Skiadopoulos, G. Loukides, and A. Gkoulalas-Divanis. Apriori-based algorithms for k^m-anonymizing trajectory data. Transactions on Data Privacy, 7:165-194, 2014.
J. Shang, J. Peng, and J. Han. MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under edit distance. In SIAM International Conference on Data Mining (SDM), pages 558-566, 2016.
H. J. Smith, T. Dinev, and H. Xu. Information privacy research: An interdisciplinary review. MIS Quarterly, 35(4):989-1015, 2011.
M. Terrovitis, G. Poulis, N. Mamoulis, and S. Skiadopoulos. Local suppression and splitting techniques for privacy preserving publication of trajectories. IEEE Transactions on Knowledge and Data Engineering, 29(7):1466-1479, 2017.
Z. Wen, D. Deng, R. Zhang, and R. Kotagiri. 2ED: An Efficient Entity Extraction Algorithm using two-level Edit-Distance. In 35th IEEE International Conference on Data Engineering (ICDE), pages 998-1009, 2019.