Document Open Access Logo

Pattern Masking for Dictionary Matching

Authors Panagiotis Charalampopoulos , Huiping Chen , Peter Christen , Grigorios Loukides , Nadia Pisanti , Solon P. Pissis , Jakub Radoszewski



PDF
Thumbnail PDF

File

LIPIcs.ISAAC.2021.65.pdf
  • Filesize: 1.32 MB
  • 19 pages

Document Identifiers

Author Details

Panagiotis Charalampopoulos
  • The Interdisciplinary Center Herzliya, Israel
Huiping Chen
  • King’s College London, UK
Peter Christen
  • Australian National University, Canberra, Australia
Grigorios Loukides
  • King’s College London, UK
Nadia Pisanti
  • University of Pisa, Italy
Solon P. Pissis
  • CWI, Amsterdam, The Netherlands
  • Vrije Universiteit, Amsterdam, The Netherlands
Jakub Radoszewski
  • University of Warsaw, Poland

Cite AsGet BibTex

Panagiotis Charalampopoulos, Huiping Chen, Peter Christen, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, and Jakub Radoszewski. Pattern Masking for Dictionary Matching. In 32nd International Symposium on Algorithms and Computation (ISAAC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 212, pp. 65:1-65:19, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ISAAC.2021.65

Abstract

Data masking is a common technique for sanitizing sensitive data maintained in database systems, and it is also becoming increasingly important in various application areas, such as in record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem. In PMDM, we are given a dictionary 𝒟 of d strings, each of length 𝓁, a query string q of length 𝓁, and a positive integer z, and we are asked to compute a smallest set K ⊆ {1,…,𝓁}, so that if q[i] is replaced by a wildcard for all i ∈ K, then q matches at least z strings from 𝒟. Solving PMDM allows providing data utility guarantees as opposed to existing approaches. We first show, through a reduction from the well-known k-Clique problem, that a decision version of the PMDM problem is NP-complete, even for strings over a binary alphabet. We thus approach the problem from a more practical perspective. We show a combinatorial 𝒪((d𝓁)^{|K|/3}+d𝓁)-time and 𝒪(d𝓁)-space algorithm for PMDM for |K| = 𝒪(1). In fact, we show that we cannot hope for a faster combinatorial algorithm, unless the combinatorial k-Clique hypothesis fails [Abboud et al., SIAM J. Comput. 2018; Lincoln et al., SODA 2018]. We also generalize this algorithm for the problem of masking multiple query strings simultaneously so that every string has at least z matches in 𝒟. Note that PMDM can be viewed as a generalization of the decision version of the dictionary matching with mismatches problem: by querying a PMDM data structure with string q and z = 1, one obtains the minimal number of mismatches of q with any string from 𝒟. The query time or space of all known data structures for the more restricted problem of dictionary matching with at most k mismatches incurs some exponential factor with respect to k. A simple exact algorithm for PMDM runs in time 𝒪(2^𝓁 d). We present a data structure for PMDM that answers queries over 𝒟 in time 𝒪(2^{𝓁/2}(2^{𝓁/2}+τ)𝓁) and requires space 𝒪(2^𝓁 d²/τ²+2^{𝓁/2}d), for any parameter τ ∈ [1,d]. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáč et al., SODA 2017]. This gives a polynomial-time 𝒪(d^{1/4+ε})-approximation algorithm for PMDM, which is tight under a plausible complexity conjecture.

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • string algorithms
  • dictionary matching
  • wildcards
  • record linkage
  • query term dropping

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Secure critical data with Oracle Data Safe (white paper). https://www.oracle.com/a/tech/docs/dbsec/data-safe/wp-security-data-safe.pdf, September 2020.
  2. Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. If the current clique algorithms are optimal, so is Valiant’s parser. SIAM Journal on Computing, 47(6):2527-2555, 2018. URL: https://doi.org/10.1137/16M1061771.
  3. Peyman Afshani and Jesper Sindahl Nielsen. Data structure lower bounds for document indexing problems. In 43rd International Colloquium on Automata, Languages and Programming (ICALP 2016), volume 55 of LIPIcs, pages 93:1-93:15, 2016. URL: https://doi.org/10.4230/LIPIcs.ICALP.2016.93.
  4. Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6):333-340, 1975. URL: https://doi.org/10.1145/360825.360855.
  5. Alberto Apostolico and Laxmi Parida. Incremental paradigms of motif discovery. Journal of Computational Biology, 11(1):15-25, 2004. URL: https://doi.org/10.1089/106652704773416867.
  6. Benny Applebaum. Pseudorandom generators with long stretch and low locality from random local one-way functions. SIAM Journal on Computing, 42(5):2008-2037, 2013. URL: https://doi.org/10.1137/120884857.
  7. Hiroki Arimura and Takeaki Uno. An efficient polynomial space and polynomial delay algorithm for enumeration of maximal motifs in a sequence. Journal of Combinatorial Optimization, 13(3):243-262, 2007. URL: https://doi.org/10.1007/s10878-006-9029-1.
  8. David R. Bailey, Alexis J. Battle, Benedict A. Gomes, and P. Pandurang Nayak. Estimating confidence for query revision models, U.S. Patent US7617205B2 (granted to Google), 2009. Google Scholar
  9. Martha Bailey, Connor Cole, Morgan Henderson, and Catherine Massey. How well do automated linking methods perform? Lessons from U.S. historical data. NBER Working Papers 24019, National Bureau of Economic Research, Inc, 2017. URL: https://doi.org/10.3386/w24019.
  10. Giovanni Battaglia, Davide Cangelosi, Roberto Grossi, and Nadia Pisanti. Masking patterns in sequences: A new class of motif discovery with don't cares. Theoretical Computer Science, 410(43):4327-4340, 2009. URL: https://doi.org/10.1016/j.tcs.2009.07.014.
  11. Djamal Belazzougui. Faster and space-optimal edit distance "1" dictionary. In 20th Annual Symposium on Combinatorial Pattern Matching (CPM 2009), volume 5577 of Lecture Notes in Computer Science, pages 154-167. Springer, 2009. URL: https://doi.org/10.1007/978-3-642-02441-2_14.
  12. Djamal Belazzougui and Rossano Venturini. Compressed string dictionary search with edit distance one. Algorithmica, 74(3):1099-1122, 2016. URL: https://doi.org/10.1007/s00453-015-9990-0.
  13. Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards. Theory of Computing Systems, 55(1):41-60, 2014. URL: https://doi.org/10.1007/s00224-013-9498-4.
  14. Allan Borodin, Rafail Ostrovsky, and Yuval Rabani. Lower bounds for high dimensional nearest neighbor search and related problems. In 31st ACM Symposium on Theory of Computing (STOC 1999), pages 312-321, 1999. URL: https://doi.org/10.1145/301250.301330.
  15. Gerth Stølting Brodal and Srinivasan Venkatesh. Improved bounds for dictionary look-up with one error. Information Processing Letters, 75(1-2):57-59, 2000. URL: https://doi.org/10.1016/S0020-0190(00)00079-X.
  16. Chris Calabro, Russell Impagliazzo, and Ramamohan Paturi. The complexity of satisfiability of small depth circuits. In Parameterized and Exact Computation, 4th International Workshop (IWPEC 2009), volume 5917 of Lecture Notes in Computer Science, pages 75-85. Springer, 2009. URL: https://doi.org/10.1007/978-3-642-11269-0_6.
  17. Ho-Leung Chan, Tak Wah Lam, Wing-Kin Sung, Siu-Lung Tam, and Swee-Seong Wong. Compressed indexes for approximate string matching. Algorithmica, 58(2):263-281, 2010. URL: https://doi.org/10.1007/s00453-008-9263-2.
  18. Moses Charikar, Piotr Indyk, and Rina Panigrahy. New algorithms for subset query, partial match, orthogonal range searching, and related problems. In 29th International Colloquium on Automata, Languages and Programming (ICALP 2002), pages 451-462, 2002. URL: https://doi.org/10.1007/3-540-45465-9_39.
  19. Jianer Chen, Xiuzhen Huang, Iyad A. Kanj, and Ge Xia. Strong computational lower bounds via parameterized complexity. Journal of Computer and System Sciences, 72(8):1346-1367, 2006. URL: https://doi.org/10.1016/j.jcss.2006.04.007.
  20. Eden Chlamtáč, Michael Dinitz, Christian Konrad, Guy Kortsarz, and George Rabanca. The densest k-subhypergraph problem. SIAM Journal on Discrete Mathematics, 32(2):1458-1477, 2018. URL: https://doi.org/10.1137/16M1096402.
  21. Eden Chlamtáč, Michael Dinitz, and Yury Makarychev. Minimizing the union: Tight approximations for small set bipartite vertex expansion. In 28th ACM-SIAM Symposium on Discrete Algorithms (SODA 2017), pages 881-899, 2017. URL: https://doi.org/10.1137/1.9781611974782.56.
  22. Peter Christen. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications. Springer, Heidelberg, 2012. URL: https://doi.org/10.1007/978-3-642-31164-2.
  23. Peter Christen, Thilina Ranbaduge, and Rainer Schnell. Linking Sensitive Data. Springer, Heidelberg, 2020. URL: https://doi.org/10.1007/978-3-030-59706-1.
  24. Vincent Cohen-Addad, Laurent Feuilloley, and Tatiana Starikovskaya. Lower bounds for text indexing with mismatches and differences. In 30th ACM-SIAM Symposium on Discrete Algorithms (SODA 2019), pages 1146-1164, 2019. URL: https://doi.org/10.1137/1.9781611975482.70.
  25. Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don't cares. In 36th ACM Symposium on Theory of Computing (STOC 2004), pages 91-100, 2004. URL: https://doi.org/10.1145/1007352.1007374.
  26. Alfredo Cuzzocrea and Hossain Shahriar. Data masking techniques for nosql database security: A systematic review. In 2017 IEEE International Conference on Big Data (BigData 2017), pages 4467-4473, 2017. URL: https://doi.org/10.1109/BigData.2017.8258486.
  27. Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. Journal of the ACM, 31(3):538-544, 1984. URL: https://doi.org/10.1145/828.1884.
  28. Pawel Gawrychowski, Moshe Lewenstein, and Patrick K. Nicholson. Weighted ancestors in suffix trees. In Algorithms - 22th Annual European Symposium (ESA 2014), volume 8737 of Lecture Notes in Computer Science, pages 455-466. Springer, 2014. URL: https://doi.org/10.1007/978-3-662-44777-2_38.
  29. Sreenivas Gollapudi, Samuel Ieong, Alexandros Ntoulas, and Stelios Paparizos. Efficient query rewrite for structured web queries. In 20th ACM International Conference on Information and Knowledge Management (CIKM 2011), pages 2417-2420, 2011. URL: https://doi.org/10.1145/2063576.2063981.
  30. Roberto Grossi, Giulia Menconi, Nadia Pisanti, Roberto Trani, and Søren Vind. Motif trie: An efficient text index for pattern discovery with don't cares. Theoretical Computer Science, 710:74-87, 2018. URL: https://doi.org/10.1016/j.tcs.2017.04.012.
  31. Johan Hastad. Clique is hard to approximate within n^1-ε. Acta Mathematica, 182:105-142, 1999. URL: https://doi.org/10.1007/BF02392825.
  32. Thomas N. Herzog, Fritz J. Scheuren, and William E. Winkler. Data quality and record linkage techniques. Springer, 2007. Google Scholar
  33. Ihab F. Ilyas, George Beskales, and Mohamed A. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys, 40(4), 2008. URL: https://doi.org/10.1145/1391729.1391730.
  34. Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367-375, 2001. URL: https://doi.org/10.1006/jcss.2000.1727.
  35. T. S. Jayram, Subhash Khot, Ravi Kumar, and Yuval Rabani. Cell-probe lower bounds for the partial match problem. Journal of Computer and System Sciences, 69(3):435-447, 2004. URL: https://doi.org/10.1016/j.jcss.2004.04.006.
  36. Dimitrios Karapiperis, Aris Gkoulalas-Divanis, and Vassilios S. Verykios. Summarizing and linking electronic health records. Distributed and Parallel Databases, pages 1-40, 2019. URL: https://doi.org/10.1007/s10619-019-07263-0.
  37. Richard M. Karp. Reducibility among combinatorial problems. In 50 Years of Integer Programming 1958-2008 - From the Early Years to the State-of-the-Art, pages 219-241. Springer, 2010. URL: https://doi.org/10.1007/978-3-540-68279-0_8.
  38. Hans Kellerer, Ulrich Pferschy, and David Pisinger. The Multiple-Choice Knapsack Problem, pages 317-347. Springer Berlin Heidelberg, 2004. URL: https://doi.org/10.1007/978-3-540-24777-7_11.
  39. Pradap Konda, Sanjib Das, Paul Suganthan G.C., Philip Martinkus, Adel Ardalan, Jeffrey R. Ballard, Yash Govind, Han Li, Fatemah Panahi, Haojun Zhang, Jeff Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, and Vijay Raghavendra. Technical perspective: Toward building entity matching management systems. SIGMOD Record, 47(1):33-40, 2018. URL: https://doi.org/10.1145/3277006.3277015.
  40. Hye-Chung Kum, Ashok Krishnamurthy, Ashwin Machanavajjhala, Michael K. Reiter, and Stanley Ahalt. Privacy preserving interactive record linkage (PPIRL). Journal of the American Medical Informatics Association, 21(2):212-220, 2014. URL: https://doi.org/10.1136/amiajnl-2013-002165.
  41. Hye-Chung Kum, Eric D. Ragan, Gurudev Ilangovan, Mahin Ramezani, Qinbo Li, and Cason Schmit. Enhancing privacy through an interactive on-demand incremental information disclosure interface: Applying privacy-by-design to record linkage. In Fifteenth USENIX Conference on Usable Privacy and Security, pages 175-189, 2019. URL: https://doi.org/10.5555/3361476.3361489.
  42. Prathyusha Senthil Kumar, Praveen Arasada, and Ravi Chandra Jammalamadaka. Systems and methods for generating search query rewrites, U.S. Patent US10108712B2 (granted to ebay), 2018. Google Scholar
  43. Moshe Lewenstein, J. Ian Munro, Venkatesh Raman, and Sharma V. Thankachan. Less space: Indexing for queries with wildcards. Theoretical Computer Science, 557:120-127, 2014. URL: https://doi.org/10.1016/j.tcs.2014.09.003.
  44. Moshe Lewenstein, Yakov Nekrich, and Jeffrey Scott Vitter. Space-efficient string indexing for wildcard pattern matching. In 31st Symposium on Theoretical Aspects of Computer Science (STACS 2014), pages 506-517, 2014. URL: https://doi.org/10.4230/LIPIcs.STACS.2014.506.
  45. Andrea Lincoln, Virginia Vassilevska Williams, and R. Ryan Williams. Tight hardness for shortest cycles and paths in sparse graphs. In 29th ACM-SIAM Symposium on Discrete Algorithms (SODA 2018), pages 1236-1252, 2018. URL: https://doi.org/10.1137/1.9781611975031.80.
  46. Peter Bro Miltersen, Noam Nisan, Shmuel Safra, and Avi Wigderson. On data structures and asymmetric communication complexity. Journal of Computer and System Sciences, 57(1):37-49, 1998. URL: https://doi.org/10.1006/jcss.1998.1577.
  47. George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, and Themis Palpanas. Blocking and filtering techniques for entity resolution: A survey. ACM Computing Surveys, 53(2), 2020. URL: https://doi.org/10.1145/3377455.
  48. Laxmi Parida, Isidore Rigoutsos, Aris Floratos, Daniel E. Platt, and Yuan Gao. Pattern discovery on character sets and real-valued data: Linear bound on irredundant motifs and an efficient polynomial time algorithm. In 11th ACM-SIAM Symposium on Discrete Algorithms (SODA 2000), pages 297-308, 2000. URL: https://doi.org/10.1145/338219.338266.
  49. Nadia Pisanti, Maxime Crochemore, Roberto Grossi, and Marie-France Sagot. Bases of motifs for generating repeated patterns with wild cards. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(1):40-50, 2005. URL: https://doi.org/10.1109/TCBB.2005.5.
  50. Mihai Pǎtraşcu. Unifying the landscape of cell-probe lower bounds. SIAM Journal on Computing, 40(3):827-847, 2011. URL: https://doi.org/10.1137/09075336X.
  51. Mihai Pǎtraşcu and Mikkel Thorup. Higher lower bounds for near-neighbor and further rich problems. SIAM Journal on Computing, 39(2):730-741, 2009. URL: https://doi.org/10.1137/070684859.
  52. Eric D. Ragan, Hye-Chung Kum, Gurudev Ilangovan, and Han Wang. Balancing privacy and information disclosure in interactive record linkage with visual masking. In ACM Conference on Human Factors in Computing Systems (CHI 2018), 2018. URL: https://doi.org/10.1145/3173574.3173900.
  53. Ronald L. Rivest. Partial-match retrieval algorithms. SIAM Journal on Computing, 5(1):19-50, 1976. URL: https://doi.org/10.1137/0205003.
  54. Pierangela Samarati and Latanya Sweeney. Generalizing data to provide anonymity when disclosing information (abstract). In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 1998), page 188. Association for Computing Machinery, 1998. URL: https://doi.org/10.1145/275487.275508.
  55. Pierangela Samarati and Latanya Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, Computer Science Laboratory, SRI International, 1998. Google Scholar
  56. Ricardo Jorge Santos, Jorge Bernardino, and Marco Vieira. A data masking technique for data warehouses. In 15th International Database Engineering and Applications Symposium (IDEAS 2011), pages 61-69, 2011. URL: https://doi.org/10.1145/2076623.2076632.
  57. Latanya Sweeney. Computational disclosure control: a primer on data privacy protection. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2001. URL: http://hdl.handle.net/1721.1/8589.
  58. Zehong Tan, Canran Xu, Mengjie Jiang, Hua Yang, and Xiaoyuan Wu. Query rewrite for null and low search results in ecommerce. In SIGIR Workshop On eCommerce, volume 2311 of CEUR Workshop Proceedings, 2017. Google Scholar
  59. Yufei Tao. Entity matching with active monotone classification. In 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2018), pages 49-62, 2018. URL: https://doi.org/10.1145/3196959.3196984.
  60. Dinusha Vatsalan and Peter Christen. Scalable privacy-preserving record linkage for multiple databases. In 23rd ACM International Conference on Information and Knowledge Management (CIKM 2014), pages 1795-1798, 2014. URL: https://doi.org/10.1145/2661829.2661875.
  61. Dinusha Vatsalan, Ziad Sehili, Peter Christen, and Erhard Rahm. Privacy-preserving record linkage for Big Data: Current approaches and research challenges. In Handbook of Big Data Technologies, pages 851-895. Springer, 2017. URL: https://doi.org/10.1007/978-3-319-49340-4.
  62. Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), pages 1-11. IEEE Computer Society, 1973. URL: https://doi.org/10.1109/SWAT.1973.13.
  63. Virginia Vassilevska Williams. On some fine-grained questions in algorithms and complexity. In 2018 International Congress of Mathematicians (ICM), pages 3447-3487, 2019. URL: https://doi.org/10.1142/9789813272880_0188.
  64. Andrew Chi-Chih Yao and Frances F. Yao. Dictionary look-up with one error. Journal of Algorithms, 25(1):194-202, 1997. URL: https://doi.org/10.1006/jagm.1997.0875.
  65. David Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory of Computing, 3(1):103-128, 2007. URL: https://doi.org/10.4086/toc.2007.v003a006.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail