Pattern Matching with Mismatches and Wildcards

Authors: Gabriel Bathie, Panagiotis Charalampopoulos, Tatiana Starikovskaya



File

LIPIcs.ESA.2024.20.pdf
  • Filesize: 0.83 MB
  • 15 pages

Author Details

Gabriel Bathie
  • DIENS, École normale supérieure de Paris, PSL Research University, France
  • LaBRI, Université de Bordeaux, Talence, France
Panagiotis Charalampopoulos
  • Birkbeck, University of London, UK
Tatiana Starikovskaya
  • DIENS, École normale supérieure de Paris, PSL Research University, France

Cite As

Gabriel Bathie, Panagiotis Charalampopoulos, and Tatiana Starikovskaya. Pattern Matching with Mismatches and Wildcards. In 32nd Annual European Symposium on Algorithms (ESA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 308, pp. 20:1-20:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ESA.2024.20

Abstract

In this work, we address the problem of approximate pattern matching with wildcards. Given a pattern P of length m containing D wildcards, a text T of length n, and an integer k, our objective is to identify all fragments of T within Hamming distance k from P. Our primary contribution is an algorithm with runtime 𝒪(n + (D+k)(G+k)⋅ n/m) for this problem. Here, G ≤ D represents the number of maximal wildcard fragments in P. We derive this algorithm by elaborating in a non-trivial way on the ideas presented by [Charalampopoulos, Kociumaka, and Wellnitz, FOCS'20] for pattern matching with mismatches (without wildcards). Our algorithm improves over the state of the art when D, G, and k are small relative to n. For instance, if m = n/2, k = G = n^{2/5}, and D = n^{3/5}, our algorithm operates in 𝒪(n) time, surpassing the Ω(n^{6/5}) time requirement of all previously known algorithms. In the case of exact pattern matching with wildcards (k = 0), we present a much simpler algorithm with runtime 𝒪(n + DG ⋅ n/m) that clearly illustrates our main technical innovation: the utilisation of positions of P that do not belong to any fragment of P with a density of wildcards much larger than D/m as anchors for the sought (approximate) occurrences. Notably, our algorithm outperforms the best-known 𝒪(n log m)-time FFT-based algorithms of [Cole and Hariharan, STOC'02] and [Clifford and Clifford, IPL'07] if DG = o(m log m). We complement our algorithmic results with a structural characterization of the k-mismatch occurrences of P. We demonstrate that in a text of length 𝒪(m), these occurrences can be partitioned into 𝒪((D+k)(G+k)) arithmetic progressions. Additionally, we construct an infinite family of examples with Ω((D+k)k) arithmetic progressions of occurrences, leveraging a combinatorial result on progression-free sets [Elkin, SODA'10].
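To make the problem statement and the parameters D and G concrete, here is a minimal Python sketch. It is not the algorithm of the paper: it is a naive 𝒪(nm)-time reference that only spells out the definition of a k-mismatch occurrence with wildcards. The wildcard symbol `?` and the function names `wildcard_stats` and `k_mismatch_occurrences` are assumptions made for illustration; positions are 0-indexed.

```python
from typing import List, Tuple

WILDCARD = "?"  # assumed wildcard symbol; the abstract does not fix one


def wildcard_stats(pattern: str) -> Tuple[int, int]:
    """Return (D, G): the number of wildcards in the pattern and the
    number of maximal fragments consisting only of wildcards."""
    d = pattern.count(WILDCARD)
    # A maximal wildcard fragment starts at each wildcard that is not
    # immediately preceded by another wildcard.
    g = sum(
        1
        for i, c in enumerate(pattern)
        if c == WILDCARD and (i == 0 or pattern[i - 1] != WILDCARD)
    )
    return d, g


def k_mismatch_occurrences(pattern: str, text: str, k: int) -> List[int]:
    """Naive reference: starting positions i such that text[i:i+m] is within
    Hamming distance k of the pattern, where a wildcard matches any character."""
    m, n = len(pattern), len(text)
    result = []
    for i in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if pattern[j] != WILDCARD and pattern[j] != text[i + j]:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already exceeds the budget
        if mismatches <= k:
            result.append(i)
    return result
```

For example, the pattern `a??b?c` has D = 3 wildcards grouped into G = 2 maximal fragments, and calling `k_mismatch_occurrences("a?c", "abcaacabd", k)` with k = 0 and k = 1 shows how raising the mismatch budget admits additional alignments.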

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • pattern matching
  • wildcards
  • mismatches
  • Hamming distance

References

  1. Karl R. Abrahamson. Generalized string matching. SIAM Journal on Computing, 16(6):1039-1051, 1987.
  2. Tatsuya Akutsu. Approximate string matching with don't care characters. Inf. Process. Lett., 55(5):235-239, 1995. URL: https://doi.org/10.1016/0020-0190(95)00111-O.
  3. Amihood Amir, Moshe Lewenstein, and Ely Porat. Faster algorithms for string matching with k mismatches. J. Algorithms, 50(2):257-275, 2004. URL: https://doi.org/10.1016/S0196-6774(03)00097-X.
  4. Gabriel Bathie, Panagiotis Charalampopoulos, and Tatiana Starikovskaya. Longest common extensions with wildcards: Trade-off and applications. In ESA, 2024. To appear.
  5. Felix A. Behrend. On sets of integers which contain no three terms in arithmetical progression. Proceedings of the National Academy of Sciences, 32(12):331-332, 1946.
  6. Francine Blanchet-Sadri and Justin Lazarow. Suffix trees for partial words and the longest common compatible prefix problem. In LATA, volume 7810, pages 165-176. Springer, 2013. URL: https://doi.org/10.1007/978-3-642-37064-9_16.
  7. Karl Bringmann, Philip Wellnitz, and Marvin Künnemann. Few matches or almost periodicity: Faster pattern matching with mismatches in compressed texts. In SODA, pages 1126-1145, 2019. URL: https://doi.org/10.1137/1.9781611975482.69.
  8. Timothy M. Chan, Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, and Ely Porat. Approximating text-to-pattern Hamming distances. In STOC, pages 643-656, 2020. URL: https://doi.org/10.1145/3357713.3384266.
  9. Timothy M. Chan, Ce Jin, Virginia Vassilevska Williams, and Yinzhan Xu. Faster algorithms for text-to-pattern Hamming distances. In FOCS, pages 2188-2203, 2023. URL: https://doi.org/10.1109/FOCS57990.2023.00136.
  10. Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszynski, Tomasz Walen, and Wiktor Zuba. Circular pattern matching with k mismatches. J. Comput. Syst. Sci., 115:73-85, 2021. URL: https://doi.org/10.1016/j.jcss.2020.07.003.
  11. Panagiotis Charalampopoulos, Tomasz Kociumaka, and Philip Wellnitz. Faster approximate pattern matching: A unified approach. In FOCS, pages 978-989, 2020. URL: https://doi.org/10.1109/FOCS46700.2020.00095.
  12. Peter Clifford and Raphaël Clifford. Simple deterministic wildcard matching. Inf. Process. Lett., 101(2):53-54, 2007. URL: https://doi.org/10.1016/j.ipl.2006.08.002.
  13. Raphaël Clifford, Klim Efremenko, Ely Porat, and Amir Rothschild. From coding theory to efficient pattern matching. In SODA, pages 778-784, 2009.
  14. Raphaël Clifford, Klim Efremenko, Ely Porat, and Amir Rothschild. Pattern matching with don't cares and few errors. J. Comput. Syst. Sci., 76(2):115-124, 2010. URL: https://doi.org/10.1016/j.jcss.2009.06.002.
  15. Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana Starikovskaya. The k-mismatch problem revisited. In SODA, pages 2039-2052, 2016. URL: https://doi.org/10.1137/1.9781611974331.CH142.
  16. Raphaël Clifford and Ely Porat. A filtering algorithm for k-mismatch with don't cares. Inf. Process. Lett., 110(22):1021-1025, 2010. URL: https://doi.org/10.1016/j.ipl.2010.08.012.
  17. Richard Cole and Ramesh Hariharan. Verifying candidate matches in sparse and wildcard matching. In STOC, pages 592-601, 2002. URL: https://doi.org/10.1145/509907.509992.
  18. Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on Strings. Cambridge University Press, 2007. URL: https://doi.org/10.1017/cbo9780511546853.
  19. Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Marcin Kubica, Alessio Langiu, Jakub Radoszewski, Wojciech Rytter, Bartosz Szreder, and Tomasz Walen. A note on the longest common compatible prefix problem for partial words. J. Discrete Algorithms, 34:49-53, 2015. URL: https://doi.org/10.1016/J.JDA.2015.05.003.
  20. Michael Elkin. An improved construction of progression-free sets. In SODA, pages 886-905, 2010. URL: https://doi.org/10.1137/1.9781611973075.72.
  21. Paul Erdős and Paul Turán. On some sequences of integers. Journal of the London Mathematical Society, s1-11(4):261-264, 1936. URL: https://doi.org/10.1112/jlms/s1-11.4.261.
  22. Nathan J. Fine and Herbert S. Wilf. Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society, 16(1):109-114, 1965. URL: https://doi.org/10.2307/2034009.
  23. Paweł Gawrychowski, Adam Karczmarz, Tomasz Kociumaka, Jakub Łącki, and Piotr Sankowski. Optimal dynamic strings. In SODA, pages 1509-1528, 2018. URL: https://doi.org/10.1137/1.9781611975031.99.
  24. Paweł Gawrychowski and Przemysław Uznański. Towards unified approximate pattern matching for Hamming and L_1 distance. In ICALP, pages 62:1-62:13, 2018. URL: https://doi.org/10.4230/LIPIcs.ICALP.2018.62.
  25. Yijie Han. Deterministic sorting in O(n log log n) time and linear space. J. Algorithms, 50(1):96-105, 2004. URL: https://doi.org/10.1016/j.jalgor.2003.09.001.
  26. Piotr Indyk. Faster algorithms for string matching problems: Matching the convolution bound. In FOCS, pages 166-173, 1998. URL: https://doi.org/10.1109/SFCS.1998.743440.
  27. Ce Jin and Jakob Nogler. Quantum speed-ups for string synchronizing sets, longest common substring, and k-mismatch matching. In SODA, pages 5090-5121, 2023. URL: https://doi.org/10.1137/1.9781611977554.ch186.
  28. Ce Jin and Yinzhan Xu. Shaving logs via large sieve inequality: Faster algorithms for sparse convolution and more. In STOC, pages 1573-1584, 2024. URL: https://doi.org/10.1145/3618260.3649605.
  29. Adam Kalai. Efficient pattern-matching with don't cares. In SODA, pages 655-656, 2002.
  30. Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323-350, 1977.
  31. Tomasz Kociumaka, Ely Porat, and Tatiana Starikovskaya. Small-space and streaming pattern matching with k edits. In FOCS, pages 885-896, 2021. URL: https://doi.org/10.1109/FOCS52979.2021.00090.
  32. S. Rao Kosaraju. Efficient string matching. Unpublished manuscript, 1987.
  33. Gad M. Landau and Uzi Vishkin. Fast parallel and serial approximate string matching. J. Algorithms, 10(2):157-169, 1989. URL: https://doi.org/10.1016/0196-6774(89)90010-2.
  34. Udi Manber and Ricardo A. Baeza-Yates. An algorithm for string matching with a sequence of don't cares. Inf. Process. Lett., 37(3):133-136, 1991. URL: https://doi.org/10.1016/0020-0190(91)90032-D.
  35. Gerhard Mehldau and Gene Myers. A system for pattern matching applications on biosequences. Bioinformatics, 9(3):299-314, 1993. URL: https://doi.org/10.1093/bioinformatics/9.3.299.
  36. Michael J. Fischer and Michael S. Paterson. String-matching and other products. In SCC, pages 113-125, 1974.
  37. Marius Nicolae and Sanguthevar Rajasekaran. On string matching with mismatches. Algorithms, 8(2):248-270, 2015. URL: https://doi.org/10.3390/a8020248.
  38. Marius Nicolae and Sanguthevar Rajasekaran. On pattern matching with k mismatches and few don't cares. Inf. Process. Lett., 118:78-82, 2017. URL: https://doi.org/10.1016/j.ipl.2016.10.003.