Longest Common Extensions with Wildcards: Trade-Off and Applications

Authors: Gabriel Bathie, Panagiotis Charalampopoulos, Tatiana Starikovskaya


File

LIPIcs.ESA.2024.19.pdf
  • Filesize: 0.91 MB
  • 17 pages

Author Details

Gabriel Bathie
  • DIENS, École normale supérieure de Paris, PSL Research University, France
  • LaBRI, Université de Bordeaux, Talence, France
Panagiotis Charalampopoulos
  • Birkbeck, University of London, UK
Tatiana Starikovskaya
  • DIENS, École normale supérieure de Paris, PSL Research University, France

Cite As

Gabriel Bathie, Panagiotis Charalampopoulos, and Tatiana Starikovskaya. Longest Common Extensions with Wildcards: Trade-Off and Applications. In 32nd Annual European Symposium on Algorithms (ESA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 308, pp. 19:1-19:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ESA.2024.19

Abstract

We study the Longest Common Extension (LCE) problem in a string containing wildcards. Wildcards (also called "don't cares" or "holes") are special characters that match any other character in the alphabet, similar to the character "?" in Unix commands or "." in regular expression engines. We consider the problem parameterized by G, the number of maximal contiguous groups of wildcards in the input string.

Our main contribution is a simple data structure for this problem that can be built in O(n (G/t) log n) time, occupies O(nG/t) space, and answers queries in O(t) time, for any t ∈ [1 .. G]. Up to the O(log n) factor, this interpolates smoothly between the data structure of Crochemore et al. [JDA 2015], which has O(nG) preprocessing time and space, and O(1) query time, and a simple solution based on the "kangaroo jumping" technique [Landau and Vishkin, STOC 1986], which has O(n) preprocessing time and space, and O(G) query time. By establishing a connection between this problem and Boolean matrix multiplication, we show that our solution is optimal up to subpolynomial factors when G = Ω(n) under a widely believed hypothesis. In addition, we develop a new simple, deterministic and combinatorial algorithm for sparse Boolean matrix multiplication.

Finally, we show that our data structure can be used to obtain efficient algorithms for approximate pattern matching and structural analysis of strings with wildcards. First, we consider the problem of pattern matching with k errors (i.e., edit operations) in the setting where both the pattern and the text may contain wildcards. The "kangaroo jumping" technique can be adapted to yield an algorithm for this problem with runtime O(n(k + G)), where G is the total number of maximal contiguous groups of wildcards in the text and the pattern and n is the length of the text. By combining "kangaroo jumping" with a tailor-made data structure for LCE queries, Akutsu [IPL 1995] devised an O(n√(km) polylog m)-time algorithm. We improve on both algorithms when k ≪ G ≪ m by giving an algorithm with runtime O(n(k + √(Gk log n))). Secondly, we give O(n√G log n)-time and O(n)-space algorithms for computing the prefix array, as well as the quantum/deterministic border and period arrays of a string with wildcards. This is an improvement over the O(n√(n log n))-time algorithms of Iliopoulos and Radoszewski [CPM 2016] when G = O(n / log n).
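
To make the "kangaroo jumping" baseline from the abstract concrete, here is a minimal Python sketch of an LCE query on a string with wildcards. It is an illustration under stated assumptions, not the data structure from the paper: the wildcard symbol `?`, the class name `KangarooLCE`, and the helper `naive_lce` are all hypothetical, and the exact-match LCE oracle is a simple doubling-rank table with O(log n) query time rather than the constant-time structure that the O(G) query bound presumes. Each jump skips one whole wildcard group, so a query makes at most about 2G jumps.

```python
WILDCARD = "?"  # assumption: "?" denotes a wildcard and is not a regular letter


def naive_lce(s, i, j):
    """Reference answer: scan letter by letter, O(n) per query."""
    length = 0
    while max(i, j) + length < len(s):
        a, b = s[i + length], s[j + length]
        if a != b and WILDCARD not in (a, b):
            break
        length += 1
    return length


class KangarooLCE:
    """Illustrative LCE structure for strings with wildcards (hypothetical name)."""

    def __init__(self, s):
        self.s = s
        n = len(s)
        # ranks[k][i] identifies the substring s[i : i + 2**k]; equal ranks
        # mean equal substrings (wildcards are treated as an ordinary letter).
        self.ranks = [[ord(c) for c in s]]
        k = 1
        while (1 << k) <= n:
            prev, half = self.ranks[-1], 1 << (k - 1)
            pairs = [(prev[i], prev[i + half]) for i in range(n - (1 << k) + 1)]
            ids = {p: r for r, p in enumerate(sorted(set(pairs)))}
            self.ranks.append([ids[p] for p in pairs])
            k += 1
        # group_end[i]: first position after the wildcard group containing i,
        # or i itself when s[i] is not a wildcard.
        self.group_end = list(range(n + 1))
        for i in range(n - 1, -1, -1):
            if s[i] == WILDCARD:
                self.group_end[i] = self.group_end[i + 1]

    def _exact_lce(self, i, j):
        """LCE where a wildcard only matches another wildcard, O(log n)."""
        n, length = len(self.s), 0
        for k in range(len(self.ranks) - 1, -1, -1):
            step = 1 << k
            if (max(i, j) + length + step <= n
                    and self.ranks[k][i + length] == self.ranks[k][j + length]):
                length += step
        return length

    def lce(self, i, j):
        """LCE with wildcards: alternate exact extensions and group jumps."""
        n, length = len(self.s), 0
        while True:
            length += self._exact_lce(i + length, j + length)
            a, b = i + length, j + length
            if max(a, b) >= n:
                return length  # ran off the end of the string
            if self.s[a] != WILDCARD and self.s[b] != WILDCARD:
                return length  # genuine mismatch between two letters
            # Exactly one side is a wildcard here: jump over its whole group.
            skip = max(self.group_end[a] - a, self.group_end[b] - b)
            length += min(skip, n - max(a, b))
```

For instance, on the string `ab?ab??b`, `KangarooLCE("ab?ab??b").lce(0, 3)` returns 5, the same value as `naive_lce("ab?ab??b", 0, 3)`. Treating wildcards as an ordinary letter in the exact oracle is a design choice of this sketch: it lets two aligned wildcard groups be consumed without any jump, so jumps are only needed where a wildcard faces a regular letter.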

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • Longest common prefix
  • longest common extension
  • wildcards
  • Boolean matrix multiplication
  • approximate pattern matching
  • periodicity arrays

References

  1. Amir Abboud, Karl Bringmann, Nick Fischer, and Marvin Künnemann. The time complexity of fully sparse matrix multiplication. In Proc. of SODA 2024, pages 4670-4703, 2024. URL: https://doi.org/10.1137/1.9781611977912.167.
  2. Amir Abboud, Virginia Vassilevska Williams, and Oren Weimann. Consequences of faster alignment of sequences. In Proc. of ICALP 2014, pages 39-51, 2014. URL: https://doi.org/10.1007/978-3-662-43948-7_4.
  3. Peyman Afshani and Jesper Sindahl Nielsen. Data structure lower bounds for document indexing problems. In Proc. of ICALP 2016, pages 93:1-93:15, 2016. URL: https://doi.org/10.4230/LIPICS.ICALP.2016.93.
  4. Tatsuya Akutsu. Approximate string matching with don't care characters. Inf. Process. Lett., 55(5):235-239, 1995. URL: https://doi.org/10.1016/0020-0190(95)00111-O.
  5. Amihood Amir, Moshe Lewenstein, and Ely Porat. Faster algorithms for string matching with k mismatches. Journal of Algorithms, 50(2):257-275, 2004.
  6. Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. A new characterization of maximal repetitions by Lyndon trees. In Proc. of SODA 2015, pages 562-571, 2015. URL: https://doi.org/10.1137/1.9781611973730.38.
  7. Gabriel Bathie, Panagiotis Charalampopoulos, and Tatiana Starikovskaya. Pattern matching with mismatches and wildcards. CoRR, 2024. URL: https://doi.org/10.48550/arXiv.2402.07732.
  8. Philip Bille, Anders Roy Christiansen, Patrick Hagge Cording, and Inge Li Gørtz. Finger search in grammar-compressed strings. Theory of Computing Systems, 62:1715-1735, 2018. URL: https://doi.org/10.1007/S00224-017-9839-9.
  9. Philip Bille, Inge Li Gørtz, Patrick Hagge Cording, Benjamin Sach, Hjalte Wedel Vildhøj, and Søren Vind. Fingerprints in compressed strings. Journal of Computer and System Sciences, 86:171-180, 2017. URL: https://doi.org/10.1016/J.JCSS.2017.01.002.
  10. Philip Bille, Inge Li Gørtz, Mathias Bæk Tejs Knudsen, Moshe Lewenstein, and Hjalte Wedel Vildhøj. Longest common extensions in sublinear space. In Proc. of CPM 2015, pages 65-76, 2015. URL: https://doi.org/10.1007/978-3-319-19929-0_6.
  11. Philip Bille, Inge Li Gørtz, Benjamin Sach, and Hjalte Wedel Vildhøj. Time-space trade-offs for longest common extensions. Journal of Discrete Algorithms, 25:42-50, 2014. URL: https://doi.org/10.1016/J.JDA.2013.06.003.
  12. Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards. Theory Comput. Syst., 55(1):41-60, 2014. URL: https://doi.org/10.1007/S00224-013-9498-4.
  13. Or Birenzwige, Shay Golan, and Ely Porat. Locally consistent parsing for text indexing in small space. In Proc. of SODA 2020, pages 607-626, 2020. URL: https://doi.org/10.1137/1.9781611975994.37.
  14. Francine Blanchet-Sadri, Rachel Harred, and Justin Lazarow. Longest common extensions in partial words. In Proc. of IWOCA 2015, volume 9538 of Lecture Notes in Computer Science, pages 52-64. Springer, 2015. URL: https://doi.org/10.1007/978-3-319-29516-9_5.
  15. Francine Blanchet-Sadri and Justin Lazarow. Suffix trees for partial words and the longest common compatible prefix problem. In Proc. of LATA 2013, pages 165-176, 2013. URL: https://doi.org/10.1007/978-3-642-37064-9_16.
  16. Francine Blanchet-Sadri and S. Osborne. Computing longest common extensions in partial words. Discret. Appl. Math., 246:119-139, 2018. URL: https://doi.org/10.1016/J.DAM.2016.06.007.
  17. Béla Bollobás and Shoham Letzter. Longest common extension. Eur. J. Comb., 68:242-248, 2018. URL: https://doi.org/10.1016/J.EJC.2017.07.019.
  18. Panagiotis Charalampopoulos, Tomasz Kociumaka, and Philip Wellnitz. Faster approximate pattern matching: A unified approach. In Proc. of FOCS 2020, pages 978-989, 2020.
  19. Panagiotis Charalampopoulos, Solon P. Pissis, and Jakub Radoszewski. Longest palindromic substring in sublinear time. In Proc. of CPM 2022, volume 223, pages 20:1-20:9, 2022. URL: https://doi.org/10.4230/LIPICS.CPM.2022.20.
  20. Peter Clifford and Raphaël Clifford. Simple deterministic wildcard matching. Information Processing Letters, 101(2):53-54, 2007.
  21. Raphaël Clifford, Klim Efremenko, Ely Porat, and Amir Rothschild. From coding theory to efficient pattern matching. In Proc. of SODA 2009, pages 778-784, 2009.
  22. Raphaël Clifford, Klim Efremenko, Ely Porat, and Amir Rothschild. Pattern matching with don't cares and few errors. J. Comput. Syst. Sci., 76(2):115-124, 2010. URL: https://doi.org/10.1016/j.jcss.2009.06.002.
  23. Raphaël Clifford, Allan Grønlund, Kasper Green Larsen, and Tatiana Starikovskaya. Upper and lower bounds for dynamic data structures on strings. In Proc. of STACS 2018, pages 22:1-22:14, 2018. URL: https://doi.org/10.4230/LIPICS.STACS.2018.22.
  24. Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don't cares. In Proc. of STOC 2004, pages 91-100, 2004. URL: https://doi.org/10.1145/1007352.1007374.
  25. Richard Cole and Ramesh Hariharan. Verifying candidate matches in sparse and wildcard matching. In Proc. of STOC 2002, pages 592-601, 2002. URL: https://doi.org/10.1145/509907.509992.
  26. Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
  27. Maxime Crochemore, Costas S Iliopoulos, Tomasz Kociumaka, Marcin Kubica, Alessio Langiu, Jakub Radoszewski, Wojciech Rytter, Bartosz Szreder, and Tomasz Waleń. A note on the longest common compatible prefix problem for partial words. Journal of Discrete Algorithms, 34:49-53, 2015. URL: https://doi.org/10.1016/J.JDA.2015.05.003.
  28. Maxime Crochemore and Wojciech Rytter. Jewels of stringology. World Scientific, 2002. URL: https://doi.org/10.1142/4838.
  29. Johannes Fischer and Volker Heun. Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In Proc. of CPM 2006, pages 36-48, 2006.
  30. Nick Fischer. Deterministic sparse pattern matching via the Baur-Strassen theorem. In Proc. of SODA 2024, pages 3333-3353, 2024. URL: https://doi.org/10.1137/1.9781611977912.119.
  31. Zvi Galil and Raffaele Giancarlo. Improved string matching with k mismatches. ACM SIGACT News, 17(4):52-54, 1986.
  32. Pawel Gawrychowski and Tomasz Kociumaka. Sparse suffix tree construction in optimal time and space. In Proc. of SODA 2017, pages 425-439, 2017. URL: https://doi.org/10.1137/1.9781611974782.27.
  33. Pawel Gawrychowski and Przemyslaw Uznanski. Towards unified approximate pattern matching for Hamming and L_1 distance. In Proc. of ICALP 2018, volume 107 of LIPIcs, pages 62:1-62:13, 2018. URL: https://doi.org/10.4230/LIPICS.ICALP.2018.62.
  34. Paweł Gawrychowski, Adam Karczmarz, Tomasz Kociumaka, Jakub Łącki, and Piotr Sankowski. Optimal dynamic strings. In Proc. of SODA 2018, pages 1509-1528, 2018. URL: https://doi.org/10.1137/1.9781611975031.99.
  35. Shay Golan, Tsvi Kopelowitz, and Ely Porat. Streaming pattern matching with d wildcards. Algorithmica, 81(5):1988-2015, 2019. URL: https://doi.org/10.1007/S00453-018-0521-7.
  36. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. URL: https://doi.org/10.1017/cbo9780511574931.
  37. Fred G. Gustavson. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM Trans. Math. Softw., 4(3):250-269, September 1978. URL: https://doi.org/10.1145/355791.355796.
  38. Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338-355, 1984. URL: https://doi.org/10.1137/0213024.
  39. Jan Holub and William F. Smyth. Algorithms on indeterminate strings. In Proc. of IWOCA 2003, pages 36-45, 2003.
  40. Costas S. Iliopoulos, Manal Mohamed, Laurent Mouchard, Katerina Perdikuri, William F. Smyth, and Athanasios K. Tsakalidis. String regularities with don't cares. In Proc. of PSC 2002, pages 65-74, 2002. URL: http://www.stringology.org/event/2002/p8.html.
  41. Costas S. Iliopoulos and Jakub Radoszewski. Truly Subquadratic-Time Extension Queries and Periodicity Detection in Strings with Uncertainties. In Proc. of CPM 2016, volume 54, pages 8:1-8:12, 2016. URL: https://doi.org/10.4230/LIPIcs.CPM.2016.8.
  42. Piotr Indyk. Faster algorithms for string matching problems: Matching the convolution bound. In Proc. of FOCS 1998, pages 166-173, 1998. URL: https://doi.org/10.1109/SFCS.1998.743440.
  43. Adam Kalai. Efficient pattern-matching with don't cares. In Proc. of SODA 2002, pages 655-656, 2002.
  44. Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Proc. of STOC 2019, pages 756-767, 2019. URL: https://doi.org/10.1145/3313276.3316368.
  45. Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In Proc. of FOCS 2023, pages 1877-1886, 2023. URL: https://doi.org/10.1109/FOCS57990.2023.00114.
  46. Dominik Kempa and Barna Saha. An upper bound and linear-space queries on the LZ-end parsing. In Proc. of SODA 2022, pages 2847-2866, 2022. URL: https://doi.org/10.1137/1.9781611977073.111.
  47. Tomasz Kociumaka. Efficient Data Structures for Internal Queries in Texts. PhD thesis, University of Warsaw, October 2018. URL: https://www.mimuw.edu.pl/~kociumaka/files/phd.pdf.
  48. Roman Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In Proc. of FOCS 1999, pages 596-604, 1999.
  49. Roman Kolpakov and Gregory Kucherov. Searching for gapped palindromes. Theoretical Computer Science, 410(51):5365-5373, 2009.
  50. Dmitry Kosolobov. Tight lower bounds for the longest common extension problem. Information Processing Letters, 125:26-29, 2017. URL: https://doi.org/10.1016/J.IPL.2017.05.003.
  51. Dmitry Kosolobov and Nikita Sivukhin. Construction of sparse suffix trees and LCE indexes in optimal time and space. CoRR, abs/2105.03782, 2021. URL: https://arxiv.org/abs/2105.03782.
  52. Marvin Künnemann. On nondeterministic derandomization of Freivalds' algorithm: Consequences, avenues and algorithmic progress. In Proc. of ESA 2018, volume 112 of LIPIcs, pages 56:1-56:16, 2018. URL: https://doi.org/10.4230/LIPIcs.ESA.2018.56.
  53. Konstantin Kutzkov. Deterministic algorithms for skewed matrix products. In Proc. of STACS 2013, volume 20 of LIPIcs, pages 466-477, 2013. URL: https://doi.org/10.4230/LIPIcs.STACS.2013.466.
  54. Gad M. Landau and Uzi Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science, 43:239-249, 1986.
  55. Gad M. Landau and Uzi Vishkin. Introducing efficient parallelism into approximate string matching and a new serial algorithm. In Proc. of STOC 1986, pages 220-230, 1986. URL: https://doi.org/10.1145/12130.12152.
  56. Florin Manea, Robert Mercas, and Catalin Tiseanu. An algorithmic toolbox for periodic partial words. Discret. Appl. Math., 179:174-192, 2014. URL: https://doi.org/10.1016/J.DAM.2014.07.017.
  57. Michael J. Fischer and Michael S. Paterson. String-matching and other products. In SCC, pages 113-125, 1974.
  58. Marius Nicolae and Sanguthevar Rajasekaran. On string matching with mismatches. Algorithms, 8(2):248-270, 2015. URL: https://doi.org/10.3390/a8020248.
  59. Marius Nicolae and Sanguthevar Rajasekaran. On pattern matching with k mismatches and few don't cares. Inf. Process. Lett., 118:78-82, 2017. URL: https://doi.org/10.1016/j.ipl.2016.10.003.
  60. Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Fully Dynamic Data Structure for LCE Queries in Compressed Space. In Proc. of MFCS 2016, volume 58 of LIPIcs, pages 72:1-72:14, 2016. URL: https://doi.org/10.4230/LIPIcs.MFCS.2016.72.
  61. Nicola Prezza. Optimal substring equality queries with applications to sparse text indexing. ACM Trans. Algorithms, 17(1), December 2021. URL: https://doi.org/10.1145/3426870.
  62. Mihai Pătrașcu. Unifying the landscape of cell-probe lower bounds. SIAM J. Comput., 40(3):827-847, 2011. URL: https://doi.org/10.1137/09075336X.
  63. Yuka Tanimura, Tomohiro I, Hideo Bannai, Shunsuke Inenaga, Simon J. Puglisi, and Masayuki Takeda. Deterministic Sub-Linear Space LCE Data Structures With Efficient Construction. In Proc. of CPM 2016, volume 54 of LIPIcs, pages 1:1-1:10, 2016. URL: https://doi.org/10.4230/LIPIcs.CPM.2016.1.
  64. Yuka Tanimura, Takaaki Nishimoto, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Small-space LCE data structure with constant-time queries. In Proc. of MFCS 2017, 2017. URL: https://doi.org/10.4230/LIPICS.MFCS.2017.10.
  65. Tomohiro I. Longest common extensions with recompression. In Proc. of CPM 2017, volume 78, page 18, 2017. URL: https://doi.org/10.4230/LIPICS.CPM.2017.18.
  66. Dirk Van Gucht, Ryan Williams, David P. Woodruff, and Qin Zhang. The communication complexity of distributed set-joins with applications to matrix multiplication. In Proc. of PODS 2015, pages 199-212, 2015. URL: https://doi.org/10.1145/2745754.2745779.