Linear Time Construction of Cover Suffix Tree and Applications

Radoszewski, Jakub

doi:10.4230/LIPIcs.ESA.2023.89

Abstract

The Cover Suffix Tree (CST) of a string T is the suffix tree of T with additional explicit nodes corresponding to halves of square substrings of T. In the CST an explicit node corresponding to a substring C of T is annotated with two numbers: the number of non-overlapping consecutive occurrences of C and the total number of positions in T that are covered by occurrences of C in T. Kociumaka et al. (Algorithmica, 2015) have shown how to compute the CST of a length-n string in 𝒪(n log n) time. We give an algorithm that computes the same data structure in 𝒪(n) time assuming that T is over an integer alphabet and discuss its implications.

A string C is a cover of text T if occurrences of C in T cover all positions of T; C is a seed of T if occurrences and overhangs (i.e., prefix-suffix occurrences) of C in T cover all positions of T. An α-partial cover (α-partial seed) of text T is a string C whose occurrences in T (occurrences and overhangs in T, respectively) cover at least α positions of T. Kociumaka et al. (Algorithmica, 2015; Theor. Comput. Sci., 2018) have shown that knowing the CST of a length-n string T, one can compute a linear-sized representation of all seeds of T as well as all shortest α-partial covers and seeds in T for a given α in 𝒪(n) time. Thus our result implies linear-time algorithms computing these notions of quasiperiodicity. The resulting algorithm computing seeds is substantially different from the previous one (Kociumaka et al., SODA 2012, ACM Trans. Algorithms, 2020); in particular, it is non-recursive. Kociumaka et al. (Algorithmica, 2015) proposed an 𝒪(n log n)-time algorithm for computing a shortest α-partial cover for each α = 1,…,n; we improve this complexity to 𝒪(n).

Our results are based on a new combinatorial characterization of consecutive overlapping occurrences of a substring S of T in terms of the set of runs (see Kolpakov and Kucherov, FOCS 1999) in T. This new insight also leads to an 𝒪(n)-sized index for reporting overlapping consecutive occurrences of a given pattern P of length m in the optimal 𝒪(m+output) time, where output is the number of occurrences reported. In comparison, a general index for reporting bounded-gap consecutive occurrences of Navarro and Thankachan (Theor. Comput. Sci., 2016) uses 𝒪(n log n) space.
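
To make the definitions above concrete, the following Python sketch computes, by brute force, the number of positions covered by occurrences of a string C in T, tests whether C is a cover or an α-partial cover, and lists overlapping consecutive occurrences. It is only an illustration of the definitions and has nothing to do with the linear-time CST construction of the paper. All function names are hypothetical, positions are 0-based, seeds and overhangs are not handled, and two consecutive occurrences are assumed to overlap when they share at least one position.

```python
def occurrences(text, pattern):
    """Start indices (0-based) of all, possibly overlapping, occurrences."""
    starts = []
    i = text.find(pattern)
    while i != -1:
        starts.append(i)
        i = text.find(pattern, i + 1)
    return starts


def covered_positions(text, pattern):
    """Number of positions of text covered by occurrences of pattern."""
    covered = 0
    prev_end = 0  # first position not yet known to be covered
    for start in occurrences(text, pattern):
        end = start + len(pattern)
        # Count only the part of this occurrence not covered by the previous one.
        covered += end - max(start, prev_end)
        prev_end = end
    return covered


def is_cover(text, pattern):
    """True if occurrences of pattern cover every position of text."""
    return covered_positions(text, pattern) == len(text)


def is_partial_cover(text, pattern, alpha):
    """True if occurrences of pattern cover at least alpha positions of text."""
    return covered_positions(text, pattern) >= alpha


def overlapping_consecutive(text, pattern):
    """Pairs of consecutive occurrences sharing at least one position
    (assumption: this is what 'overlapping' means here)."""
    occ = occurrences(text, pattern)
    return [(a, b) for a, b in zip(occ, occ[1:]) if b - a < len(pattern)]


def shortest_partial_cover(text, alpha):
    """Brute-force search for a shortest alpha-partial cover among the
    substrings of text; polynomial time, not the O(n) method of the paper."""
    n = len(text)
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            candidate = text[start:start + length]
            if covered_positions(text, candidate) >= alpha:
                return candidate
    return None


if __name__ == "__main__":
    t = "abaababaaba"
    print(occurrences(t, "aba"))
    print(covered_positions(t, "aba"), is_cover(t, "aba"))
    print(overlapping_consecutive(t, "aba"))
    print(shortest_partial_cover(t, 8))
```

For T = abaababaaba, the string aba occurs at positions 0, 3, 5 and 8, so it is a cover of T, and only the occurrences at 3 and 5 overlap.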

Amihood Amir, Gad M. Landau, Moshe Lewenstein, and Dina Sokol. Dynamic text and static pattern matching. ACM Transactions on Algorithms, 3(2):19, 2007. URL: https://doi.org/10.1145/1240233.1240242.
Alberto Apostolico and Andrzej Ehrenfeucht. Efficient detection of quasiperiodicities in strings. Theoretical Computer Science, 119(2):247-265, 1993. URL: https://doi.org/10.1016/0304-3975(93)90159-Q.
Alberto Apostolico, Martin Farach, and Costas S. Iliopoulos. Optimal superprimitivity testing for strings. Information Processing Letters, 39(1):17-20, 1991. URL: https://doi.org/10.1016/0020-0190(91)90056-N.
Alberto Apostolico and Franco P. Preparata. Data structures and algorithms for the string statistics problem. Algorithmica, 15(5):481-494, 1996. URL: https://doi.org/10.1007/BF01955046.
Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The "runs" theorem. SIAM Journal on Computing, 46(5):1501-1514, 2017. URL: https://doi.org/10.1137/15M1011032.
Hideo Bannai, Shunsuke Inenaga, and Dominik Köppl. Computing all distinct squares in linear time for integer alphabets. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, volume 78 of LIPIcs, pages 22:1-22:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. URL: https://doi.org/10.4230/LIPIcs.CPM.2017.22.
Djamal Belazzougui, Dmitry Kosolobov, Simon J. Puglisi, and Rajeev Raman. Weighted ancestors in suffix trees revisited. In Paweł Gawrychowski and Tatiana Starikovskaya, editors, 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, volume 191 of LIPIcs, pages 8:1-8:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.CPM.2021.8.
Michael A. Bender and Martin Farach-Colton. The LCA problem revisited. In Gaston H. Gonnet, Daniel Panario, and Alfredo Viola, editors, 4th Latin American Symposium on Theoretical Informatics, LATIN 2000, volume 1776 of Lecture Notes in Computer Science, pages 88-94. Springer, 2000. URL: https://doi.org/10.1007/10719839_9.
Srečko Brlek and Shuo Li. On the number of squares in a finite word, 2022. URL: https://arxiv.org/abs/2204.10204.
Srečko Brlek and Shuo Li. On the number of distinct squares in finite sequences: Some old and new results. In Anna E. Frid and Robert Mercas, editors, 14th International Conference on Combinatorics on Words, WORDS 2023, volume 13899 of Lecture Notes in Computer Science, pages 35-44. Springer, 2023. URL: https://doi.org/10.1007/978-3-031-33180-0_3.
Gerth Stølting Brodal, Rune B. Lyngsø, Anna Östlin, and Christian N. S. Pedersen. Solving the string statistics problem in time O(n log n). In Peter Widmayer, Francisco Triguero Ruiz, Rafael Morales Bueno, Matthew Hennessy, Stephan J. Eidenbenz, and Ricardo Conejo, editors, 29th International Colloquium on Automata, Languages and Programming, ICALP 2002, volume 2380 of Lecture Notes in Computer Science, pages 728-739. Springer, 2002. URL: https://doi.org/10.1007/3-540-45465-9_62.
Gerth Stølting Brodal and Christian N. S. Pedersen. Finding maximal quasiperiodicities in strings. In Raffaele Giancarlo and David Sankoff, editors, 11th Annual Symposium on Combinatorial Pattern Matching, CPM 2000, volume 1848 of Lecture Notes in Computer Science, pages 397-411. Springer, 2000. URL: https://doi.org/10.1007/3-540-45123-4_33.
Mark R. Brown and Robert Endre Tarjan. A fast merging algorithm. Journal of the ACM, 26(2):211-226, 1979. URL: https://doi.org/10.1145/322123.322127.
Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Extracting powers and periods in a word from its runs structure. Theoretical Computer Science, 521:29-41, 2014. URL: https://doi.org/10.1016/j.tcs.2013.11.018.
Patryk Czajka and Jakub Radoszewski. Experimental evaluation of algorithms for computing quasiperiods. Theoretical Computer Science, 854:17-29, 2021. URL: https://doi.org/10.1016/j.tcs.2020.11.033.
Jonas Ellert and Johannes Fischer. Linear time runs over general ordered alphabets. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, volume 198 of LIPIcs, pages 63:1-63:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.ICALP.2021.63.
Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS 1997, pages 137-143. IEEE Computer Society, 1997. URL: https://doi.org/10.1109/SFCS.1997.646102.
Nathan J. Fine and Herbert S. Wilf. Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society, 16(1):109-114, 1965. URL: https://doi.org/10.2307/2034009.
Aviezri S. Fraenkel and Jamie Simpson. How many squares can a string contain? Journal of Combinatorial Theory, Series A, 82(1):112-120, 1998. URL: https://doi.org/10.1006/jcta.1997.2843.
Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. Journal of the ACM, 31(3):538-544, 1984. URL: https://doi.org/10.1145/828.1884.
Moses Ganardi and Paweł Gawrychowski. Pattern matching on grammar-compressed strings in linear time. In Joseph (Seffi) Naor and Niv Buchbinder, editors, Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA 2022, pages 2833-2846. SIAM, 2022. URL: https://doi.org/10.1137/1.9781611977073.110.
Paweł Gawrychowski, Moshe Lewenstein, and Patrick K. Nicholson. Weighted ancestors in suffix trees. In Andreas S. Schulz and Dorothea Wagner, editors, 22nd Annual European Symposium on Algorithms, Wrocław, Poland, ESA 2014, volume 8737 of Lecture Notes in Computer Science, pages 455-466. Springer, 2014. URL: https://doi.org/10.1007/978-3-662-44777-2_38.
John Hershberger. Finding the upper envelope of n line segments in O(n log n) time. Information Processing Letters, 33(4):169-174, 1989. URL: https://doi.org/10.1016/0020-0190(89)90136-1.
Costas S. Iliopoulos, Dennis W. G. Moore, and Kunsoo Park. Covering a string. Algorithmica, 16(3):288-297, 1996. URL: https://doi.org/10.1007/BF01955677.
Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. Journal of the ACM, 53(6):918-936, 2006. URL: https://doi.org/10.1145/1217856.1217858.
Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. A linear time algorithm for seeds computation. In Yuval Rabani, editor, Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, pages 1095-1112. SIAM, 2012. URL: https://doi.org/10.1137/1.9781611973099.86.
Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. A linear-time algorithm for seeds computation. ACM Transactions on Algorithms, 16(2):27:1-27:23, 2020. URL: https://doi.org/10.1145/3386369.
Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Fast algorithm for partial covers in words. Algorithmica, 73(1):217-233, 2015. URL: https://doi.org/10.1007/s00453-014-9915-3.
Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Efficient algorithms for shortest partial seeds in words. Theoretical Computer Science, 710:139-147, 2018. URL: https://doi.org/10.1016/j.tcs.2016.11.035.
Roman M. Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In 40th Annual Symposium on Foundations of Computer Science, FOCS 1999, pages 596-604. IEEE Computer Society, 1999. URL: https://doi.org/10.1109/SFFCS.1999.814634.
Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, 1976. URL: https://doi.org/10.1145/321941.321946.
Dennis W. G. Moore and William F. Smyth. Computing the covers of a string in linear time. In Daniel Dominic Sleator, editor, Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 1994, pages 511-515. ACM/SIAM, 1994. URL: http://dl.acm.org/citation.cfm?id=314464.314636.
Dennis W. G. Moore and William F. Smyth. A correction to "An optimal algorithm to compute all the covers of a string". Information Processing Letters, 54(2):101-103, 1995. URL: https://doi.org/10.1016/0020-0190(94)00235-Q.
Gonzalo Navarro and Sharma V. Thankachan. Reporting consecutive substring occurrences under bounded gap constraints. Theoretical Computer Science, 638:108-111, 2016. URL: https://doi.org/10.1016/j.tcs.2016.02.005.
