Linear Time Construction of Cover Suffix Tree and Applications

Author: Jakub Radoszewski (https://orcid.org/0000-0002-0067-6401)
Affiliations: University of Warsaw, Poland; Samsung R&D, Warsaw, Poland
Published in: Leibniz International Proceedings in Informatics, ISSN 1868-8969, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Type: article
Date: 2023-08-30
Pages: 89:1-89:17
DOI: 10.4230/LIPIcs.ESA.2023.89
Language: English
PDF: https://drops.dagstuhl.de/storage/00lipics/lipics-vol274-esa2023/LIPIcs.ESA.2023.89/LIPIcs.ESA.2023.89.pdf
Keywords: cover (quasiperiod), seed, suffix tree, run (maximal repetition)

Abstract:

The Cover Suffix Tree (CST) of a string T is the suffix tree of T with additional explicit nodes corresponding to halves of square substrings of T. In the CST an explicit node corresponding to a substring C of T is annotated with two numbers: the number of non-overlapping consecutive occurrences of C and the total number of positions in T that are covered by occurrences of C in T. Kociumaka et al. (Algorithmica, 2015) have shown how to compute the CST of a length-n string in 𝒪(n log n) time. We give an algorithm that computes the same data structure in 𝒪(n) time assuming that T is over an integer alphabet and discuss its implications.

A string C is a cover of text T if occurrences of C in T cover all positions of T; C is a seed of T if occurrences and overhangs (i.e., prefix-suffix occurrences) of C in T cover all positions of T. An α-partial cover (α-partial seed) of text T is a string C whose occurrences in T (occurrences and overhangs in T, respectively) cover at least α positions of T. Kociumaka et al. (Algorithmica, 2015; Theor. Comput. Sci., 2018) have shown that knowing the CST of a length-n string T, one can compute a linear-sized representation of all seeds of T as well as all shortest α-partial covers and seeds in T for a given α in 𝒪(n) time. Thus our result implies linear-time algorithms computing these notions of quasiperiodicity. The resulting algorithm computing seeds is substantially different from the previous one (Kociumaka et al., SODA 2012, ACM Trans. Algorithms, 2020); in particular, it is non-recursive. Kociumaka et al. (Algorithmica, 2015) proposed an 𝒪(n log n)-time algorithm for computing a shortest α-partial cover for each α = 1,…,n; we improve this complexity to 𝒪(n).

Our results are based on a new combinatorial characterization of consecutive overlapping occurrences of a substring S of T in terms of the set of runs (see Kolpakov and Kucherov, FOCS 1999) in T. This new insight also leads to an 𝒪(n)-sized index for reporting overlapping consecutive occurrences of a given pattern P of length m in the optimal 𝒪(m+output) time, where output is the number of occurrences reported. In comparison, a general index for reporting bounded-gap consecutive occurrences of Navarro and Thankachan (Theor. Comput. Sci., 2016) uses 𝒪(n log n) space.
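
The definitions of cover, seed, and α-partial cover can be made concrete with a brute-force sketch. This is only an illustration of the definitions, not the linear-time algorithm of the paper: it uses no suffix tree, runs in polynomial rather than linear time, and all function names are hypothetical. It also assumes that an overhang is a proper suffix of C matching a prefix of T (or a proper prefix of C matching a suffix of T), and that a seed is no longer than T.

```python
def occurrences(text, pattern):
    """Start positions of all (possibly overlapping) occurrences of pattern."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text.startswith(pattern, i)]


def covered_positions(text, pattern, with_overhangs=False):
    """Set of positions of text covered by occurrences of a non-empty pattern.

    With with_overhangs=True, positions covered by overhangs are included too
    (assumed here to mean proper prefix-suffix matches at either end of text).
    """
    n, m = len(text), len(pattern)
    covered = set()
    for i in occurrences(text, pattern):
        covered.update(range(i, i + m))
    if with_overhangs:
        for k in range(1, min(m, n + 1)):      # 1 <= k < m and k <= n
            if text[:k] == pattern[m - k:]:    # suffix of C hangs over the left end
                covered.update(range(k))
            if text[n - k:] == pattern[:k]:    # prefix of C hangs over the right end
                covered.update(range(n - k, n))
    return covered


def is_cover(text, pattern):
    return len(covered_positions(text, pattern)) == len(text)


def is_partial_cover(text, pattern, alpha):
    return len(covered_positions(text, pattern)) >= alpha


def is_seed(text, pattern):
    # Assumption: a seed is required to be no longer than the text.
    if len(pattern) > len(text):
        return False
    return len(covered_positions(text, pattern, with_overhangs=True)) == len(text)


def shortest_partial_cover(text, alpha):
    """Some shortest substring of text covering at least alpha positions."""
    n = len(text)
    for length in range(1, n + 1):
        seen = set()
        for i in range(n - length + 1):
            candidate = text[i:i + length]
            if candidate in seen:
                continue
            seen.add(candidate)
            if is_partial_cover(text, candidate, alpha):
                return candidate
    return None


T = "abaababaaba"
print(is_cover(T, "aba"))                 # True: occurrences at 0, 3, 5, 8
print(len(covered_positions(T, "ab")))    # 8, so "ab" is an 8-partial cover
print(is_seed("baababaab", "aba"))        # True, although "aba" is not a cover
print(shortest_partial_cover(T, 11))      # aba
```

In the example, "aba" is not a cover of "baababaab" because the text neither starts nor ends with "aba", but the overhangs "ba" on the left and "ab" on the right cover the remaining positions, which makes it a seed under the assumptions above.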