Construction of Sparse Suffix Trees and LCE Indexes in Optimal Time and Space

Authors: Dmitry Kosolobov, Nikita Sivukhin



File

LIPIcs.CPM.2024.20.pdf
  • Filesize: 0.88 MB
  • 18 pages

Author Details

Dmitry Kosolobov
  • Ural Federal University, Ekaterinburg, Russia
Nikita Sivukhin
  • Ural Federal University, Ekaterinburg, Russia

Cite As

Dmitry Kosolobov and Nikita Sivukhin. Construction of Sparse Suffix Trees and LCE Indexes in Optimal Time and Space. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 20:1-20:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.CPM.2024.20

Abstract

The notions of synchronizing and partitioning sets are recently introduced variants of locally consistent parsings with great potential in problem-solving. In this paper we propose a deterministic algorithm that, given a read-only string of length n over the alphabet {0,1,…,n^{𝒪(1)}}, constructs a variant of a τ-partitioning set of size 𝒪(b) with τ = n/b using 𝒪(b) space and 𝒪(n/ε) time, provided b ≥ n^ε for ε > 0. As a corollary, for b ≥ n^ε and constant ε > 0, we obtain linear-time construction algorithms using 𝒪(b) space on top of the string for two major small-space indexes: a sparse suffix tree, which is a compacted trie built on b chosen suffixes of the string, and a longest common extension (LCE) index, which occupies 𝒪(b) space and computes the longest common prefix of any pair of substrings in 𝒪(n/b) time. For both, the 𝒪(b) construction space is asymptotically optimal, since the tree itself takes 𝒪(b) space and any LCE index with 𝒪(n/b) query time must occupy Ω(b) space by a known trade-off (at least for b ≥ Ω(n / log n)). For arbitrary b ≥ Ω(log² n), we present construction algorithms for the partitioning set, sparse suffix tree, and LCE index with 𝒪(n log_b n) running time and 𝒪(b) space, thus also improving the state of the art.
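
To make the two indexed objects concrete, here is a minimal Python sketch of what they compute. It is a naive baseline and not the construction proposed in the paper: LCE queries are answered by direct scanning in 𝒪(n) time, and the chosen suffixes are sorted by plain string comparison. The function names (`lce`, `sparse_suffix_array`, `sparse_lcp`) and the example string are assumptions made for illustration. The sorted list of chosen suffixes, together with the LCE values of lexicographically adjacent ones, is the information that a sparse suffix tree stores in compacted-trie form.

```python
def lce(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:] (naive scan)."""
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


def sparse_suffix_array(s, positions):
    """Chosen start positions, ordered lexicographically by their suffixes.

    Naive: materializes each suffix as a sort key, so it is not
    space-efficient; it only illustrates the output.
    """
    return sorted(positions, key=lambda p: s[p:])


def sparse_lcp(s, ssa):
    """LCE values between lexicographically adjacent chosen suffixes."""
    return [lce(s, ssa[k - 1], ssa[k]) for k in range(1, len(ssa))]


# Hypothetical example: b = 4 chosen suffixes of an 11-character string.
text = "abracadabra"
chosen = [0, 3, 7, 10]
ssa = sparse_suffix_array(text, chosen)
lcp = sparse_lcp(text, ssa)
```

In this example the chosen suffixes are "abracadabra", "acadabra", "abra", and "a"; the branching depths of the corresponding compacted trie can be read off `lcp`.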

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
Keywords
  • (τ,δ)-partitioning set
  • longest common extension
  • sparse suffix tree
