Enhancing Generalized Compressed Suffix Trees, with Applications

Authors: Sankardeep Chakraborty, Kunihiko Sadakane, Wiktor Zuba



Author Details

Sankardeep Chakraborty
  • The University of Tokyo, Japan
Kunihiko Sadakane
  • The University of Tokyo, Japan
Wiktor Zuba
  • CWI, Amsterdam, The Netherlands

Cite As

Sankardeep Chakraborty, Kunihiko Sadakane, and Wiktor Zuba. Enhancing Generalized Compressed Suffix Trees, with Applications. In 35th International Symposium on Algorithms and Computation (ISAAC 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 322, pp. 18:1-18:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024) https://doi.org/10.4230/LIPIcs.ISAAC.2024.18

Abstract

Generalized suffix trees are data structures for storing and searching a set of strings. Although they solve many string problems efficiently, their space usage can be large relative to the size of the input strings. For a set of strings with n characters in total, a generalized suffix tree uses O(n log n) bits of space, which is much larger than the n log σ bits occupied by the strings themselves, where σ is the alphabet size. Generalized compressed suffix trees use only O(n log σ) bits while supporting the same basic operations as generalized suffix trees. However, some more sophisticated operations require auxiliary data structures of O(n log n) bits, which becomes a bottleneck for applications involving big data. In this paper, we enhance generalized compressed suffix trees while retaining their space efficiency. First, we give an auxiliary data structure of O(n) bits for generalized compressed suffix trees such that, given a suffix s of a string and another string t, we can find the suffix of t that is closest to s. Next, we give an o(n)-bit data structure for finding the ancestor of a node in a (generalized) compressed suffix tree with a given string depth. Finally, we give data structures for a generalization of the document listing problem from arrays to trees. We also show applications of these data structures to suffix-prefix matching problems.
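
For orientation, the array version of document listing that the last contribution generalizes can be sketched as follows. This is a minimal Python sketch of the classical previous-occurrence technique on arrays, not the tree data structure of the paper. The function name `list_documents`, the input layout (an array of document ids), and the naive range-minimum scan are assumptions made for illustration only.

```python
from typing import Dict, List


def list_documents(doc: List[int], lo: int, hi: int) -> List[int]:
    """Return the distinct document ids occurring in doc[lo..hi] (inclusive)."""
    # prev[i] is the index of the previous occurrence of doc[i], or -1 if none.
    # In a real index this array is built once, not per query.
    last: Dict[int, int] = {}
    prev: List[int] = []
    for i, d in enumerate(doc):
        prev.append(last.get(d, -1))
        last[d] = i

    out: List[int] = []

    def recurse(a: int, b: int) -> None:
        if a > b:
            return
        # Naive range-minimum scan over prev[a..b]; a succinct RMQ structure
        # would answer this in constant time (assumption: not shown here).
        m = min(range(a, b + 1), key=prev.__getitem__)
        if prev[m] >= lo:
            # Every id in doc[a..b] already occurs earlier inside [lo, hi].
            return
        out.append(doc[m])  # doc[m] is the first occurrence of its id in [lo, hi]
        recurse(a, m - 1)
        recurse(m + 1, b)

    recurse(lo, hi)
    return out
```

Calling `list_documents([0, 1, 0, 2, 1, 1, 0, 2], 2, 5)` reports each of the ids 0, 1, and 2 exactly once, in no particular order. With a constant-time range-minimum structure in place of the scan, the work per query is proportional to the number of distinct ids reported rather than to the length of the range.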

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • suffix tree
  • compact data structure
  • suffix-prefix query
  • weighted level ancestor
