Representing the Suffix Tree with the CDAWG

Authors Djamal Belazzougui, Fabio Cunial



PDF
Thumbnail PDF

File

LIPIcs.CPM.2017.7.pdf
  • Filesize: 0.66 MB
  • 13 pages

Document Identifiers

Author Details

Djamal Belazzougui
Fabio Cunial

Cite As Get BibTex

Djamal Belazzougui and Fabio Cunial. Representing the Suffix Tree with the CDAWG. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 7:1-7:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.CPM.2017.7

Abstract

Given a string T, it is known that its suffix tree can be represented using the compact directed acyclic word graph (CDAWG) with e_T arcs, taking overall O(e_T+e_REV(T)) words of space, where REV(T) is the reverse of T, and supporting some key operations in time between O(1) and O(log(log(n))) in the worst case. This representation is especially appealing for highly repetitive strings, like collections of similar genomes or of version-controlled documents, in which e_T grows sublinearly in the length of T in practice. In this paper we augment such representation, supporting a number of additional queries in worst-case time between O(1) and O(log(n)) in the RAM model, without increasing space complexity asymptotically. Our technique, based on a heavy path decomposition of the suffix tree, enables also a representation of the suffix array, of the inverse suffix array, and of T itself, that takes O(e_T) words of space, and that supports random access in O(log(n)) time. Furthermore, we establish a connection between the reversed CDAWG of T and a context-free grammar that produces T and only T, which might have independent interest.

Subject Classification

Keywords
  • CDAWG
  • suffix tree
  • heavy path decomposition
  • maximal repeat
  • context-free grammar

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Andrés Abeliuk, Rodrigo Cánovas, and Gonzalo Navarro. Practical compressed suffix trees. Algorithms, 6(2):319-351, 2013. URL: http://dx.doi.org/10.3390/a6020319.
  2. Amihood Amir, Gad M. Landau, Moshe Lewenstein, and Dina Sokol. Dynamic text and static pattern matching. ACM Trans. Algorithms, 3(2):19, May 2007. URL: http://dx.doi.org/10.1145/1240233.1240242.
  3. Djamal Belazzougui, Patrick Hagge Cording, Simon J. Puglisi, and Yasuo Tabei. Access, rank, and select in grammar-compressed strings. In Nikhil Bansal and Irene Finocchi, editors, Proceedings of the 23rd Annual European Symposium on Algorithms (ESA 2015), volume 9294 of LNCS, pages 142-154. Springer, 2015. URL: http://dx.doi.org/10.1007/978-3-662-48350-3_13.
  4. Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, and Mathieu Raffinot. Composite repetition-aware data structures. In Ferdinando Cicalese, Ely Porat, and Ugo Vaccaro, editors, Proceedings of the 26th Annual Symposium on Combinatorial Pattern Matching (CPM 2015), volume 9133 of LNCS, pages 26-39. Springer, Springer, 2015. URL: http://dx.doi.org/10.1007/978-3-319-19929-0_3.
  5. Michael A. Bender and Martin Farach-Colton. The level ancestor problem simplified. Theor. Comput. Sci., 321(1):5-12, 2004. URL: http://dx.doi.org/10.1016/j.tcs.2003.05.002.
  6. Omer Berkman and Uzi Vishkin. Finding level-ancestors in trees. J. Comput. Syst. Sci., 48(2):214-230, 1994. URL: http://dx.doi.org/10.1016/S0022-0000(05)80002-9.
  7. Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar-compressed strings and trees. SIAM J. Comput., 44(3):513-539, 2015. URL: http://dx.doi.org/10.1137/130936889.
  8. Anselm Blumer, Janet Blumer, David Haussler, Ross McConnell, and Andrzej Ehrenfeucht. Complete inverted files for efficient text retrieval and analysis. J. ACM, 34(3):578-595, 1987. URL: http://dx.doi.org/10.1145/28869.28873.
  9. Maxime Crochemore and Renaud Vérin. Direct construction of compact directed acyclic word graphs. In Alberto Apostolico and Jotun Hein, editors, Proceedings of the 8th Annual Symposium on Combinatorial Pattern Matching (CPM 1997), volume 1264 of LNCS, pages 116-129. Springer, 1997. URL: http://dx.doi.org/10.1007/3-540-63220-4_55.
  10. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552-581, 2005. URL: http://dx.doi.org/10.1145/1082036.1082039.
  11. Travis Gagie. Large alphabets and incompressibility. Inf. Process. Lett., 99(6):246-251, 2006. URL: http://dx.doi.org/10.1016/j.ipl.2006.04.008.
  12. Rodrigo González, Gonzalo Navarro, and Héctor Ferrada. Locally compressed suffix arrays. ACM J. Exp. Algorithmics, 19:1-1, 2015. URL: http://dx.doi.org/10.1145/2594408.
  13. Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997. URL: http://dx.doi.org/10.1017/CBO9780511574931.
  14. Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13(2):338-355, 1984. URL: http://dx.doi.org/10.1137/0213024.
  15. Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. In Alberto Apostolico, Maxime Crochemore, and Kunsoo Park, editors, Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching (CPM 2005), volume 3537 of LNCS, pages 45-56. Springer, Springer, 2005. URL: http://dx.doi.org/10.1007/11496656_5.
  16. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol., 17(3):281-308, 2010. URL: http://dx.doi.org/10.1089/cmb.2009.0169.
  17. Gonzalo Navarro and Alberto Ordóñez Pereira. Faster compressed suffix trees for repetitive text collections. In Joachim Gudmundsson and Jyrki Katajainen, editors, Proceedings of the 13th International Symposium on Experimental Algorithms (SEA 2014), volume 8504 of LNCS, pages 424-435. Springer, 2014. URL: http://dx.doi.org/10.1007/978-3-319-07959-2_36.
  18. Gonzalo Navarro and Luís M. S. Russo. Fast fully-compressed suffix trees. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, Proceedings of the 2014 Data Compression Conference (DCC 2014), pages 283-291. IEEE, IEEE, 2014. URL: http://dx.doi.org/10.1109/DCC.2014.40.
  19. Gonzalo Navarro and Kunihiko Sadakane. Fully functional static and dynamic succinct trees. ACM Trans. Algorithms, 10(3):16, 2014. URL: http://dx.doi.org/10.1145/2601073.
  20. Mathieu Raffinot. On maximal repeats in strings. Inf. Process. Lett., 80(3):165-169, 2001. URL: http://dx.doi.org/10.1016/S0020-0190(01)00152-1.
  21. Luís M. S. Russo, Gonzalo Navarro, and Arlindo L. Oliveira. Fully compressed suffix trees. ACM Trans. Algorithms, 7(4):53:1-53:34, 2011. URL: http://dx.doi.org/10.1145/2000807.2000821.
  22. Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory Comput. Syst., 41(4):589-607, 2007. URL: http://dx.doi.org/10.1007/S00224-006-1198-X.
  23. Jouni Sirén, Niko Välimäki, Veli Mäkinen, and Gonzalo Navarro. Run-length compressed indexes are superior for highly repetitive sequence collections. In Amihood Amir, Andrew Turpin, and Alistair Moffat, editors, Proceedings of the 15th International Symposium on String Processing and Information Retrieval (SPIRE 2008), volume 5280 of LNCS, pages 164-175. Springer, 2008. URL: http://dx.doi.org/10.1007/978-3-540-89097-3_17.
  24. Alexander J. Smola and S. V. N. Vishwanathan. Fast kernels for string and tree matching. In Suzanna Becker, Sebastan Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems (NIPS 2002), volume 15, pages 585-592. MIT Press, 2002. URL: http://papers.nips.cc/paper/2272-fast-kernels-for-string-and-tree-matching.pdf.
  25. Elad Verbin and Wei Yu. Data structure lower bounds on random access to grammar-compressed strings. In Johannes Fischer and Peter Sanders, editors, Proceedings of the 24th Annual Symposium on Combinatorial Pattern Matching (CPM 2013), volume 7922 of LNCS, pages 247-258. Springer, Springer, 2013. URL: http://dx.doi.org/10.1007/978-3-642-38905-4_24.
  26. Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Inf. Process. Lett., 17(2):81-84, 1983. URL: http://dx.doi.org/10.1016/0020-0190(83)90075-3.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail