Suffix Sorting via Matching Statistics

Authors Zsuzsanna Lipták , Francesco Masillo , Simon J. Puglisi



PDF
Thumbnail PDF

File

LIPIcs.WABI.2022.20.pdf
  • Filesize: 0.96 MB
  • 15 pages

Document Identifiers

Author Details

Zsuzsanna Lipták
  • Department of Computer Science, University of Verona, Italy
Francesco Masillo
  • Department of Computer Science, University of Verona, Italy
Simon J. Puglisi
  • Helsinki Institute for Information Technology (HIIT), Finland
  • Department of Computer Science, University of Helsinki, Finland

Cite AsGet BibTex

Zsuzsanna Lipták, Francesco Masillo, and Simon J. Puglisi. Suffix Sorting via Matching Statistics. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 20:1-20:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.WABI.2022.20

Abstract

We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experimental evidence with a prototype implementation (a tool we call sacamats) shows that on string collections with highly similar strings we can construct the suffix array in time competitive with or faster than the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest.

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
Keywords
  • Generalized suffix array
  • matching statistics
  • string collections
  • compressed representation
  • data structures
  • efficient algorithms

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms, 2(1):53-86, 2004. Google Scholar
  2. Djamal Belazzougui, Fabio Cunial, and Olgert Denas. Fast matching statistics in small space. In Proc. 17th International Symposium on Experimental Algorithms (SEA), volume 103 of LIPIcs, pages 17:1-17:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018. Google Scholar
  3. Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms Mol. Biol., 14(1):13:1-13:15, 2019. Google Scholar
  4. Rodrigo Cánovas and Gonzalo Navarro. Practical compressed suffix trees. In Proc. of the 9th International Symposium Experimental Algorithms, SEA 2010, volume 6049 of LNCS, pages 94-105. Springer, 2010. Google Scholar
  5. William I. Chang and Eugene L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327-344, 1994. Google Scholar
  6. Johannes Fischer. Combined data structure for previous- and next-smaller-values. Theor. Comput. Sci., 412(22):2451-2456, 2011. Google Scholar
  7. Johannes Fischer and Florian Kurpicz. Dismantling divsufsort. In Proc. of the Prague Stringology Conference 2017, pages 62-76. Department of Theoretical Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2017. Google Scholar
  8. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1-2:54, 2020. Google Scholar
  9. Hideo Itoh and Hozumi Tanaka. An efficient method for in memory construction of suffix arrays. In Proc. of the 6th International Symposium on String Processing and Information Retrieval and the 5th International Workshop on Groupware, (SPIRE/CRIWG), pages 81-88. IEEE Computer Society, 1999. Google Scholar
  10. Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi, and Bella Zhukova. Engineering external memory induced suffix sorting. In Proc. of the 19th Workshop on Algorithm Engineering and Experiments, ALENEX 2017, pages 98-108. SIAM, 2017. Google Scholar
  11. Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. In Proc. of the 20th Annual Symposium on Combinatorial Pattern Matching, CPM 2009, volume 5577 of LNCS, pages 181-192. Springer, 2009. Google Scholar
  12. Pang Ko and Srinivas Aluru. Space efficient linear time construction of suffix arrays. J. Discrete Algorithms, 3(2-4):143-156, 2005. Google Scholar
  13. Felipe A. Louza, Simon Gog, and Guilherme P. Telles. Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci., 678:22-39, 2017. Google Scholar
  14. Felipe A. Louza, Guilherme P. Telles, Simon Gog, Nicola Prezza, and Giovanna Rosone. gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms Mol. Biol., 15(1):18, 2020. Google Scholar
  15. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, 2015. URL: https://doi.org/10.1017/CBO9781139940023.
  16. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993. Google Scholar
  17. Yuta Mori. Code for divsufsort. URL: https://github.com/y-256/libdivsufsort.
  18. Yuta Mori. Code for sais-lite. URL: https://sites.google.com/site/yuta256/sais.
  19. Ge Nong. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst., 31(3):15, 2013. Google Scholar
  20. Ge Nong, Sen Zhang, and Wai Hong Chan. Two efficient algorithms for linear time suffix array construction. IEEE Trans. Computers, 60(10):1471-1484, 2011. Google Scholar
  21. Enno Ohlebusch. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, 2013. URL: http://www.oldenbusch-verlag.de/.
  22. Enno Ohlebusch, Simon Gog, and Adrian Kügel. Computing matching statistics and maximal exact matches on compressed full-text indexes. In Proc. of the 17th International Symposium on String Processing and Information Retrieval, SPIRE 2010, volume 6393 of LNCS, pages 347-358. Springer, 2010. Google Scholar
  23. Simon J. Puglisi, William F. Smyth, and Andrew Turpin. A taxonomy of suffix array construction algorithms. ACM Comput. Surv., 39(2):4, 2007. Google Scholar
  24. Simon J. Puglisi and Bella Zhukova. Relative lempel-ziv compression of suffix arrays. In Proc. of the 27th International Symposium on String Processing and Information Retrieval, SPIRE 2020, volume 12303 of LNCS, pages 89-96. Springer, 2020. Google Scholar
  25. Massimiliano Rossi, Marco Oliva, Paola Bonizzoni, Ben Langmead, Travis Gagie, and Christina Boucher. Finding maximal exact matches using the r-index. J. Comput. Biol., 29(2):188-194, 2022. Google Scholar
  26. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68-74, 2015. Google Scholar
  27. Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Inf. Process. Lett., 17(2):81-84, 1983. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail