Suffix Sorting via Matching Statistics

Lipták, Zsuzsanna; Masillo, Francesco; Puglisi, Simon J.

doi:10.4230/LIPIcs.WABI.2022.20

File

Author Details

Zsuzsanna Lipták

Department of Computer Science, University of Verona, Italy

Francesco Masillo

Department of Computer Science, University of Verona, Italy

Simon J. Puglisi

Helsinki Institute for Information Technology (HIIT), Finland
Department of Computer Science, University of Helsinki, Finland

Cite AsGet BibTex

Zsuzsanna Lipták, Francesco Masillo, and Simon J. Puglisi. Suffix Sorting via Matching Statistics. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 20:1-20:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.WABI.2022.20

Abstract

We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experimental evidence with a prototype implementation (a tool we call sacamats) shows that on string collections with highly similar strings we can construct the suffix array in time competitive with or faster than the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest.

Subject Classification

ACM Subject Classification

Theory of computation → Design and analysis of algorithms

Keywords

Generalized suffix array
matching statistics
string collections
compressed representation
data structures
efficient algorithms

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms, 2(1):53-86, 2004.
Djamal Belazzougui, Fabio Cunial, and Olgert Denas. Fast matching statistics in small space. In Proc. 17th International Symposium on Experimental Algorithms (SEA), volume 103 of LIPIcs, pages 17:1-17:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms Mol. Biol., 14(1):13:1-13:15, 2019.
Rodrigo Cánovas and Gonzalo Navarro. Practical compressed suffix trees. In Proc. of the 9th International Symposium Experimental Algorithms, SEA 2010, volume 6049 of LNCS, pages 94-105. Springer, 2010.
William I. Chang and Eugene L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327-344, 1994.
Johannes Fischer. Combined data structure for previous- and next-smaller-values. Theor. Comput. Sci., 412(22):2451-2456, 2011.
Johannes Fischer and Florian Kurpicz. Dismantling divsufsort. In Proc. of the Prague Stringology Conference 2017, pages 62-76. Department of Theoretical Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2017.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1-2:54, 2020.
Hideo Itoh and Hozumi Tanaka. An efficient method for in memory construction of suffix arrays. In Proc. of the 6th International Symposium on String Processing and Information Retrieval and the 5th International Workshop on Groupware, (SPIRE/CRIWG), pages 81-88. IEEE Computer Society, 1999.
Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi, and Bella Zhukova. Engineering external memory induced suffix sorting. In Proc. of the 19th Workshop on Algorithm Engineering and Experiments, ALENEX 2017, pages 98-108. SIAM, 2017.
Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. In Proc. of the 20th Annual Symposium on Combinatorial Pattern Matching, CPM 2009, volume 5577 of LNCS, pages 181-192. Springer, 2009.
Pang Ko and Srinivas Aluru. Space efficient linear time construction of suffix arrays. J. Discrete Algorithms, 3(2-4):143-156, 2005.
Felipe A. Louza, Simon Gog, and Guilherme P. Telles. Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci., 678:22-39, 2017.
Felipe A. Louza, Guilherme P. Telles, Simon Gog, Nicola Prezza, and Giovanna Rosone. gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms Mol. Biol., 15(1):18, 2020.
Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, 2015. URL: https://doi.org/10.1017/CBO9781139940023.
U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993.
Yuta Mori. Code for divsufsort. URL: https://github.com/y-256/libdivsufsort.
Yuta Mori. Code for sais-lite. URL: https://sites.google.com/site/yuta256/sais.
Ge Nong. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst., 31(3):15, 2013.
Ge Nong, Sen Zhang, and Wai Hong Chan. Two efficient algorithms for linear time suffix array construction. IEEE Trans. Computers, 60(10):1471-1484, 2011.
Enno Ohlebusch. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, 2013. URL: http://www.oldenbusch-verlag.de/.
Enno Ohlebusch, Simon Gog, and Adrian Kügel. Computing matching statistics and maximal exact matches on compressed full-text indexes. In Proc. of the 17th International Symposium on String Processing and Information Retrieval, SPIRE 2010, volume 6393 of LNCS, pages 347-358. Springer, 2010.
Simon J. Puglisi, William F. Smyth, and Andrew Turpin. A taxonomy of suffix array construction algorithms. ACM Comput. Surv., 39(2):4, 2007.
Simon J. Puglisi and Bella Zhukova. Relative lempel-ziv compression of suffix arrays. In Proc. of the 27th International Symposium on String Processing and Information Retrieval, SPIRE 2020, volume 12303 of LNCS, pages 89-96. Springer, 2020.
Massimiliano Rossi, Marco Oliva, Paola Bonizzoni, Ben Langmead, Travis Gagie, and Christina Boucher. Finding maximal exact matches using the r-index. J. Comput. Biol., 29(2):188-194, 2022.
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68-74, 2015.
Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Inf. Process. Lett., 17(2):81-84, 1983.

Suffix Sorting via Matching Statistics

Authors Zsuzsanna Lipták , Francesco Masillo , Simon J. Puglisi

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message

Suffix Sorting via Matching Statistics

Authors Zsuzsanna Lipták , Francesco Masillo , Simon J. Puglisi

File

Document Identifiers

Author Details

Funding

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Supplementary Materials

References

Thanks for your feedback!

Could not send message