Matching Statistics Speed up BWT Construction

Masillo, Francesco

doi:10.4230/LIPIcs.ESA.2023.83

Subject Classification

ACM Subject Classification

Theory of computation → Design and analysis of algorithms

Keywords

Burrows-Wheeler Transform
matching statistics
string collections
compressed representation
data structures
efficient algorithms

Abstract

Due to the exponential growth of genomic data, constructing dedicated data structures has become the principal bottleneck in common bioinformatics applications. In particular, the Burrows-Wheeler Transform (BWT) is the basis of some of the most popular self-indexes for genomic data, due to its known favourable behaviour on repetitive data. Some tools that exploit the intrinsic repetitiveness of biological data have risen in popularity, due to their speed and low space consumption. We introduce a new algorithm for computing the BWT, which takes advantage of the redundancy of the data through a compressed version of matching statistics, the CMS of [Lipták et al., WABI 2022]. We show that it suffices to sort a small subset of suffixes, lowering both computation time and space. Our result is due to a new insight which links the so-called insert-heads of [Lipták et al., WABI 2022] to the well-known run boundaries of the BWT. We give two implementations of our algorithm, called CMS-BWT, both competitive in our experimental validation on highly repetitive real-life datasets. In most cases, they outperform other tools in running time, at the cost of a higher memory footprint, which is nevertheless considerably smaller than the total size of the input data.
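For readers unfamiliar with the objects named in the abstract, the following is a minimal Python sketch of the textbook definitions only: the BWT obtained by naively sorting all suffixes, the number of runs in the BWT, and the lengths of plain (uncompressed) matching statistics of a string against a reference. It is not the CMS-BWT algorithm and does not implement the CMS or insert-heads; the function names, the `$` sentinel, and the quadratic-time approach are assumptions made purely for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort every suffix of text + sentinel and output, for each
    suffix in sorted order, the character that precedes it (cyclically).
    Assumes the sentinel does not occur in text and sorts before all of
    its characters."""
    s = text + sentinel
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # s[i - 1] with i == 0 wraps around to the sentinel.
    return "".join(s[i - 1] for i in suffix_array)


def run_count(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])


def matching_statistics(reference: str, s: str) -> list[int]:
    """For each position i of s, the length of the longest prefix of s[i:]
    that occurs somewhere in reference (lengths only, no positions)."""
    lengths = []
    for i in range(len(s)):
        length = 0
        # Extend the match one character at a time while it still occurs.
        while i + length < len(s) and s[i:i + length + 1] in reference:
            length += 1
        lengths.append(length)
    return lengths


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)                           # annb$aa
    print(run_count(transformed))                # 5
    print(matching_statistics("banana", "nab"))  # [2, 1, 1]
```

The naive version sorts every suffix, which is exactly the cost the paper aims to avoid; according to the abstract, the proposed algorithm needs to sort only a small subset of suffixes.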

Cite As

Francesco Masillo. Matching Statistics Speed up BWT Construction. In 31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 83:1-83:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.ESA.2023.83

Author Details

Francesco Masillo

Department of Computer Science, University of Verona, Italy

References

Uwe Baier. Linear-time suffix sorting - A new approach for suffix array construction. In Proc. of the 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016), volume 54 of LIPIcs, pages 23:1-23:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016.
Markus J. Bauer, Anthony J. Cox, and Giovanna Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci., 483:134-148, 2013.
Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. r-indexing the eBWT. In Proc. of the 28th International Symposium on String Processing and Information Retrieval (SPIRE 2021), volume 12944 of Lecture Notes in Computer Science, pages 3-12. Springer, 2021.
Christina Boucher, Ondrej Cvacho, Travis Gagie, Jan Holub, Giovanni Manzini, Gonzalo Navarro, and Massimiliano Rossi. PFP compressed suffix trees. In Proc. of the Symposium on Algorithm Engineering and Experiments (ALENEX 2021), pages 60-72. SIAM, 2021.
Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms Mol. Biol., 14(1):13:1-13:15, 2019.
Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, DIGITAL System Research Center, 1994.
Rodrigo Cánovas and Gonzalo Navarro. Practical compressed suffix trees. In Proc. of the 9th International Symposium Experimental Algorithms, SEA 2010, volume 6049 of LNCS, pages 94-105. Springer, 2010.
William I. Chang and Eugene L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327-344, 1994.
Fabio Cunial, Olgert Denas, and Djamal Belazzougui. Fast and compact matching statistics analytics. Bioinform., 38(7):1838-1845, 2022.
Diego Díaz-Domínguez and Gonzalo Navarro. Efficient construction of the BWT for repetitive text using string compression. In Proc. of 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022), volume 223 of LIPIcs, pages 29:1-29:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022.
Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proc. of the 41st Annual Symposium on Foundations of Computer Science (FOCS 2000), pages 390-398. IEEE Computer Society, 2000.
Johannes Fischer. Combined data structure for previous- and next-smaller-values. Theor. Comput. Sci., 412(22):2451-2456, 2011.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1-2:54, 2020.
Sara Giuliani, Giuseppe Romana, and Massimiliano Rossi. Computing maximal unique matches with the r-index. In Proc. of the 20th International Symposium on Experimental Algorithms (SEA 2022), volume 233 of LIPIcs, pages 22:1-22:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022.
Keisuke Goto. Optimal time and space construction of suffix arrays and LCP arrays for integer alphabets. In Proc. of the Prague Stringology Conference 2019, pages 111-125. Czech Technical University in Prague, Faculty of Information Technology, Department of Theoretical Computer Science, 2019.
Ilya Grebnov. Code for libsais. URL: https://github.com/IlyaGrebnov/libsais.
Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. In Proc. of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM 2009), volume 5577 of LNCS, pages 181-192. Springer, 2009.
Ben Langmead and Steven L. Salzberg. Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4):357-359, 2012.
Heng Li. Fast construction of FM-index for long sequence reads. Bioinform., 30(22):3274-3275, 2014.
Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinform., 26(5):589-595, 2010.
Zhize Li, Jian Li, and Hongwei Huo. Optimal in-place suffix sorting. Inf. Comput., 285(Part):104818, 2022.
Zsuzsanna Lipták, Francesco Masillo, and Simon J. Puglisi. Suffix sorting via matching statistics. In Proc. of the 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022), volume 242 of LIPIcs, pages 20:1-20:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022.
Guillaume Marçais, Arthur L. Delcher, Adam M. Phillippy, Rachel Coston, Steven L. Salzberg, and Aleksey V. Zimin. MUMmer4: A fast and versatile genome alignment system. PLoS Comput. Biol., 14(1), 2018.
Ge Nong. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst., 31(3):15, 2013.
Ge Nong, Sen Zhang, and Wai Hong Chan. Two efficient algorithms for linear time suffix array construction. IEEE Trans. Computers, 60(10):1471-1484, 2011.
Marco Oliva, Travis Gagie, and Christina Boucher. Recursive prefix-free parsing for building big BWTs. bioRxiv, pages 2023-01, 2023.
Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches. J. Comput. Biol., 29(2):169-187, 2022.
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68-74, 2015.
Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Inf. Process. Lett., 17(2):81-84, 1983.
