Matching Statistics Speed up BWT Construction

Author Francesco Masillo




Author Details

Francesco Masillo
  • Department of Computer Science, University of Verona, Italy

Acknowledgements

I want to thank Sara Giuliani for listening and discussing the preliminary ideas contained in this paper. I also want to thank Zsuzsanna Lipták for giving helpful feedback during the writing of this paper.

Cite As

Francesco Masillo. Matching Statistics Speed up BWT Construction. In 31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 83:1-83:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.ESA.2023.83

Abstract

Due to the exponential growth of genomic data, constructing dedicated data structures has become the principal bottleneck in common bioinformatics applications. In particular, the Burrows-Wheeler Transform (BWT) is the basis of some of the most popular self-indexes for genomic data, due to its known favourable behaviour on repetitive data.
Some tools that exploit the intrinsic repetitiveness of biological data have risen in popularity, due to their speed and low space consumption. We introduce a new algorithm for computing the BWT, which takes advantage of the redundancy of the data through a compressed version of matching statistics, the CMS of [Lipták et al., WABI 2022]. We show that it suffices to sort a small subset of suffixes, lowering both computation time and space. Our result is due to a new insight which links the so-called insert-heads of [Lipták et al., WABI 2022] to the well-known run boundaries of the BWT.
We give two implementations of our algorithm, called CMS-BWT, both competitive in our experimental validation on highly repetitive real-life datasets. In most cases, they outperform other tools with respect to running time, at the cost of a higher memory footprint, which, however, is still considerably smaller than the total size of the input data.
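
For readers unfamiliar with the objects named in the abstract, the following is a minimal, naive sketch of the BWT, its run boundaries, and (length-only) matching statistics. It is not the CMS-BWT algorithm and does not implement the CMS of [Lipták et al., WABI 2022]: it sorts all suffixes and uses quadratic-time substring checks, which is exactly the cost the paper aims to avoid. The function names, the `$` sentinel, and the toy strings are assumptions made for illustration only.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort ALL suffixes, then take the preceding character.

    Assumes the sentinel is smaller than every character in text.
    """
    t = text + sentinel
    suffix_array = sorted(range(len(t)), key=lambda i: t[i:])
    # t[i - 1] wraps around to the sentinel when i == 0.
    return "".join(t[i - 1] for i in suffix_array)


def run_starts(b: str) -> list[int]:
    """Positions where a new run of equal characters begins in b."""
    return [i for i in range(len(b)) if i == 0 or b[i] != b[i - 1]]


def matching_statistics(ref: str, s: str) -> list[int]:
    """For each position i of s, the length of the longest prefix of s[i:]
    that occurs somewhere in ref (lengths only, computed naively)."""
    ms = []
    for i in range(len(s)):
        length = 0
        while i + length < len(s) and s[i:i + length + 1] in ref:
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    b = bwt("banana")
    print(b)                                       # annb$aa
    print(run_starts(b))                           # [0, 1, 3, 4, 5]
    print(matching_statistics("banana", "nanab"))  # [4, 3, 2, 1, 1]
```

On repetitive inputs the BWT has few runs relative to its length, and long matching-statistics values indicate that a string largely repeats material already present in the reference; the abstract's claim is that this redundancy lets the algorithm sort only a small subset of suffixes.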

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
Keywords
  • Burrows-Wheeler Transform
  • matching statistics
  • string collections
  • compressed representation
  • data structures
  • efficient algorithms


References

  1. Uwe Baier. Linear-time suffix sorting - A new approach for suffix array construction. In Proc. of the 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016), volume 54 of LIPIcs, pages 23:1-23:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016.
  2. Markus J. Bauer, Anthony J. Cox, and Giovanna Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci., 483:134-148, 2013.
  3. Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. r-indexing the eBWT. In Proc. of the 28th International Symposium on String Processing and Information Retrieval (SPIRE 2021), volume 12944 of Lecture Notes in Computer Science, pages 3-12. Springer, 2021.
  4. Christina Boucher, Ondrej Cvacho, Travis Gagie, Jan Holub, Giovanni Manzini, Gonzalo Navarro, and Massimiliano Rossi. PFP compressed suffix trees. In Proc. of the Symposium on Algorithm Engineering and Experiments (ALENEX 2021), pages 60-72. SIAM, 2021.
  5. Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms Mol. Biol., 14(1):13:1-13:15, 2019.
  6. Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, DIGITAL System Research Center, 1994.
  7. Rodrigo Cánovas and Gonzalo Navarro. Practical compressed suffix trees. In Proc. of the 9th International Symposium Experimental Algorithms, SEA 2010, volume 6049 of LNCS, pages 94-105. Springer, 2010.
  8. William I. Chang and Eugene L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327-344, 1994.
  9. Fabio Cunial, Olgert Denas, and Djamal Belazzougui. Fast and compact matching statistics analytics. Bioinform., 38(7):1838-1845, 2022.
  10. Diego Díaz-Domínguez and Gonzalo Navarro. Efficient construction of the BWT for repetitive text using string compression. In Proc. of 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022), volume 223 of LIPIcs, pages 29:1-29:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022.
  11. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proc. of the 41st Annual Symposium on Foundations of Computer Science (FOCS 2000), pages 390-398. IEEE Computer Society, 2000.
  12. Johannes Fischer. Combined data structure for previous- and next-smaller-values. Theor. Comput. Sci., 412(22):2451-2456, 2011.
  13. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1-2:54, 2020.
  14. Sara Giuliani, Giuseppe Romana, and Massimiliano Rossi. Computing maximal unique matches with the r-index. In Proc. of the 20th International Symposium on Experimental Algorithms (SEA 2022), volume 233 of LIPIcs, pages 22:1-22:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022.
  15. Keisuke Goto. Optimal time and space construction of suffix arrays and LCP arrays for integer alphabets. In Proc. of the Prague Stringology Conference 2019, pages 111-125. Czech Technical University in Prague, Faculty of Information Technology, Department of Theoretical Computer Science, 2019.
  16. Ilya Grebnov. Code for libsais. URL: https://github.com/IlyaGrebnov/libsais.
  17. Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. In Proc. of the 20th Annual Symposium on Combinatorial Pattern Matching (CPM 2009), volume 5577 of LNCS, pages 181-192. Springer, 2009.
  18. Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4):357-359, 2012.
  19. Heng Li. Fast construction of FM-index for long sequence reads. Bioinform., 30(22):3274-3275, 2014.
  20. Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinform., 26(5):589-595, 2010.
  21. Zhize Li, Jian Li, and Hongwei Huo. Optimal in-place suffix sorting. Inf. Comput., 285(Part):104818, 2022.
  22. Zsuzsanna Lipták, Francesco Masillo, and Simon J. Puglisi. Suffix sorting via matching statistics. In Proc. of the 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022), volume 242 of LIPIcs, pages 20:1-20:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022.
  23. Guillaume Marçais, Arthur L. Delcher, Adam M. Phillippy, Rachel Coston, Steven L. Salzberg, and Aleksey V. Zimin. Mummer4: A fast and versatile genome alignment system. PLoS Comput. Biol., 14(1), 2018.
  24. Ge Nong. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst., 31(3):15, 2013.
  25. Ge Nong, Sen Zhang, and Wai Hong Chan. Two efficient algorithms for linear time suffix array construction. IEEE Trans. Computers, 60(10):1471-1484, 2011.
  26. Marco Oliva, Travis Gagie, and Christina Boucher. Recursive prefix-free parsing for building big BWTs. bioRxiv, pages 2023-01, 2023.
  27. Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches. J. Comput. Biol., 29(2):169-187, 2022.
  28. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68-74, 2015.
  29. Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Inf. Process. Lett., 17(2):81-84, 1983.