Efficient Construction of the BWT for Repetitive Text Using String Compression

Authors Diego Díaz-Domínguez, Gonzalo Navarro



PDF
Thumbnail PDF

File

LIPIcs.CPM.2022.29.pdf
  • Filesize: 0.79 MB
  • 18 pages

Document Identifiers

Author Details

Diego Díaz-Domínguez
  • Department of Computer Science, University of Helsinki, Finland
Gonzalo Navarro
  • CeBiB - Centre For Biotechnology and Bioengineering, Department of Computer Science, University of Chile, Santiago, Chile

Cite As Get BibTex

Diego Díaz-Domínguez and Gonzalo Navarro. Efficient Construction of the BWT for Repetitive Text Using String Compression. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 29:1-29:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/LIPIcs.CPM.2022.29

Abstract

We present a new semi-external algorithm that builds the Burrows-Wheeler transform variant of Bauer et al. (a.k.a., BCR BWT) in linear expected time. Our method uses compression techniques to reduce the computational costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS) and resort to run-length and grammar compression to maintain our intermediate results in compact form. Our compression format not only saves space, but it also speeds up the required computations. Our experiments show important savings in both space and computation time when the text is repetitive. On average, we are 3.7x faster than the baseline compressed approach, while maintaining a similar memory consumption. These results make our method stand out as the only one (to our knowledge) that can build the BCR BWT of a collection of 25 human genomes (75 GB) in about 7.3 hours, and using only 27 GB of working memory.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
Keywords
  • BWT
  • string compression
  • repetitive text

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Markus J. Bauer, Anthony J. Cox, and Giovanna Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science, 483:134-148, 2013. Google Scholar
  2. Hans-J Boehm, Russ Atkinson, and Michael Plass. Ropes: An alternative to strings. Software: Practice and Experience, 25(12):1315-1330, 1995. Google Scholar
  3. Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, and Raffaella Rizzi. Computing the multi-string BWT and LCP array in external memory. Theoretical Computer Science, 862:42-58, 2021. Google Scholar
  4. Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. Computing the original eBWT faster, simpler, and with less memory. In Proc. 28th International Symposium on String Processing and Information Retrieval (SPIRE), pages 129-142, 2021. Google Scholar
  5. Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14, 2019. Article 13. Google Scholar
  6. Michael Burrows and David Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994. Google Scholar
  7. Diego Díaz-Domínguez and Gonzalo Navarro. A grammar compressor for collections of reads with applications to the construction of the BWT. In Proc. 31st Data Compression Conference (DCC), pages 83-92, 2021. Google Scholar
  8. Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, and Guilherme P. Telles. External memory BWT and LCP computation for sequence collections with applications. Algorithms for Molecular Biology, 14:6, 2019. Google Scholar
  9. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS), pages 390-398, 2000. Google Scholar
  10. Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for BWT-based data structures. Theoretical Computer Science, 698:67-78, 2017. Google Scholar
  11. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From Theory to Practice: Plug and Play with Succinct Data Structures. In Proc. 13th International Symposium on Experimental Algorithms (SEA), pages 326-337, 2014. Google Scholar
  12. Richard Karp and Michael Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987. Google Scholar
  13. Dominik Kempa. Optimal construction of compressed indexes for highly repetitive texts. In Proc. 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1344-1357, 2019. Google Scholar
  14. Dominik Kempa and Tomasz Kociumaka. String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure. In Proc. 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 756-767, 2019. Google Scholar
  15. Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. In Proc. 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 1002-1013. IEEE, 2020. Google Scholar
  16. John C. Kieffer and En Hui Yang. Grammar-based codes: a new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737-754, 2000. Google Scholar
  17. Pang Ko and Srinivas Aluru. Space efficient linear time construction of suffix arrays. Journal of Discrete Algorithms, 3(2-4):143-156, 2005. Google Scholar
  18. Ben Langmead, Cole Trapnell, Mihai Pop, and Steven Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), 2009. Article R25. Google Scholar
  19. Jesper Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722-1732, 2000. Google Scholar
  20. Heng Li. Fast construction of fm-index for long sequence reads. Bioinformatics, 30(22):3274-3275, 2014. Google Scholar
  21. Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589-595, 2010. Google Scholar
  22. Felipe A. Louza, Simon Gog, and Guilherme P. Telles. Inducing enhanced suffix arrays for string collections. Theoretical Computer Science, 678(1):22-39, 2017. Google Scholar
  23. Felipe A. Louza, Guilherme P. Telles, Simon Gog, Nicola Prezza, and Giovanna Rosone. gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms for Molecular Biology, 15, 2020. Article 18. Google Scholar
  24. Robert Love. Linux kernel development. Pearson Education, 2010. Google Scholar
  25. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru Tomescu. Genome-Scale Algorithm Design. Camb. U. Press, 2015. Google Scholar
  26. Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993. Google Scholar
  27. Gonzalo Navarro. Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures. ACM Computing Surveys, 54(2), 2021. Article 29. Google Scholar
  28. Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39:article 2, 2007. Google Scholar
  29. Ge Nong. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Transactions on Information Systems, 31(3):1-15, 2013. Google Scholar
  30. Ge Nong, Sen Zhang, and Wai Hong Chan. Linear suffix array construction by almost pure induced-sorting. In Proc. 19th Data Compression Conference (DCC), pages 193-202, 2009. Google Scholar
  31. Daniel Saad Nogueira Nunes, Felipe A. Louza, Simon Gog, Mauricio Ayala-Rincón, and Gonzalo Navarro. A grammar compression algorithm based on induced suffix sorting. In Proc. 28th Data Compression Conference (DCC), pages 42-51, 2018. Google Scholar
  32. Enno Ohlebusch. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, 2013. Google Scholar
  33. Daisuke Okanohara and Kunihiko Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 60-70. SIAM, 2007. Google Scholar
  34. Daisuke Okanohara and Kunihiko Sadakane. A linear-time Burrows-Wheeler transform using induced sorting. In Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE), pages 90-101, 2009. Google Scholar
  35. Zachary D. Stephens, Skylar Y. Lee, Faraz Faghri, Roy H. Campbell, Chengxiang Zhai, Miles J. Efron, Ravishankar Iyer, Michael C. Schatz, Saurabh Sinha, and Gene E. Robinson. Big data: astronomical or genomical? PLoS Biology, 13(7):e1002195, 2015. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail