Efficient Construction of the BWT for Repetitive Text Using String Compression

Díaz-Domínguez, Diego; Navarro, Gonzalo

doi:10.4230/LIPIcs.CPM.2022.29

File

Author Details

Diego Díaz-Domínguez

Department of Computer Science, University of Helsinki, Finland

Gonzalo Navarro

CeBiB - Centre For Biotechnology and Bioengineering, Department of Computer Science, University of Chile, Santiago, Chile

Cite AsGet BibTex

Diego Díaz-Domínguez and Gonzalo Navarro. Efficient Construction of the BWT for Repetitive Text Using String Compression. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 29:1-29:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.CPM.2022.29

Abstract

We present a new semi-external algorithm that builds the Burrows-Wheeler transform variant of Bauer et al. (a.k.a., BCR BWT) in linear expected time. Our method uses compression techniques to reduce the computational costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS) and resort to run-length and grammar compression to maintain our intermediate results in compact form. Our compression format not only saves space, but it also speeds up the required computations. Our experiments show important savings in both space and computation time when the text is repetitive. On average, we are 3.7x faster than the baseline compressed approach, while maintaining a similar memory consumption. These results make our method stand out as the only one (to our knowledge) that can build the BCR BWT of a collection of 25 human genomes (75 GB) in about 7.3 hours, and using only 27 GB of working memory.

Subject Classification

ACM Subject Classification

Theory of computation → Data compression

Keywords

BWT
string compression
repetitive text

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

Markus J. Bauer, Anthony J. Cox, and Giovanna Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science, 483:134-148, 2013.
Hans-J Boehm, Russ Atkinson, and Michael Plass. Ropes: An alternative to strings. Software: Practice and Experience, 25(12):1315-1330, 1995.
Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, and Raffaella Rizzi. Computing the multi-string BWT and LCP array in external memory. Theoretical Computer Science, 862:42-58, 2021.
Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. Computing the original eBWT faster, simpler, and with less memory. In Proc. 28th International Symposium on String Processing and Information Retrieval (SPIRE), pages 129-142, 2021.
Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14, 2019. Article 13.
Michael Burrows and David Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
Diego Díaz-Domínguez and Gonzalo Navarro. A grammar compressor for collections of reads with applications to the construction of the BWT. In Proc. 31st Data Compression Conference (DCC), pages 83-92, 2021.
Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, and Guilherme P. Telles. External memory BWT and LCP computation for sequence collections with applications. Algorithms for Molecular Biology, 14:6, 2019.
Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS), pages 390-398, 2000.
Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for BWT-based data structures. Theoretical Computer Science, 698:67-78, 2017.
Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From Theory to Practice: Plug and Play with Succinct Data Structures. In Proc. 13th International Symposium on Experimental Algorithms (SEA), pages 326-337, 2014.
Richard Karp and Michael Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987.
Dominik Kempa. Optimal construction of compressed indexes for highly repetitive texts. In Proc. 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1344-1357, 2019.
Dominik Kempa and Tomasz Kociumaka. String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure. In Proc. 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 756-767, 2019.
Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. In Proc. 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 1002-1013. IEEE, 2020.
John C. Kieffer and En Hui Yang. Grammar-based codes: a new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737-754, 2000.
Pang Ko and Srinivas Aluru. Space efficient linear time construction of suffix arrays. Journal of Discrete Algorithms, 3(2-4):143-156, 2005.
Ben Langmead, Cole Trapnell, Mihai Pop, and Steven Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), 2009. Article R25.
Jesper Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722-1732, 2000.
Heng Li. Fast construction of fm-index for long sequence reads. Bioinformatics, 30(22):3274-3275, 2014.
Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589-595, 2010.
Felipe A. Louza, Simon Gog, and Guilherme P. Telles. Inducing enhanced suffix arrays for string collections. Theoretical Computer Science, 678(1):22-39, 2017.
Felipe A. Louza, Guilherme P. Telles, Simon Gog, Nicola Prezza, and Giovanna Rosone. gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms for Molecular Biology, 15, 2020. Article 18.
Robert Love. Linux kernel development. Pearson Education, 2010.
Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru Tomescu. Genome-Scale Algorithm Design. Camb. U. Press, 2015.
Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993.
Gonzalo Navarro. Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures. ACM Computing Surveys, 54(2), 2021. Article 29.
Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39:article 2, 2007.
Ge Nong. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Transactions on Information Systems, 31(3):1-15, 2013.
Ge Nong, Sen Zhang, and Wai Hong Chan. Linear suffix array construction by almost pure induced-sorting. In Proc. 19th Data Compression Conference (DCC), pages 193-202, 2009.
Daniel Saad Nogueira Nunes, Felipe A. Louza, Simon Gog, Mauricio Ayala-Rincón, and Gonzalo Navarro. A grammar compression algorithm based on induced suffix sorting. In Proc. 28th Data Compression Conference (DCC), pages 42-51, 2018.
Enno Ohlebusch. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, 2013.
Daisuke Okanohara and Kunihiko Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 60-70. SIAM, 2007.
Daisuke Okanohara and Kunihiko Sadakane. A linear-time Burrows-Wheeler transform using induced sorting. In Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE), pages 90-101, 2009.
Zachary D. Stephens, Skylar Y. Lee, Faraz Faghri, Roy H. Campbell, Chengxiang Zhai, Miles J. Efron, Ravishankar Iyer, Michael C. Schatz, Saurabh Sinha, and Gene E. Robinson. Big data: astronomical or genomical? PLoS Biology, 13(7):e1002195, 2015.

Efficient Construction of the BWT for Repetitive Text Using String Compression

Authors Diego Díaz-Domínguez, Gonzalo Navarro

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message

Efficient Construction of the BWT for Repetitive Text Using String Compression

Authors Diego Díaz-Domínguez, Gonzalo Navarro

File

Document Identifiers

Author Details

Funding

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Supplementary Materials

References

Thanks for your feedback!

Could not send message