Prefix-Free Parsing for Building Big BWTs

Boucher, Christina; Gagie, Travis; Kuhnle, Alan; Manzini, Giovanni

doi:10.4230/LIPIcs.WABI.2018.2

Abstract

High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice and thus can fit in a reasonable internal memory even when T is very large. Therefore, prefix-free parsing eases BWT construction, which is pertinent to many bioinformatics applications.
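
The abstract does not say how D and P are formed, so the following Python sketch is illustrative only and rests on stated assumptions: phrase boundaries are placed wherever a window of w characters hashes to 0 modulo p, consecutive phrases overlap by w characters, D is the sorted set of distinct phrases, and P lists each phrase's rank in D. The function names, the sentinel characters, the hash, and the default values of w and p are all hypothetical choices made for this example. It covers only the parsing step, not the construction of the BWT from D and P.

```python
def window_hash(window, base=256, prime=1_000_000_007):
    """Polynomial (Karp-Rabin style) hash of a short string.

    Recomputed per window for clarity; a rolling update would give the
    O(1)-per-character cost needed for a true one-pass scan.
    """
    h = 0
    for ch in window:
        h = (h * base + ord(ch)) % prime
    return h


def prefix_free_parse(text, w=2, p=3):
    """Return (dictionary, parse) for text, assuming w >= 1.

    Assumption: the sentinels "\x01" and "\x00" do not occur in text.
    """
    start, end = "\x01", "\x00"
    padded = start + text + end * w
    phrases = []
    phrase_start = 0
    last = len(padded) - w  # start index of the final window
    for i in range(1, last + 1):
        window = padded[i:i + w]
        # A phrase ends at every trigger window and at the final window.
        if i == last or window_hash(window) % p == 0:
            phrases.append(padded[phrase_start:i + w])
            phrase_start = i  # next phrase begins with this same window
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w):
    """Invert prefix_free_parse: rebuild the original text."""
    phrases = [dictionary[r] for r in parse]
    # Keep the first phrase whole, then skip each w-character overlap.
    padded = phrases[0] + "".join(phrase[w:] for phrase in phrases[1:])
    return padded[1:-w]  # strip the start sentinel and the w end sentinels


if __name__ == "__main__":
    text = "GATTACAT!GATTACAT!GATTAGATA"
    dictionary, parse = prefix_free_parse(text, w=2, p=3)
    print(len(dictionary), "distinct phrases,", len(parse), "phrases in the parse")
    print(reconstruct(dictionary, parse, 2) == text)
```

On repetitive input, repeated regions tend to be cut at the same trigger windows, so the same phrases recur and D stays small relative to T; this is the intuition behind the size claim above, not a measurement.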

Cite As

Christina Boucher, Travis Gagie, Alan Kuhnle, and Giovanni Manzini. Prefix-Free Parsing for Building Big BWTs. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 2:1-2:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018) https://doi.org/10.4230/LIPIcs.WABI.2018.2

Author Details

Christina Boucher

CISE, University of Florida, Gainesville, FL, USA

Travis Gagie

EIT, Diego Portales University, Santiago, Chile; CeBiB, Santiago, Chile

Alan Kuhnle

CISE, University of Florida, Gainesville, FL, USA

Giovanni Manzini

University of Eastern Piedmont, Alessandria, Italy; IIT, CNR, Pisa, Italy

Funding

Boucher, Christina: Partially supported by National Science Foundation grant 1618814.
Gagie, Travis: Partially supported by FONDECYT grant 1171058.
Kuhnle, Alan: Partially supported by National Science Foundation grant 1618814 and a post-doctoral fellowship from the University of Florida Informatics Institute.
Manzini, Giovanni: Partially supported by PRIN grant 201534HNXC.

Supplementary Materials

Source code: https://gitlab.com/manzai/Big-BWT

