
Prefix-Free Parsing for Building Big BWTs

Authors Christina Boucher , Travis Gagie , Alan Kuhnle , Giovanni Manzini




File

LIPIcs.WABI.2018.2.pdf
  • Filesize: 0.69 MB
  • 16 pages

Author Details

Christina Boucher
  • CISE, University of Florida, Gainesville, FL, USA
Travis Gagie
  • EIT, Diego Portales University, Santiago, Chile; CeBiB, Santiago, Chile
Alan Kuhnle
  • CISE, University of Florida, Gainesville, FL, USA
Giovanni Manzini
  • University of Eastern Piedmont, Alessandria, Italy; IIT, CNR, Pisa, Italy

Cite As

Christina Boucher, Travis Gagie, Alan Kuhnle, and Giovanni Manzini. Prefix-Free Parsing for Building Big BWTs. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 2:1-2:16, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.WABI.2018.2

Abstract

High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice and thus can fit in a reasonable internal memory even when T is very large. Therefore, prefix-free parsing eases BWT construction, which is pertinent to many bioinformatics applications.
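
The abstract does not say how phrase boundaries are chosen, so the following Python sketch is illustrative only and is not the authors' implementation. It assumes a window length w and a modulus p (both hypothetical parameters here): a phrase ends wherever the hash of the current length-w window is 0 modulo p, and consecutive phrases overlap by w characters. The function names prefix_free_parse and reconstruct, the sentinel characters, and the toy hash are assumptions made for this example. The sketch only produces D and P and shows that T can be recovered from them; it does not build the BWT.

```python
def prefix_free_parse(text, w=4, p=11):
    """Return (dictionary, parse) for text; w and p are illustrative parameters."""

    def is_trigger(window):
        # Toy polynomial hash recomputed per window; a real implementation
        # would likely use a rolling hash so the scan stays linear in |T|.
        h = 0
        for ch in window:
            h = (h * 256 + ord(ch)) % 1_000_000_007
        return h % p == 0

    # Assumed sentinels: "\x01" in front and w copies of "\x00" at the end,
    # neither of which may occur in text.
    s = "\x01" + text + "\x00" * w
    last = len(s) - w  # start position of the final window

    phrases, start = [], 0
    for i in range(1, last + 1):
        # Close a phrase at every trigger window, and always at the final one.
        if i == last or is_trigger(s[i:i + w]):
            phrases.append(s[start:i + w])
            start = i  # the next phrase begins with the same w characters

    # D: the distinct phrases in sorted order. P: the rank of each phrase.
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w=4):
    """Rebuild the original text from D and P by undoing the w-character overlaps."""
    s = dictionary[parse[0]] + "".join(dictionary[r][w:] for r in parse[1:])
    return s[1:len(s) - w]  # strip the leading and trailing sentinels
```

On repetitive input the same phrases tend to recur, so the dictionary stays small while the parse is a short list of integers; how much smaller depends on the text and on the assumed choices of w and p.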

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
Keywords
  • Burrows-Wheeler Transform
  • prefix-free parsing
  • compression-aware algorithms
  • genomic databases

