High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive - a characteristic that can be exploited and enable the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. Therefore, prefix-free parsing eases BWT construction, which is pertinent to many bioinformatics applications.
@InProceedings{boucher_et_al:LIPIcs.WABI.2018.2, author = {Boucher, Christina and Gagie, Travis and Kuhnle, Alan and Manzini, Giovanni}, title = {{Prefix-Free Parsing for Building Big BWTs}}, booktitle = {18th International Workshop on Algorithms in Bioinformatics (WABI 2018)}, pages = {2:1--2:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-082-8}, ISSN = {1868-8969}, year = {2018}, volume = {113}, editor = {Parida, Laxmi and Ukkonen, Esko}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.2}, URN = {urn:nbn:de:0030-drops-93044}, doi = {10.4230/LIPIcs.WABI.2018.2}, annote = {Keywords: Burrows-Wheeler Transform, prefix-free parsing, compression-aware algorithms, genomic databases} }
Feedback for Dagstuhl Publishing