Prefix-Free Parsing for Building Large Tunnelled Wheeler Graphs

Goga, Adrián; Baláž, Andrej

doi:10.4230/LIPIcs.WABI.2022.18

Abstract

We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019).
Wheeler graphs are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation which can be further compacted by employing the idea of tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths called blocks that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e. the one that minimizes the size of the resulting Wheeler graph, is known to be NP-complete and remains the most computationally challenging part of the tunnelling process.
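As an illustration of what makes a graph a Wheeler graph, the following minimal sketch checks the ordering conditions from Gagie et al. (2017) for a given node order. The function name `is_wheeler_order` and the edge-triple encoding are assumptions made for this example, and the quadratic pairwise check is chosen for clarity rather than speed.

```python
from itertools import product


def is_wheeler_order(num_nodes, edges):
    """Check whether the node order 0..num_nodes-1 satisfies the Wheeler conditions.

    edges is a list of (source, target, label) triples (hypothetical encoding).
    """
    has_incoming = [False] * num_nodes
    for _, target, _ in edges:
        has_incoming[target] = True

    # Nodes with in-degree 0 must precede all nodes with positive in-degree.
    seen_positive = False
    for flag in has_incoming:
        if flag:
            seen_positive = True
        elif seen_positive:
            return False

    # Compare every ordered pair of edges (brute force, quadratic time).
    for (u, v, a), (u2, v2, a2) in product(edges, repeat=2):
        # A smaller label must lead to a strictly smaller target.
        if a < a2 and not v < v2:
            return False
        # Equal labels must preserve the order of sources (ties allowed).
        if a == a2 and u < u2 and not v <= v2:
            return False
    return True
```

For example, `is_wheeler_order(3, [(0, 1, "a"), (0, 2, "b")])` accepts the order, while swapping the two targets violates the label condition.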
To find an adequate set of blocks in less time, we propose a new method based on prefix-free parsing (PFP). The idea of PFP is to divide the input text into phrases of roughly equal sizes that overlap by a fixed number of characters. The phrases are then sorted lexicographically. The original text is represented by a sequence of phrase ranks (the parse) and a list of all used phrases (the dictionary). In repetitive texts, the PFP representation of the text is generally much shorter than the original, since individual phrases are used many times in the parse, which keeps the dictionary small.
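The parsing step described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the names `prefix_free_parse` and `reconstruct`, the sentinel characters, the default window size `w` and the toy hash that picks phrase boundaries are all assumptions. Practical implementations typically select boundaries with a rolling hash over much larger windows.

```python
def prefix_free_parse(text, w=2, modulus=3):
    """Toy prefix-free parsing: returns (dictionary, parse)."""
    # Assumption: '\x02' and '\x01' never occur in the text; they act as
    # start and end sentinels so that every phrase boundary is well defined.
    padded = "\x02" + text + "\x01" * w

    def is_trigger(window):
        # Toy stand-in for the rolling hash used in practice.
        return sum(ord(c) for c in window) % modulus == 0

    phrases = []
    start = 0
    last = len(padded) - w  # the final window always closes a phrase
    for i in range(1, last + 1):
        if i == last or is_trigger(padded[i:i + w]):
            # The phrase keeps the trigger window, so neighbours overlap by w.
            phrases.append(padded[start:i + w])
            start = i

    dictionary = sorted(set(phrases))  # distinct phrases in lexicographic order
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]  # text as a sequence of ranks
    return dictionary, parse


def reconstruct(dictionary, parse, w=2):
    """Invert the parse by dropping the w overlapping characters."""
    padded = dictionary[parse[0]] + "".join(dictionary[r][w:] for r in parse[1:])
    return padded[1:len(padded) - w]  # strip the sentinels
```

On a repetitive input such as `"GATTACA" * 50`, the same few phrases recur throughout the parse, so the dictionary stays far smaller than the parse, and `reconstruct` recovers the original text from the two.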
To speed up block selection for tunnelling, we apply PFP to obtain the parse and the dictionary of the original text, tunnel the Wheeler graph of the parse using existing heuristics, and subsequently use this tunnelled parse to construct a compact Wheeler graph of the original text. Compared with constructing a Wheeler graph from the original text without PFP, our method is much faster and uses less memory on collections of pangenomic sequences. Therefore, our method enables the use of Wheeler graphs as a pangenomic reference for real-world pangenomic datasets.

Cite As

Adrián Goga and Andrej Baláž. Prefix-Free Parsing for Building Large Tunnelled Wheeler Graphs. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 18:1-18:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/LIPIcs.WABI.2022.18

Author Details

Adrián Goga

Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia

Andrej Baláž

Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia

Funding

Goga, Adrián: VEGA grant 1/0463/20; EU Horizon 2020 grant No. 956229 (ALPACA); Comenius University grant for doctoral students No. 422
Baláž, Andrej: VEGA grant 1/0538/22; EU Horizon 2020 grant No. 956229 (ALPACA)

Acknowledgements

We want to thank Travis Gagie for the conception of the idea during his data structures course and helpful remarks throughout the realisation of this project. Our thanks also go to Broňa Brejová for useful advice during the writing process and Uwe Baier for kindly responding to our questions via email. Finally, we thank Lucas Pansani Ramos for the aid he provided in the early days of the project.

Supplementary Materials

Software (Source Code): https://github.com/fmfi-compbio/pfp_wg

References

Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, and Nicola Prezza. Regular languages meet prefix sorting. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 911-930. SIAM, 2020.
Uwe Baier. On Undetected Redundancy in the Burrows-Wheeler Transform. Annual Symposium on Combinatorial Pattern Matching (CPM 2018), 105:3:1-3:15, 2018. URL: https://doi.org/10.4230/LIPIcs.CPM.2018.3.
Uwe Baier, Thomas Büchler, Enno Ohlebusch, and Pascal Weber. Edge minimization in de Bruijn graphs. In 2020 Data Compression Conference (DCC), pages 223-232. IEEE, 2020.
Uwe Baier and Kadir Dede. BWT Tunnel Planning is hard but manageable. In 2019 Data Compression Conference (DCC), pages 142-151. IEEE, 2019.
Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14(1):1-15, 2019.
Michael Burrows and David Wheeler. A block-sorting lossless data compression algorithm. In Digital SRC Research Report. Citeseer, 1994.
Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings 41st annual symposium on foundations of computer science, pages 390-398. IEEE, 2000.
Johannes Fischer and Florian Kurpicz. Dismantling divsufsort. arXiv preprint, 2017. URL: http://arxiv.org/abs/1710.01896.
Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for BWT-based data structures. Theoretical computer science, 698:67-78, 2017.
Erik Garrison, Jouni Sirén, Adam M Novak, Glenn Hickey, Jordan M Eizenga, Eric T Dawson, William Jones, Shilpa Garg, Charles Markello, Michael F Lin, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology, 36(9):875-879, 2018.
Daniel Gibney and Sharma V Thankachan. On the complexity of recognizing wheeler graphs. Algorithmica, 84(3):784-814, 2022.
Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In 13th International Symposium on Experimental Algorithms, (SEA 2014), pages 326-337, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_28.
Juha Kärkkäinen and Peter Sanders. Simple linear work suffix array construction. In International colloquium on automata, languages, and programming, pages 943-955. Springer, 2003.
Ben Langmead. Aligning short sequencing reads with Bowtie. Current protocols in bioinformatics, 32(1):11-7, 2010.
Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4):357-359, 2012.
Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint, 2013. URL: http://arxiv.org/abs/1303.3997.
Felipe A Louza, Simon Gog, and Guilherme P Telles. Inducing enhanced suffix arrays for string collections. Theoretical Computer Science, 678:22-39, 2017.
Ge Nong. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Transactions on Information Systems (TOIS), 31(3):1-15, 2013.
Ge Nong, Sen Zhang, and Wai Hong Chan. Linear suffix array construction by almost pure induced-sorting. In 2009 data compression conference, pages 193-202. IEEE, 2009.
Alberto Policriti and Nicola Prezza. From LZ77 to the Run-Length Encoded Burrows-Wheeler Transform, and Back. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017), volume 78 of Leibniz International Proceedings in Informatics (LIPIcs), pages 17:1-17:10, Dagstuhl, Germany, 2017. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. URL: https://doi.org/10.4230/LIPIcs.CPM.2017.17.
Jarno Alanko, Nicola Cotumaccio, and Nicola Prezza. Linear-time minimization of Wheeler DFAs, 2022. URL: https://sigport.org/documents/linear-time-minimization-wheeler-dfas.
Nayanah Siva. 1000 genomes project. Nature biotechnology, 26(3):256-257, 2008.
Daniel Valenzuela and Veli Mäkinen. CHIC: a short read aligner for pan-genomic references. biorxiv, page 178129, 2017.
