Compressing and Indexing Aligned Readsets

Gagie, Travis; Gourdel, Garance; Manzini, Giovanni

doi:10.4230/LIPIcs.WABI.2021.13

Abstract

Compressed full-text indexes are one of the main success stories of bioinformatics data structures but even they struggle to handle some DNA readsets. This may seem surprising since, at least when dealing with short reads from the same individual, the readset will be highly repetitive and, thus, highly compressible. If we are not careful, however, this advantage can be more than offset by two disadvantages: first, since most base pairs are included in at least tens reads each, the uncompressed readset is likely to be at least an order of magnitude larger than the individual’s uncompressed genome; second, these indexes usually pay some space overhead for each string they store, and the total overhead can be substantial when dealing with millions of reads. The most successful compressed full-text indexes for readsets so far are based on the Extended Burrows-Wheeler Transform (EBWT) and use a sorting heuristic to try to reduce the space overhead per read, but they still treat the reads as separate strings and thus may not take full advantage of the readset’s structure. For example, if we have already assembled an individual’s genome from the readset, then we can usually use it to compress the readset well: e.g., we store the gap-coded list of reads' starting positions; we store the list of their lengths, which is often highly compressible; and we store information about the sequencing errors, which are rare with short reads. There is nowhere, however, where we can plug an assembled genome into the EBWT. In this paper we show how to use one or more assembled or partially assembled genome as the basis for a compressed full-text index of its readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that. Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indices by looking at the number of runs in the transformed strings. For a human Chr19 readset our preliminary experiments show that eliminating separators characters from the EBWT reduces the number of runs by 19%, from 220 million to 178 million, and using the XBWT reduces it by a further 15%, to 150 million.

Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, and Nicola Prezza. Regular languages meet prefix sorting. In Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '20, page 911–930. Society for Industrial and Applied Mathematics, 2020.
Fatemeh Almodaresi, Prashant Pandey, and Rob Patro. Rainbowfish: a succinct colored de Bruijn graph representation. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
Uwe Baier, Thomas Büchler, Enno Ohlebusch, and Pascal Weber. Edge minimization in de Bruijn graphs. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, Data Compression Conference, DCC 2020, Snowbird, UT, USA, March 24-27, 2020, pages 223-232. IEEE, 2020. URL: https://doi.org/10.1109/DCC47342.2020.00030.
Hideo Bannai, Travis Gagie, and I Tomohiro. Refining the r-index. etical Computer Science, 812:96-108, 2020.
Markus J Bauer, Anthony J Cox, and Giovanna Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science, 483:134-148, 2013.
Jason W Bentley, Daniel Gibney, and Sharma V Thankachan. On the complexity of BWT-runs minimization via alphabet reordering. In 28th Annual European Symposium on Algorithms (ESA 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14(1):1-15, 2019.
Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In Ben Raphael and Jijun Tang, editors, Algorithms in Bioinformatics, pages 225-235. Springer Berlin Heidelberg, 2012.
Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In International workshop on algorithms in bioinformatics (WABI), pages 225-235. Springer, 2012.
Michael Burrows and David Wheeler. A block-sorting lossless data compression algorithm. In Digital SRC Research Report. Citeseer, 1994.
Shubham Chandak, Kedar Tatwawadi, Idoia Ochoa, Mikel Hernaez, and Tsachy Weissman. SPRING: a next-generation compressor for FASTQ data. Bioinformatics, 35(15):2674-2676, 2018. URL: https://doi.org/10.1093/bioinformatics/bty1015.
Hamidreza Chitsaz, Joyclyn Yee-Greenbaum, Glenn Tesler, Mary-Jane Lombardo, Christopher Dupont, Jonathan Badger, Mark Novotny, Douglas Rusch, Louise Fraser, Niall Gormley, Ole Schulz-Trieglaff, Geoffrey Smith, Dirk Evers, Pavel Pevzner, and Roger Lasken. Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nature biotechnology, 29:915-21, September 2011. URL: https://doi.org/10.1038/nbt.1966.
Anthony J Cox, Markus J Bauer, Tobias Jakobi, and Giovanna Rosone. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics, 28(11):1415-1419, 2012.
Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, and Giovanna Rosone. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinform., 28(11):1415-1419, 2012. URL: https://doi.org/10.1093/bioinformatics/bts173.
Anthony J. Cox, Tobias Jakobi, Giovanna Rosone, and Ole B. Schulz-Trieglaff. Comparing DNA sequence collections by direct comparison of compressed text indexes. In Ben Raphael and Jijun Tang, editors, Algorithms in Bioinformatics. Springer Berlin Heidelberg, 2012.
D. Díaz-Domínguez and G. Navarro. A grammar compressor for collections of reads with applications to the construction of the BWT. In Proc. 31th Data Compression Conference (DCC), 2021. To appear.
Dirk D Dolle, Zhicheng Liu, Matthew Cotten, Jared T Simpson, Zamin Iqbal, Richard Durbin, Shane A McCarthy, and Thomas M Keane. Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes. Genome research, 27(2):300-309, 2017.
Lavinia Egidi, Felipe A Louza, Giovanni Manzini, and Guilherme P Telles. External memory BWT and LCP computation for sequence collections with applications. Algorithms for Molecular Biology, 14(1):1-15, 2019.
Lavinia Egidi and Giovanni Manzini. Lightweight merging of compressed indices based on BWT variants. Theoretical Computer Science, 812:214-229, 2020.
Héctor Ferrada, Travis Gagie, Tommi Hirvola, and Simon J Puglisi. Hybrid indexes for repetitive datasets. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 372(2016):20130137, 2014.
Héctor Ferrada, Dominik Kempa, and Simon J Puglisi. Hybrid indexing revisited. In 2018 Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 1-8. SIAM, 2018.
Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and Senthilmurugan Muthukrishnan. Compressing and indexing labeled trees, with applications. Journal of the ACM (JACM), 57(1):1-33, 2009.
Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM (JACM), 52(4):552-581, 2005.
Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for BWT-based data structures. Theoretical computer science, 698:67-78, 2017.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM (JACM), 67(1):1-54, 2020.
Travis Gagie and Simon J Puglisi. Searching and indexing genomic databases via kernelization. Frontiers in Bioengineering and Biotechnology, 3:12, 2015.
Erik Garrison, Jouni Sirén, Adam M Novak, Glenn Hickey, Jordan M Eizenga, Eric T Dawson, William Jones, Shilpa Garg, Charles Markello, Michael F Lin, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology, 36(9):875-879, 2018.
Sara Giuliani, Shunsuke Inenaga, Zsuzsanna Lipták, Nicola Prezza, Marinella Sciortino, and Anna Toffanello. Novel results on the number of runs of the Burrows-Wheeler transform. In Tomáš Bureš, Riccardo Dondi, Johann Gamper, Giovanna Guerrini, Tomasz Jurdziński, Claus Pahl, Florian Sikora, and Prudence W.H. Wong, editors, SOFSEM 2021: Theory and Practice of Computer Science, pages 249-262, Cham, 2021. Springer International Publishing.
Sara Giuliani, Zsuzsanna Lipták, Francesco Masillo, and Romeo Rizzi. When a dollar makes a BWT. Theoretical Computer Science, 2019.
Lilian Janin, Ole Schulz-Trieglaff, and Anthony J Cox. BEETL-fastq: a searchable compressed archive for DNA reads. Bioinformatics, 30(19):2796-2801, 2014.
Juha Kärkkäinen, Giovanni Manzini, and Simon J Puglisi. Permuted longest-common-prefix array. In Annual Symposium on Combinatorial Pattern Matching, pages 181-192. Springer, 2009.
R. M. Karp and M. O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987. URL: https://doi.org/10.1147/rd.312.0249.
Richard M. Karp, Raymond E. Miller, and Arnold L. Rosenberg. Rapid identification of repeated patterns in strings, trees and arrays. In Proceedings of the Fourth Annual ACM Symposium on Theory of Computing, page 125–136, 1972. URL: https://doi.org/10.1145/800152.804905.
Alice M Kaye and Wyeth W Wasserman. The genome atlas: Navigating a new era of reference genomes. Trends in Genetics, 2021.
Yuichi Kodama, Martin Shumway, and Rasko Leinonen. The sequence read archive: explosive growth of sequencing data. Nucleic acids research, 40(D1):D54-D56, 2012.
Alan Kuhnle, Taher Mun, Christina Boucher, Travis Gagie, Ben Langmead, and Giovanni Manzini. Efficient construction of a complete index for pan-genomics read alignment. Journal of Computational Biology, 27(4):500-513, 2020.
Ben Langmead. Algorithms for DNA sequencing: Base calling and sequencing errors, May 2015. URL: https://www.youtube.com/watch?v=U4QnpciIJhM&list=PL2mpR0RYFQsBiCWVJSvVAO3OJ2t7DzoHA&index=10.
Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4):357, 2012.
Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3):1-10, 2009.
Heng Li. Fast construction of FM-index for long sequence reads. Bioinform., 30(22):3274-3275, 2014. URL: https://doi.org/10.1093/bioinformatics/btu541.
Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754-1760, 2009.
Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754-1760, 2009. URL: https://doi.org/10.1093/bioinformatics/btp324.
Felipe A. Louza, Simon Gog, and Guilherme P. Telles. Inducing enhanced suffix arrays for string collections. Theoretical Computer Science, 678:22-39, 2017. URL: https://doi.org/10.1016/j.tcs.2017.03.039.
Tanja Magoc, Stephan Pabinger, Stefan Canzar, Xinyue Liu, Qi Su, Daniela Puiu, Luke J. Tallon, and Steven L. Salzberg. GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinform., 29(14):1718-1725, 2013. URL: https://doi.org/10.1093/bioinformatics/btt273.
Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I Tomescu. Genome-scale algorithm design. Cambridge University Press, 2015.
S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino. An extension of the Burrows–Wheeler transform. Theoretical Computer Science, 387(3):298-312, 2007. URL: https://doi.org/10.1016/j.tcs.2007.07.014.
Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the Burrows-Wheeler transform. Theoretical Computer Science, 387(3):298-312, 2007.
Martin D Muggli, Alexander Bowe, Noelle R Noyes, Paul S Morley, Keith E Belk, Robert Raymond, Travis Gagie, Simon J Puglisi, and Christina Boucher. Succinct colored de Bruijn graphs. Bioinformatics, 33(20):3181-3187, 2017.
Gonzalo Navarro. Compact data structures: A practical approach. Cambridge University Press, 2016.
Takaaki Nishimoto and Yasuo Tabei. Faster queries on BWT-runs compressed indexes. arXiv preprint, 2020. URL: http://arxiv.org/abs/2006.05104.
Alberto Policriti and Nicola Prezza. LZ77 computation based on the run-length encoded BWT. Algorithmica, 80(7):1986-2011, 2018.
Nicola Prezza. On locating paths in compressed tries. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 744-760. SIAM, 2021.
Nicola Prezza. Subpath queries on compressed graphs: A survey. Algorithms, 14(1):14, 2021.
René Rahn, David Weese, and Knut Reinert. Journaled string tree - a scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics, 30(24):3499-3505, 2014. URL: https://doi.org/10.1093/bioinformatics/btu438.
Julian Seward. bzip2 and libbzip2, 1996. avaliable at URL: http://www.bzip.org.
Jouni Sirén, Niko Välimäki, and Veli Mäkinen. Indexing graphs for path queries with applications in genome research. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(2):375-388, 2014.
Karyn Meltz Steinberg, Valerie A. Schneider, Tina A. Graves-Lindsay, Robert S. Fulton, Richa Agarwala, John Huddleston, Sergey A. Shiryev, Aleksandr Morgulis, Urvashi Surti, Wesley C. Warren, Deanna M. Church, Evan E. Eichler, and Richard K. Wilson. Single haplotype assembly of the human genome from a hydatidiform mole. Genome Research, 24(12):2066-2076, 2014. URL: https://doi.org/10.1101/gr.180893.114.
Daniel Valenzuela, Tuukka Norri, Niko Välimäki, Esa Pitkänen, and Veli Mäkinen. Towards pan-genome read alignment to improve variation calling. BMC genomics, 19(2):123-130, 2018.
Aleksey V. Zimin, Guillaume Marçais, Daniela Puiu, Michael Roberts, Steven L. Salzberg, and James A. Yorke. The MaSuRCA genome assembler. Bioinformatics, 29(21):2669-2677, August 2013. URL: https://academic.oup.com/bioinformatics/article-pdf/29/21/2669/18533361/btt476.pdf.

Compressing and Indexing Aligned Readsets

Authors Travis Gagie , Garance Gourdel, Giovanni Manzini

File

Document Identifiers

Author Details

Acknowledgements

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Compressing and Indexing Aligned Readsets

Authors Travis Gagie , Garance Gourdel, Giovanni Manzini

File

Document Identifiers

Author Details

Funding

Acknowledgements

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Supplementary Materials

References