Compressing and Indexing Aligned Readsets

Authors: Travis Gagie, Garance Gourdel, Giovanni Manzini



Author Details

Travis Gagie
  • Dalhousie University, Halifax, Canada
Garance Gourdel
  • IRISA - Inria Rennes - Université Rennes 1 - ENS, France
Giovanni Manzini
  • University of Pisa, Italy

Acknowledgements

Many thanks to Jarno Alanko and Uwe Baier for their XBWT-construction software, and to Diego Díaz, Richard Durbin, Filippo Geraci, Giuseppe Italiano, Ben Langmead, Gonzalo Navarro, Pierre Peterlongo, Nicola Prezza, Giovanna Rosone, Jared Simpson, Jouni Sirén and Jan Studený for helpful discussions.

Cite As

Travis Gagie, Garance Gourdel, and Giovanni Manzini. Compressing and Indexing Aligned Readsets. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 13:1-13:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.WABI.2021.13

Abstract

Compressed full-text indexes are one of the main success stories of bioinformatics data structures, but even they struggle to handle some DNA readsets. This may seem surprising since, at least when dealing with short reads from the same individual, the readset will be highly repetitive and, thus, highly compressible. If we are not careful, however, this advantage can be more than offset by two disadvantages: first, since most base pairs are included in at least tens of reads each, the uncompressed readset is likely to be at least an order of magnitude larger than the individual’s uncompressed genome; second, these indexes usually pay some space overhead for each string they store, and the total overhead can be substantial when dealing with millions of reads. The most successful compressed full-text indexes for readsets so far are based on the Extended Burrows-Wheeler Transform (EBWT) and use a sorting heuristic to try to reduce the space overhead per read, but they still treat the reads as separate strings and thus may not take full advantage of the readset’s structure. For example, if we have already assembled an individual’s genome from the readset, then we can usually use it to compress the readset well: e.g., we store the gap-coded list of reads' starting positions; we store the list of their lengths, which is often highly compressible; and we store information about the sequencing errors, which are rare with short reads. There is nowhere, however, where we can plug an assembled genome into the EBWT. In this paper we show how to use one or more assembled or partially assembled genomes as the basis for a compressed full-text index of the corresponding readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that.
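The trunk-and-graft construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the `graft_reads` and `count_nodes` helpers, the root label `#`, and the choice to hang each read below the trunk node reached after `start` genome characters are all assumptions made for illustration. The sketch neither merges reads that share prefixes nor computes the XBWT.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One tree node; `label` is the character on the edge from its parent."""
    label: str
    children: list = field(default_factory=list)


def graft_reads(genome, aligned_reads):
    """Build a trunk from `genome` and graft each read at its alignment start.

    `aligned_reads` is a list of (start, read) pairs, where `start` is a
    0-based position in `genome`. Orientation and root label are assumptions
    of this sketch, not details taken from the paper.
    """
    root = Node("#")  # hypothetical root label
    trunk = [root]    # trunk[i] is the node reached after reading genome[:i]
    for ch in genome:
        node = Node(ch)
        trunk[-1].children.append(node)
        trunk.append(node)
    for start, read in aligned_reads:
        parent = trunk[start]
        for ch in read:  # each read becomes a separate branch off the trunk
            node = Node(ch)
            parent.children.append(node)
            parent = node
    return root


def count_nodes(root):
    """Count nodes iteratively, so long trunks do not hit the recursion limit."""
    total, stack = 0, [root]
    while stack:
        node = stack.pop()
        total += 1
        stack.extend(node.children)
    return total


tree = graft_reads("GATTACA", [(1, "ATTA"), (3, "TAGA")])
print(count_nodes(tree))
```

In this toy input the second read differs from the genome in one position, standing in for a sequencing error; it simply becomes a branch whose labels diverge from the trunk.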
Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indices by looking at the number of runs in the transformed strings. For a human Chr19 readset our preliminary experiments show that eliminating separator characters from the EBWT reduces the number of runs by 19%, from 220 million to 178 million, and using the XBWT reduces it by a further 15%, to 150 million.
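To make the run-count metric concrete, the following Python sketch counts maximal runs of equal characters in a transformed string. The `naive_bwt` helper is a hypothetical illustration that builds the plain single-string BWT by sorting rotations; it is not the EBWT or XBWT construction used in the paper and is only practical for tiny inputs.

```python
from itertools import groupby


def count_runs(s):
    """Number of maximal runs of equal consecutive characters in `s`."""
    return sum(1 for _ in groupby(s))


def naive_bwt(text, sentinel="$"):
    """Plain BWT by sorting all rotations of text + sentinel.

    Quadratic space, for illustration only. The sentinel is assumed to be
    absent from `text` and to sort before every character in it.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)


bwt = naive_bwt("banana")
print(bwt, count_runs(bwt))  # prints: annb$aa 5
```

Fewer runs generally means a smaller run-length compressed index, which is why the abstract reports run counts rather than index sizes.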

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
Keywords
  • data compression
  • compact data structures
  • FM-index
  • Burrows-Wheeler Transform
  • EBWT
  • XBWT
  • DNA reads

