Compressed Weighted de Bruijn Graphs

Authors: Giuseppe F. Italiano, Nicola Prezza, Blerina Sinaimeri, Rossano Venturini


File

LIPIcs.CPM.2021.16.pdf
  • Filesize: 0.76 MB
  • 16 pages

Author Details

Giuseppe F. Italiano
  • Luiss University, Rome, Italy
  • Erable, INRIA Grenoble Rhône-Alpes, France
Nicola Prezza
  • DAIS, Ca' Foscari University of Venice, Italy
Blerina Sinaimeri
  • Luiss University, Rome, Italy
  • Erable, INRIA Grenoble Rhône-Alpes, France
Rossano Venturini
  • Dipartimento di Informatica, Università di Pisa, Pisa, Italy

Cite As

Giuseppe F. Italiano, Nicola Prezza, Blerina Sinaimeri, and Rossano Venturini. Compressed Weighted de Bruijn Graphs. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 16:1-16:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.CPM.2021.16

Abstract

We propose a new compressed representation for weighted de Bruijn graphs, based on delta-encoding the variations of k-mer abundances on a spanning branching of the graph. Our new data structure is likely to be of practical value: for example, when combined with the compressed BOSS de Bruijn graph representation, it encodes the weighted de Bruijn graph of a 16x-covered DNA read-set (60M distinct k-mers, k = 28) within 4.15 bits per distinct k-mer and answers abundance queries in about 60 microseconds on a standard machine. In contrast, state-of-the-art tools declare a space usage of at least 30 bits per distinct k-mer for the same task, which our experiments confirm. As a by-product of our new data structure, we exhibit efficient compressed data structures for answering partial sums on edge-weighted trees, which might be of independent interest.
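
The idea of delta-encoding abundances along a spanning branching can be illustrated with a small Python sketch. It is not the data structure from the paper: it assumes plain dictionaries instead of the compressed BOSS representation, it builds the branching with a simple breadth-first search rather than an optimized one, and it answers a query by walking to the root instead of using a compressed partial-sums structure. The function names (kmer_counts, build_branching, abundance) are hypothetical and chosen only for this illustration.

```python
from collections import Counter, deque


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads (the k-mer abundances)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_branching(counts, alphabet="ACGT"):
    """Build a spanning forest over de Bruijn edges u -> u[1:] + c.

    Assumption: a BFS forest stands in for the spanning branching.
    Each root stores its own abundance; every other node stores only
    the difference between its abundance and its parent's.
    """
    parent, delta = {}, {}
    for root in sorted(counts):
        if root in parent:
            continue  # already reached from an earlier root
        parent[root] = None
        delta[root] = counts[root]
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for c in alphabet:
                v = u[1:] + c
                if v in counts and v not in parent:
                    parent[v] = u
                    delta[v] = counts[v] - counts[u]
                    queue.append(v)
    return parent, delta


def abundance(kmer, parent, delta):
    """Recover an abundance as the sum of deltas on the path to the root."""
    total = 0
    while kmer is not None:
        total += delta[kmer]
        kmer = parent[kmer]
    return total


if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTACG"]
    counts = kmer_counts(reads, 3)
    parent, delta = build_branching(counts)
    for kmer in sorted(counts):
        print(kmer, delta[kmer], abundance(kmer, parent, delta))
```

Because adjacent k-mers in a read-set tend to have similar abundances, most stored deltas are small, which is what makes them cheap to encode. The root-to-node sum computed by abundance is a partial sum on an edge-weighted tree, the kind of query the abstract mentions as a by-product.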

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Theory of computation → Data structures design and analysis
Keywords
  • weighted de Bruijn graphs
  • k-mer annotation
  • compressed data structures
  • partial sums
