Rainbowfish: A Succinct Colored de Bruijn Graph Representation

Authors Fatemeh Almodaresi, Prashant Pandey, Rob Patro



PDF
Thumbnail PDF

File

LIPIcs.WABI.2017.18.pdf
  • Filesize: 0.52 MB
  • 15 pages

Document Identifiers

Author Details

Fatemeh Almodaresi
Prashant Pandey
Rob Patro

Cite As Get BibTex

Fatemeh Almodaresi, Prashant Pandey, and Rob Patro. Rainbowfish: A Succinct Colored de Bruijn Graph Representation. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 18:1-18:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.WABI.2017.18

Abstract

The colored de Bruijn graph— a variant of the de Bruijn graph which associates each edge (i.e., k-mer) with some set of colors - is an increasingly important combinatorial structure in computational biology. Iqbal et al. demonstrated the utility of this structure for representing and assembling a collection (population) of genomes, and showed how it can be used to accurately detect genetic variants. Muggli et al. introduced VARI, a representation of the colored de Bruijn graph that adopts the BOSS representation for the de Bruijn graph topology and achieves considerable savings in space over Cortex, albeit with some sacrifice in speed. The memory-efficient representation of VARI allows the colored de Bruijn graph to be constructed and analyzed for large datasets, beyond what is possible with Cortex.

In this paper, we introduce Rainbowfish, a succinct representation of the color information of the colored de Bruijn graph that reduces the space usage even further. Our representation also uses BOSS to represent the de Bruijn graph, but decomposes the color sets based on an equivalence relation and exploits the inherent skewness in the distribution of these color sets. The Rainbowfish representation is compressed based on the 0th-order entropy of the color sets, which can lead to a significant reduction in the space required to store the relevant information for each edge. In practice, Rainbowfish achieves up to a 20x improvement in space over VARI. Rainbowfish is written in C++11 and is available at https://github.com/COMBINE-lab/rainbowfish.

Subject Classification

Keywords
  • de Bruijn graph
  • succinct data structures
  • rank and select operation
  • colored de Bruijn graph

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In Proceedings of the International Workshop on Algorithms in Bioinformatics, pages 225-235. Springer, 2012. Google Scholar
  2. Mathilde Causse, Nelly Desplat, Laura Pascual, Marie-Christine Le Paslier, Christopher Sauvage, Guillaume Bauchet, Aurélie Bérard, Rémi Bounon, Maria Tchoumakov, Dominique Brunel, et al. Whole genome resequencing in tomato reveals variation associated with introgression and breeding events. BMC genomics, 14(1):791, 2013. Google Scholar
  3. Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, and Agnieszka Debudaj-Grabysz. KMC 2: Fast and resource-frugal k-mer counting. Bioinformatics, 31(10):1569-1576, 2015. Google Scholar
  4. Erwan Drezen, Guillaume Rizk, Rayan Chikhi, Charles Deltel, Claire Lemaitre, Pierre Peterlongo, and Dominique Lavenier. Gatb: Genome assembly &analysis tool box. Bioinformatics, 30(20):2959-2961, 2014. Google Scholar
  5. Peter Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM (JACM), 21(2):246-260, 1974. Google Scholar
  6. Robert Mario Fano. On the number of bits required to implement an associative memory. Massachusetts Institute of Technology, Project MAC, 1971. Google Scholar
  7. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 390-398. IEEE, 2000. Google Scholar
  8. Simon Gog. Succinct data structure library. https://github.com/simongog/sdsl-lite, 2017. [online; accessed 01-Feb-2017].
  9. Rodrigo González, Szymon Grabowski, Veli Mäkinen, and Gonzalo Navarro. Practical implementation of rank and select queries. In Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA), pages 27-38, 2005. Google Scholar
  10. J. Harrow, A. Frankish, J. M. Gonzalez, E. Tapanari, M. Diekhans, F. Kokocinski, B. L. Aken, D. Barrell, A. Zadissa, S. Searle, I. Barnes, A. Bignell, V. Boychenko, T. Hunt, M. Kay, G. Mukherjee, J. Rajan, G. Despacio-Reyes, G. Saunders, C. Steward, R. Harte, M. Lin, C. Howald, A. Tanzer, T. Derrien, J. Chrast, N. Walters, S. Balasubramanian, B. Pei, M. Tress, J. M. Rodriguez, I. Ezkurdia, J. van Baren, M. Brent, D. Haussler, M. Kellis, A. Valencia, A. Reymond, M. Gerstein, R. Guigo, and T. J. Hubbard. GENCODE: The reference human genome annotation for the ENCODE project. Genome Research, 22(9):1760-1774, sep 2012. URL: http://dx.doi.org/10.1101/gr.135350.111.
  11. Guillaume Holley, Roland Wittler, and Jens Stoye. Bloom filter trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms Mol. Biol., 11:3, 2016. Google Scholar
  12. Zamin Iqbal, Mario Caccamo, Isaac Turner, Paul Flicek, and Gil McVean. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature genetics, 44(2):226-232, 2012. Google Scholar
  13. Guy Jacobson. Space-efficient static trees and graphs. In Foundations of Computer Science, 1989., 30th Annual Symposium on, pages 549-554. IEEE, 1989. Google Scholar
  14. Guy Joseph Jacobson. Succinct Static Data Structures. phdthesis, Carnegie Mellon University, 1988. AAI8918056. Google Scholar
  15. Martin D. Muggli. Vari. https://github.com/cosmo-team/cosmo/tree/VARI, 02 2017. Viewed Feb 3, 2017.
  16. Martin D. Muggli, Alexander Bowe, Noelle R. Noyes, Paul Morley, Keith Belk, Robert Raymond, Travis Gagie, Simon J. Puglisi, and Christina Boucher. Succinct Colored de Bruijn Graphs. Bioinformatics, 2017. Google Scholar
  17. Noelle R. Noyes, Xiang Yang, Lyndsey M. Linke, Roberta J. Magnuson, Adam Dettenwanger, Shaun Cook, Ifigenia Geornaras, Dale E. Woerner, Sheryl P. Gow, Tim A. McAllister, et al. Resistome diversity in cattle and the environment decreases during beef production. ELife, 5:e13195, 2016. Google Scholar
  18. Nuala A. O'Leary, Mathew W. Wright, J. Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput, Barbara Robbertse, Brian Smith-White, Danso Ako-Adjei, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic acids research, pages D733-D745, 2015. URL: http://dx.doi.org/10.1093/nar/gkv1189.
  19. Rob Patro, Stephen M. Mount, and Carl Kingsford. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nature biotechnology, 32(5):462-464, 2014. Google Scholar
  20. Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pages 233-242. Society for Industrial and Applied Mathematics, 2002. Google Scholar
  21. Patrick S. Schnable, Doreen Ware, Robert S. Fulton, Joshua C. Stein, Fusheng Wei, Shiran Pasternak, Chengzhi Liang, Jianwei Zhang, Lucinda Fulton, Tina A. Graves, et al. The b73 maize genome: complexity, diversity, and dynamics. science, 326(5956):1112-1115, 2009. Google Scholar
  22. David Swarbreck, Christopher Wilks, Philippe Lamesch, Tanya Z. Berardini, Margarita Garcia-Hernandez, Hartmut Foerster, Donghui Li, Tom Meyer, Robert Muller, Larry Ploetz, et al. The arabidopsis information resource (tair): gene structure and function annotation. Nucleic acids research, 36(suppl 1):D1009-D1014, 2008. Google Scholar
  23. Tsuyoshi Tanaka, Baltazar A. Antonio, Shoshi Kikuchi, Takashi Matsumoto, Yoshiaki Nagamura, Hisataka Numa, Hiroaki Sakai, Jianzhong Wu, Takeshi Itoh, Takuji Sasaki, et al. The rice annotation project database (rap-db): 2008 update. Nucleic Acids Research, 36(Supp 1):D1028-D1033, 2008. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail