Simulating the DNA Overlap Graph in Succinct Space

Authors: Diego Díaz-Domínguez, Travis Gagie, Gonzalo Navarro






Author Details

Diego Díaz-Domínguez
  • CeBiB - Center for Biotechnology and Bioengineering, University of Chile, Chile
  • Department of Computer Science, University of Chile, Chile
Travis Gagie
  • School of Computer Science and Telecommunications, Diego Portales University, Chile
  • CeBiB - Center for Biotechnology and Bioengineering, University of Chile, Chile
Gonzalo Navarro
  • CeBiB - Center for Biotechnology and Bioengineering, University of Chile, Chile
  • Department of Computer Science, University of Chile, Chile

Acknowledgements

We thank the reviewers for their helpful comments.

Cite As

Diego Díaz-Domínguez, Travis Gagie, and Gonzalo Navarro. Simulating the DNA Overlap Graph in Succinct Space. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 26:1-26:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019) https://doi.org/10.4230/LIPIcs.CPM.2019.26

Abstract

Converting a set of sequencing reads into a lossless, compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph (dBG) of some order k. Each has advantages over the other, depending on the application. Still, the ideal setting would be an index of the reads that is easy to build and can be adapted to any type of biological analysis. In this paper we propose rBOSS, a new data structure based on the Burrows-Wheeler Transform (BWT), which gets close to that ideal. rBOSS simultaneously encodes all the dBGs of a set of sequencing reads up to some order k, and for any dBG node v it can compute, in O(k) time, all the other nodes whose labels overlap the label of v by at least m characters, where m is a parameter. If we set k equal to the read length (assuming that all reads have the same length), then we can simulate the overlap graph of the read set. Instead of storing the edges of this graph explicitly, rBOSS computes them on the fly as we traverse the graph. Like most BWT-based structures, rBOSS is unidirectional, meaning that we can retrieve only the suffix overlaps of the nodes. However, we exploit the reverse-complement property of DNA to simulate bidirectionality. We implemented a genome assembler on top of rBOSS to demonstrate its usefulness. The experimental results show that, using k=100, our rBOSS-based assembler can process about 500K reads of 150 characters each (a FASTQ file of 185 MB) in less than 15 minutes, using 110 MB in total. It produces contigs with a mean size above 10,000, which is twice the size obtained by using a pure de Bruijn graph of fixed length k.
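To make the overlap notions in the abstract concrete, the following is a minimal Python sketch of the objects being simulated, not of rBOSS itself. It finds suffix overlaps of at least m characters by brute force (quadratic in the number of reads, with no BWT and no succinct structures), and then recovers prefix overlaps by running the same suffix-only routine on reverse complements. All function names and the toy reads are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Translation table for the DNA alphabet (assumes reads contain only A, C, G, T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def suffix_prefix_overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no overlap of at least `min_overlap` characters exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def suffix_overlaps(reads: List[str], v: int, min_overlap: int) -> List[Tuple[int, int]]:
    """Reads whose prefix matches a suffix of reads[v], as (index, overlap length)."""
    found = []
    for j, read in enumerate(reads):
        if j == v:
            continue
        length = suffix_prefix_overlap(reads[v], read, min_overlap)
        if length:
            found.append((j, length))
    return found


def prefix_overlaps(reads: List[str], v: int, min_overlap: int) -> List[Tuple[int, int]]:
    """Reads whose suffix matches a prefix of reads[v].

    Uses only the suffix-overlap routine: a suffix of u equals a prefix of v
    exactly when a suffix of rc(v) equals a prefix of rc(u).
    """
    rc_reads = [reverse_complement(read) for read in reads]
    return suffix_overlaps(rc_reads, v, min_overlap)


# Toy input: three reads of equal length.
reads = ["ACGTAC", "GTACGG", "TTACGT"]
forward = suffix_overlaps(reads, 0, 3)
backward = prefix_overlaps(reads, 0, 3)
print(forward, backward)
```

In this toy input, read 0 is followed by read 1 (its suffix GTAC is a prefix of GTACGG) and preceded by read 2 (whose suffix ACGT is a prefix of read 0). The second relation is found without any prefix query, which is the role reverse complements play for a unidirectional index.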

Subject Classification

ACM Subject Classification
  • Applied computing → Computational biology
  • Information systems → Data compression
Keywords
  • Overlap graph
  • de Bruijn graph
  • DNA sequencing
  • Succinct ordinal trees
