Simulating the DNA Overlap Graph in Succinct Space

Díaz-Domínguez, Diego; Gagie, Travis; Navarro, Gonzalo

doi:10.4230/LIPIcs.CPM.2019.26

Subject Classification

ACM Subject Classification

Applied computing → Computational biology
Information systems → Data compression

Keywords

Overlap graph
de Bruijn graph
DNA sequencing
Succinct ordinal trees

Abstract

Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph (dBG) of some order k. Each has advantages over the other depending on the application. Still, the ideal setting would be an index of the reads that is easy to build and can be adapted to any type of biological analysis. In this paper we propose rBOSS, a new data structure based on the Burrows-Wheeler Transform (BWT), which gets close to that ideal. Our rBOSS simultaneously encodes all the dBGs of a set of sequencing reads up to some order k, and for any dBG node v, it can compute in O(k) time all the other nodes whose labels have an overlap of at least m characters with the label of v, where m is a parameter. If we choose k equal to the read length (assuming that all reads have equal length), then we can simulate the overlap graph of the read set. Instead of storing the edges of this graph explicitly, rBOSS computes them on the fly as we traverse the graph. Like most BWT-based structures, rBOSS is unidirectional, meaning that we can retrieve only the suffix overlaps of the nodes. However, we exploit the properties of DNA reverse complements to simulate bidirectionality. We implemented a genome assembler on top of rBOSS to demonstrate its usefulness. The experimental results show that, using k=100, our rBOSS-based assembler can process about 500K reads of 150 characters each (a FASTQ file of 185 MB) in less than 15 minutes, using 110 MB in total. It produces contigs with a mean size over 10,000, which is twice the size obtained by using a pure de Bruijn graph of fixed length k.
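
To make the two notions in the abstract concrete (suffix overlaps of at least m characters, and reverse complements as a way to obtain the opposite direction), here is a minimal Python sketch. It is a naive quadratic illustration of the definitions only, not rBOSS or anything from the paper's implementation; the function names and the toy reads are assumptions made for this example.

```python
from itertools import permutations

# Translation table for DNA complements (assumes reads over A, C, G, T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def suffix_prefix_overlap(a: str, b: str, m: int) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b`.

    Returns 0 when the longest such overlap is shorter than `m`.
    """
    for length in range(min(len(a), len(b)), m - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_edges(reads, m):
    """Naively enumerate overlap-graph edges (a, b) with overlap >= m.

    This stores the edges explicitly, which is what a succinct
    structure is meant to avoid; it only illustrates the definition.
    """
    edges = {}
    for a, b in permutations(reads, 2):
        length = suffix_prefix_overlap(a, b, m)
        if length:
            edges[(a, b)] = length
    return edges


reads = ["ACGTAC", "GTACGG", "TACGGA"]  # hypothetical toy reads
edges = overlap_edges(reads, m=3)
```

The reverse-complement property can be checked on the same toy data: a suffix overlap from a to b has the same length as the suffix overlap from the reverse complement of b to the reverse complement of a, so prefix overlaps of a read can be recovered by querying suffix overlaps of its reverse complement.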

Cite As

Diego Díaz-Domínguez, Travis Gagie, and Gonzalo Navarro. Simulating the DNA Overlap Graph in Succinct Space. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 26:1-26:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019) https://doi.org/10.4230/LIPIcs.CPM.2019.26

Author Details

Diego Díaz-Domínguez

CeBiB - Center for Biotechnology and Bioengineering, University of Chile, Chile
Department of Computer Science, University of Chile, Chile

Travis Gagie

School of Computer Science and Telecommunications, Diego Portales University, Chile
CeBiB - Center for Biotechnology and Bioengineering, University of Chile, Chile

Gonzalo Navarro

CeBiB - Center for Biotechnology and Bioengineering, University of Chile, Chile
Department of Computer Science, University of Chile, Chile

