Economic Genome Assembly from Low Coverage Illumina and Nanopore Data

Authors: Thomas Gatter, Sarah von Löhneysen, Polina Drozdova, Tom Hartmann, Peter F. Stadler



File

LIPIcs.WABI.2020.10.pdf
  • Filesize: 3.71 MB
  • 22 pages

Author Details

Thomas Gatter
  • Bioinformatics Group, Department of Computer Science, University of Leipzig, Germany
  • Interdisciplinary Center of Bioinformatics, University of Leipzig, Germany
Sarah von Löhneysen
  • Bioinformatics Group, Department of Computer Science, University of Leipzig, Germany
  • Interdisciplinary Center of Bioinformatics, University of Leipzig, Germany
Polina Drozdova
  • Institute of Biology, Irkutsk State University, Russia
Tom Hartmann
  • Bioinformatics Group, Department of Computer Science, University of Leipzig, Germany
  • Interdisciplinary Center of Bioinformatics, University of Leipzig, Germany
Peter F. Stadler
  • Bioinformatics Group, Department of Computer Science, University of Leipzig, Germany
  • Interdisciplinary Center of Bioinformatics, University of Leipzig, Germany
  • Max-Planck-Institute for Mathematics in the Sciences, Leipzig, Germany
  • Institute for Theoretical Chemistry, University of Vienna, Austria
  • Facultad de Ciencias, Universidad Nacional de Colombia, Bogotá, Colombia
  • Santa Fe Institute, NM, USA

Cite As

Thomas Gatter, Sarah von Löhneysen, Polina Drozdova, Tom Hartmann, and Peter F. Stadler. Economic Genome Assembly from Low Coverage Illumina and Nanopore Data. In 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 10:1-10:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.WABI.2020.10

Abstract

Ongoing developments in genome sequencing have caused a fundamental paradigm shift in the field in recent years. With ever lower sequencing costs, projects are no longer limited by available raw data, but rather by computational demands. The high complexity of eukaryotic genomes, combined with increasing data sizes, places unique demands on methods for assembling full genomes. We describe a new approach to assemble genomes from a combination of low-coverage short and long reads. LazyB starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs, which is then reduced to a long-read overlap graph G. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts, step by step, subgraphs whose global properties approach a disjoint union of paths. First, a consistently oriented subgraph is extracted, which in a second step is reduced to a directed acyclic graph. In the next step, properties of proper interval graphs are used to extract contigs as maximum weight paths. These are translated into genomic sequences only in the final step. A prototype implementation of LazyB, written entirely in Python, not only yields significantly more accurate assemblies of the yeast and fruit fly genomes than state-of-the-art pipelines but also requires much less computational effort. Our findings demonstrate a new low-cost method that enables the assembly of even large genomes with low computational effort.
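
As a rough illustration of the path-extraction idea mentioned in the abstract, the following minimal Python sketch finds a maximum weight path in a small weighted directed acyclic graph, using a topological order (Kahn's algorithm) followed by dynamic programming. It is not LazyB's implementation: the function name `max_weight_path`, the edge-list input format, and the toy read names and overlap weights are assumptions made purely for illustration, and the sketch ignores the proper interval graph properties that the paper exploits.

```python
from collections import defaultdict, deque


def max_weight_path(edges):
    """Return (total_weight, path) for a heaviest path in a weighted DAG.

    `edges` is an iterable of (source, target, weight) triples
    (a hypothetical input format chosen for this sketch).
    Raises ValueError if the graph contains a directed cycle.
    """
    succ = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indegree[v] += 1
        nodes.update((u, v))
    if not nodes:
        return 0, []

    # Kahn's algorithm: repeatedly remove nodes without incoming edges.
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph is not acyclic")

    # Dynamic programming over the topological order: best[v] is the
    # weight of the heaviest path ending in v, pred[v] its predecessor.
    best = {n: 0 for n in nodes}
    pred = {n: None for n in nodes}
    for u in order:
        for v, w in succ[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                pred[v] = u

    # Trace back from the node where the heaviest path ends.
    end = max(order, key=lambda n: best[n])
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = pred[node]
    path.reverse()
    return best[end], path


# Toy example: nodes stand for long reads, weights for overlap support.
toy_edges = [("r1", "r2", 5), ("r2", "r3", 4), ("r1", "r3", 7), ("r3", "r4", 2)]
weight, path = max_weight_path(toy_edges)
print(weight, path)
```

In this toy graph the path through `r2` (total weight 11) is preferred over the direct edge from `r1` to `r3` (total weight 9), which shows why a global path criterion can select a different layout than a greedy choice of the single heaviest edge.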

Subject Classification

ACM Subject Classification
  • Theory of computation → Discrete optimization
  • Applied computing → Computational genomics
Keywords
  • Nanopore sequencing
  • Illumina sequencing
  • genome assembly
  • spanning tree
  • unitigs
  • anchors

