LRBinner: Binning Long Reads in Metagenomics Datasets

Authors: Anuradha Wickramarachchi, Yu Lin



Author Details

Anuradha Wickramarachchi
  • School of Computing, Australian National University, Canberra, Australia
Yu Lin
  • School of Computing, Australian National University, Canberra, Australia

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments. Furthermore, this research was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI Australia), an NCRIS enabled capability supported by the Australian Government.

Cite As

Anuradha Wickramarachchi and Yu Lin. LRBinner: Binning Long Reads in Metagenomics Datasets. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 11:1-11:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.WABI.2021.11

Abstract

Advances in metagenomics sequencing allow the study of microbial communities directly from their environments. Metagenomics binning is a key step in the species characterisation of microbial communities. Next-generation sequencing reads are usually assembled into contigs before binning, mainly because short reads carry limited information. Third-generation sequencing provides much longer reads, with lengths similar to those of contigs assembled from short reads. However, existing contig-binning tools cannot be applied directly to long reads because of the absence of coverage information and the presence of high error rates. The few existing long-read binning tools either use only composition or use composition and coverage information separately. As a result, they may ignore bins that correspond to low-abundance species or erroneously split bins that correspond to species with non-uniform coverage. Here we present LRBinner, a reference-free binning approach that combines composition and coverage information of complete long-read datasets. LRBinner also uses a distance-histogram-based clustering algorithm to extract clusters of varying sizes. The experimental results on both simulated and real datasets show that LRBinner achieves the best binning accuracy when compared against the baselines. Moreover, we show that binning reads with LRBinner prior to assembly reduces the computational resources required for assembly while attaining satisfactory assembly quality.
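
The abstract does not say how the composition information is computed. As an illustration only, a common way to represent read composition in binning tools is a normalised k-mer frequency vector, in which a k-mer and its reverse complement are counted together. The sketch below assumes k = 3 and uses hypothetical function names (`canonical_kmers`, `composition_vector`); it is not taken from the LRBinner implementation.

```python
from itertools import product
from typing import List

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical_kmers(k: int) -> List[str]:
    """Sorted canonical k-mers: the lexicographically smaller of each
    k-mer and its reverse complement."""
    all_kmers = map("".join, product("ACGT", repeat=k))
    return sorted({min(km, reverse_complement(km)) for km in all_kmers})


def composition_vector(read: str, k: int = 3) -> List[float]:
    """Normalised canonical k-mer frequencies of a read.

    k = 3 is an assumption made for this sketch. Windows containing
    characters other than A, C, G, T (for example N) are skipped.
    """
    index = {km: i for i, km in enumerate(canonical_kmers(k))}
    counts = [0] * len(index)
    read = read.upper()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[index[min(kmer, reverse_complement(kmer))]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)
```

For k = 3 this gives a 32-dimensional vector per read whose entries sum to 1 for any read containing at least one valid 3-mer. Counting a k-mer together with its reverse complement makes the vector independent of the strand from which the read was sequenced.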

Subject Classification

ACM Subject Classification
  • Applied computing → Bioinformatics
  • Applied computing → Computational genomics
Keywords
  • Metagenomics binning
  • long reads
  • machine learning
  • clustering
