LRBinner: Binning Long Reads in Metagenomics Datasets

Wickramarachchi, Anuradha; Lin, Yu

doi:10.4230/LIPIcs.WABI.2021.11

Abstract

Advances in metagenomic sequencing allow microbial communities to be studied directly from their environments. Metagenomics binning is a key step in characterising the species in a microbial community. Next-generation sequencing reads are usually assembled into contigs before binning, mainly because short reads carry limited information. Third-generation sequencing produces much longer reads, with lengths comparable to those of contigs assembled from short reads. However, existing contig-binning tools cannot be applied directly to long reads because long reads lack coverage information and have high error rates. The few existing long-read binning tools either use only composition or use composition and coverage information separately. As a result, they may miss bins that correspond to low-abundance species or erroneously split bins that correspond to species with non-uniform coverage. Here we present LRBinner, a reference-free binning approach that combines composition and coverage information from complete long-read datasets. LRBinner also uses a distance-histogram-based clustering algorithm to extract clusters of varying sizes. Experimental results on both simulated and real datasets show that LRBinner achieves better binning accuracy than the baselines. Moreover, we show that binning reads with LRBinner prior to assembly reduces the computational resources required for assembly while attaining satisfactory assembly quality.

