Anchorage Accurately Assembles Anchor-Flanked Synthetic Long Reads

Authors: Xiaofei Carl Zang, Xiang Li, Kyle Metcalfe, Tuval Ben-Yehezkel, Ryan Kelley, Mingfu Shao

Author Details

Xiaofei Carl Zang
  • Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
Xiang Li
  • Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
Kyle Metcalfe
  • Element Biosciences, San Diego, CA, USA
Tuval Ben-Yehezkel
  • Element Biosciences, San Diego, CA, USA
Ryan Kelley
  • Element Biosciences, San Diego, CA, USA
Mingfu Shao
  • Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
  • Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA

Acknowledgements

We thank Qimin Zhang and Qian Shi for constructive discussions and suggestions on this work.

Cite As

Xiaofei Carl Zang, Xiang Li, Kyle Metcalfe, Tuval Ben-Yehezkel, Ryan Kelley, and Mingfu Shao. Anchorage Accurately Assembles Anchor-Flanked Synthetic Long Reads. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 22:1-22:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.WABI.2024.22

Abstract

Modern sequencing technologies allow short sequence tags, known as anchors, to be added to both ends of a captured molecule. Anchors help in assembling the full-length sequence of a captured molecule because they accurately mark its endpoints. One representative anchor-enabled technology is LoopSeq Solo, a synthetic long read (SLR) sequencing protocol. LoopSeq Solo also achieves ultra-high sequencing depth and high purity of the short reads covering the entire captured molecule. Despite the availability of many assembly methods, constructing full-length sequences from such anchor-enabled, ultra-high-coverage sequencing data remains challenging because the underlying assembly graphs are complex and no existing algorithm specifically leverages anchors. We present Anchorage, a novel assembler that performs anchor-guided assembly for ultra-high-depth sequencing data. Anchorage starts with a k-mer-based approach for precise estimation of molecule lengths. It then formulates the assembly problem as finding an optimal path that connects the two nodes determined by the anchors in the underlying compact de Bruijn graph. Optimality is defined as maximizing the minimum node weight along the path while matching the estimated sequence length. Anchorage uses a modified dynamic programming algorithm to find the optimal path efficiently. Through both simulations and real data, we show that Anchorage outperforms existing assembly methods, particularly in the presence of sequencing artifacts. Anchorage fills the gap in assembling anchor-enabled data, and we anticipate its broad use as anchor-enabled sequencing technologies become prevalent. Anchorage is freely available at https://github.com/Shao-Group/anchorage; the scripts and documents needed to reproduce all experiments in this manuscript are available at https://github.com/Shao-Group/anchorage-test.
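
The path formulation in the abstract can be illustrated with a small sketch. This is not the Anchorage implementation, and the abstract does not specify the details of its modified dynamic programming algorithm. The sketch assumes a toy graph in which every node carries a weight (for example, a k-mer abundance) and a length (the number of bases the node contributes to the path), and it searches for a path between the two anchor nodes that maximizes the minimum node weight while landing within a tolerance of an estimated length. The function name `bottleneck_path`, the `tol` parameter, and the input layout are hypothetical.

```python
from collections import defaultdict


def bottleneck_path(nodes, edges, source, target, est_len, tol=0):
    """Find a source-to-target path whose total length is within `tol` of
    `est_len` and whose minimum node weight is as large as possible.

    nodes: {name: (weight, length)}; length = bases the node contributes (>= 1)
    edges: {name: [successor names]}
    Returns (path, bottleneck_weight, total_length), or None if no path fits.
    """
    if any(length < 1 for _, length in nodes.values()):
        raise ValueError("every node must contribute at least one base")
    max_len = est_len + tol

    # best[l][v] = (bottleneck, previous state) over paths source..v of length l.
    # Lengths strictly increase along a path, so cycles in the graph are safe.
    best = defaultdict(dict)
    w0, l0 = nodes[source]
    if l0 > max_len:
        return None
    best[l0][source] = (w0, None)

    # States of length l are final once all shorter lengths are processed.
    for l in range(l0, max_len + 1):
        for v, (bn, _) in list(best[l].items()):
            for u in edges.get(v, ()):
                wu, lu = nodes[u]
                nl = l + lu
                if nl > max_len:
                    continue
                cand = min(bn, wu)  # bottleneck after appending u
                if u not in best[nl] or cand > best[nl][u][0]:
                    best[nl][u] = (cand, (l, v))

    # Among admissible lengths, prefer the largest bottleneck, then the
    # length closest to the estimate.
    finals = [
        (best[l][target][0], -abs(l - est_len), l)
        for l in range(max(est_len - tol, 0), max_len + 1)
        if target in best[l]
    ]
    if not finals:
        return None
    bottleneck, _, length = max(finals)

    # Walk the stored predecessors back to the source.
    path, state = [], (length, target)
    while state is not None:
        l, v = state
        path.append(v)
        state = best[l][v][1]
    return path[::-1], bottleneck, length


# Toy graph: S and T stand for the two anchor nodes (illustrative values only).
nodes = {"S": (10, 3), "A": (8, 4), "B": (2, 4), "C": (9, 2), "T": (10, 3)}
edges = {"S": ["A", "B"], "A": ["T", "C"], "B": ["T"], "C": ["T"]}
print(bottleneck_path(nodes, edges, "S", "T", est_len=10))
```

In this toy graph, two paths have total length 10, one through the low-weight node B and one through A; the sketch selects the path through A because its weakest node has the higher weight. Indexing the table by (length, node) keeps the search polynomial in the estimated length and the graph size, which is one plausible reason to prefer dynamic programming over enumerating paths.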

Subject Classification

ACM Subject Classification
  • Applied computing → Molecular sequence analysis
Keywords
  • Genome assembly
  • de Bruijn graph
  • synthetic long reads
  • anchor-guided assembly
  • LoopSeq
