Applying the Safe-And-Complete Framework to Practical Genome Assembly

Authors Sebastian Schmidt, Santeri Toivonen, Paul Medvedev, Alexandru I. Tomescu



File

LIPIcs.WABI.2024.8.pdf
  • Filesize: 0.98 MB
  • 16 pages

Author Details

Sebastian Schmidt
  • Department of Computer Science, University of Helsinki, Finland
Santeri Toivonen
  • Department of Computer Science, University of Helsinki, Finland
Paul Medvedev
  • Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
  • Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
  • Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
Alexandru I. Tomescu
  • Department of Computer Science, University of Helsinki, Finland

Acknowledgements

PM would like to thank John Hutton for early attempts to extend omnitigs to work in practice [Hutton, 2018]. The authors wish to thank the Finnish Computing Competence Infrastructure (FCCI) for supporting this project with computational and data storage resources.

Cite As

Sebastian Schmidt, Santeri Toivonen, Paul Medvedev, and Alexandru I. Tomescu. Applying the Safe-And-Complete Framework to Practical Genome Assembly. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 8:1-8:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.WABI.2024.8

Abstract

Despite the long history of genome assembly research, there remains a large gap between the theoretical and practical work. There is practical software with little theoretical underpinning of accuracy on one hand and theoretical algorithms which have not been adopted in practice on the other. In this paper we attempt to bridge the gap between theory and practice by showing how the theoretical safe-and-complete framework can be integrated into existing assemblers in order to improve contiguity. The optimal algorithm in this framework, called the omnitig algorithm, has not been used in practice due to its complexity and its lack of robustness to real data. Instead, we pursue a simplified notion of omnitigs (simple omnitigs), giving an efficient algorithm to compute them and demonstrating their safety under certain conditions. We modify two assemblers (wtdbg2 and Flye) by replacing their unitig algorithm with the simple omnitig algorithm. We test our modifications using real HiFi data from the D. melanogaster and the C. elegans genomes. Our modified algorithms lead to a substantial improvement in alignment-based contiguity, with negligible additional computational costs and either no or a small increase in the number of misassemblies.
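For readers unfamiliar with the baseline that the abstract says is replaced, the following is a minimal sketch of a unitig computation, understood here as enumerating the maximal non-branching paths of a directed assembly graph. It is not the simple omnitig algorithm from the paper and not the code used by wtdbg2 or Flye; the function name `unitigs` and the edge-list input format are assumptions made for illustration only.

```python
from collections import defaultdict


def unitigs(edges):
    """Return the maximal non-branching paths of a directed graph.

    `edges` is an iterable of (source, target) node pairs (a hypothetical
    input format). Each path is returned as a list of nodes.
    """
    out_edges = defaultdict(list)
    in_deg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        out_edges[u].append(v)
        in_deg[v] += 1
        nodes.update((u, v))

    def is_one_in_one_out(n):
        # Only such nodes may appear in the interior of a unitig.
        return in_deg[n] == 1 and len(out_edges[n]) == 1

    paths = []
    # Start a path at every edge leaving a branching (or source/sink) node
    # and extend it while the current end node is one-in-one-out.
    for n in sorted(nodes):
        if is_one_in_one_out(n):
            continue
        for w in out_edges[n]:
            path = [n, w]
            while is_one_in_one_out(w):
                w = out_edges[w][0]
                path.append(w)
            paths.append(path)

    # Remaining one-in-one-out nodes lie on isolated cycles.
    visited = {x for p in paths for x in p}
    for n in sorted(nodes):
        if n in visited or not is_one_in_one_out(n):
            continue
        cycle = [n]
        w = out_edges[n][0]
        while w != n:
            visited.add(w)
            cycle.append(w)
            w = out_edges[w][0]
        visited.add(n)
        cycle.append(n)
        paths.append(cycle)
    return paths


if __name__ == "__main__":
    toy_graph = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e")]
    for path in unitigs(toy_graph):
        print(" -> ".join(path))
```

In the toy graph, the path through a, b and c is reported as one unitig and is cut at c, where the graph branches. According to the abstract, the modified assemblers replace this kind of step with the simple omnitig algorithm in order to improve contiguity.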

Subject Classification

ACM Subject Classification
  • Applied computing → Computational biology
  • Mathematics of computing → Paths and connectivity problems
  • Theory of computation → Graph algorithms analysis
Keywords
  • Genome assembly
  • Omnitigs
  • Safe-and-complete framework
  • graph algorithm
  • HiFi sequencing data
  • Assembly evaluation

References

  1. Anton Bankevich, Andrey V Bzikadze, Mikhail Kolmogorov, Dmitry Antipov, and Pavel A Pevzner. Multiplex de Bruijn graphs enable genome assembly from long, high-fidelity reads. Nature biotechnology, pages 1-7, 2022.
  2. Guy Bresler, Ma'ayan Bresler, and David Tse. Optimal assembly for high throughput shotgun sequencing. BMC Bioinformatics, 14(S5), April 2013. URL: https://doi.org/10.1186/1471-2105-14-s5-s18.
  3. Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian Schmidt, Alexandru I Tomescu, and Elia C Zirondelli. The hydrostructure: a universal framework for safe and complete algorithms for genome assembly. arXiv preprint arXiv:2011.12635, 2020.
  4. Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian S. Schmidt, Alexandru I. Tomescu, and Elia C. Zirondelli. Cut paths and their remainder structure, with applications. In Petra Berenbrink et al., editors, STACS 2023, volume 254 of LIPIcs, pages 17:1-17:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. URL: https://doi.org/10.4230/LIPICS.STACS.2023.17.
  5. Massimo Cairo, Paul Medvedev, Nidia Obscura Acosta, Romeo Rizzi, and Alexandru I Tomescu. An optimal O(nm) algorithm for enumerating all walks common to all closed edge-covering walks of a graph. ACM Transactions on Algorithms (TALG), 15(4):1-17, 2019.
  6. Massimo Cairo, Romeo Rizzi, Alexandru I. Tomescu, and Elia C. Zirondelli. Genome assembly, from practice to theory: Safe, complete and Linear-Time. ACM Trans. Algorithms, 20(1):4:1-4:26, 2024. URL: https://doi.org/10.1145/3632176.
  7. Haoyu Cheng, Gregory T Concepcion, Xiaowen Feng, Haowen Zhang, and Heng Li. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature methods, 18(2):170-175, 2021.
  8. Rayan Chikhi, Antoine Limasset, and Paul Medvedev. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 32(12):i201-i208, 2016.
  9. Andrea Cracco and Alexandru I Tomescu. Extremely fast construction and querying of compacted and colored de Bruijn graphs with ggcat. Genome Research, 33(7):1198-1207, 2023.
  10. C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282(5396):2012-2018, 1998.
  11. Loukas Georgiadis, Giuseppe F Italiano, and Nikos Parotsidis. Strong connectivity in directed graphs under failures, with applications. SIAM Journal on Computing, 49(5):865-926, 2020.
  12. John Hutton. Extended safe contigs in the face of incomplete coverage. Master's thesis, Pennsylvania State University, 2018.
  13. Benjamin Grant Jackson. Parallel methods for short read assembly. Iowa State University, Ph.D. thesis, 2009.
  14. Carl Kingsford, Michael C Schatz, and Mihai Pop. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics, 11(1), 2010. URL: https://doi.org/10.1186/1471-2105-11-21.
  15. Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin, and Pavel A Pevzner. Assembly of long, error-prone reads using repeat graphs. Nature biotechnology, 37(5):540-546, 2019.
  16. Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094-3100, 2018.
  17. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-scale Algorithm Design: Bioinformatics in the Era of High-throughput Sequencing. Cambridge University Press, 2023.
  18. Paul Medvedev. Theoretical analysis of sequencing bioinformatics algorithms and beyond. Commun. ACM, 66(7):118-125, June 2023. URL: https://doi.org/10.1145/3571723.
  19. Paul Medvedev, Konstantinos Georgiou, Gene Myers, and Michael Brudno. Computability of models for sequence assembly. In WABI, pages 289-301, 2007.
  20. Paul Medvedev and Mihai Pop. What do eulerian and hamiltonian cycles have to do with genome assembly? PLOS Computational Biology, 17(5):e1008928, May 2021. URL: https://doi.org/10.1371/journal.pcbi.1008928.
  21. Alla Mikheenko, Andrey Prjibelski, Vladislav Saveliev, Dmitry Antipov, and Alexey Gurevich. Versatile genome assembly evaluation with quast-lg. Bioinformatics, 34(13):i142-i150, 2018.
  22. Felix Mölder, Kim Philipp Jablonski, Brice Letcher, Michael B Hall, Christopher H Tomkins-Tinch, Vanessa Sochat, Jan Forster, Soohyun Lee, Sven O Twardziok, Alexander Kanitz, et al. Sustainable data analysis with snakemake. F1000Research, 10, 2021.
  23. Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V Bzikadze, Alla Mikheenko, Mitchell R Vollger, Nicolas Altemose, Lev Uralsky, Ariel Gershman, et al. The complete sequence of a human genome. Science, 376(6588):44-53, 2022.
  24. Sergey Nurk, Brian P Walenz, Arang Rhie, Mitchell R Vollger, Glennis A Logsdon, Robert Grothe, Karen H Miga, Evan E Eichler, Adam M Phillippy, and Sergey Koren. Hicanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome research, 30(9):1291-1305, 2020.
  25. Nidia Obscura Acosta, Veli Mäkinen, and Alexandru I Tomescu. A safe and complete algorithm for metagenomic assembly. Algorithms for Molecular Biology, 13(1):1-12, 2018.
  26. Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences, 98(17):9748-9753, 2001.
  27. Amatur Rahman and Paul Medvedev. Assembler artifacts include misassembly because of unsafe unitigs and underassembly because of bidirected graphs. Genome Research, 32(9):1746-1753, 2022.
  28. Arang Rhie, Shane A McCarthy, Olivier Fedrigo, Joana Damas, Giulio Formenti, Sergey Koren, Marcela Uliano-Silva, William Chow, Arkarachai Fungtammasan, Juwan Kim, et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature, 592(7856):737-746, 2021.
  29. Jue Ruan and Heng Li. Fast and accurate long-read assembly with wtdbg2. Nature methods, 17(2):155-158, 2020.
  30. Steven L Salzberg, Adam M Phillippy, Aleksey Zimin, Daniela Puiu, Tanja Magoc, Sergey Koren, Todd J Treangen, Michael C Schatz, Arthur L Delcher, Michael Roberts, et al. Gage: A critical evaluation of genome assemblies and assembly algorithms. Genome research, 22(3):557-567, 2012.
  31. Sebastian Schmidt. Flye YV. https://github.com/sebschmi/Flye, 2024.
  32. Sebastian Schmidt. homopolymer-compress-rs. https://github.com/sebschmi/homopolymer-compress-rs, 2024.
  33. Sebastian Schmidt. practical-omnitigs, 2024. Software, swhId: https://archive.softwareheritage.org/swh:1:rev:bb1de69873c6b48f183e51bca2f48d2a057b8b64;origin=https://github.com/algbio/practical-omnitigs;visit=swh:1:snp:04e6ac0423d201dfab2c8d8ebe834756a6f88de9 (visited on 2024-08-14). URL: https://github.com/algbio/practical-omnitigs.
  34. Sebastian Schmidt. QUAST 5.0.2 modified to be robust against overlapping contigs. https://github.com/sebschmi/quast, 2024.
  35. Sebastian Schmidt. wtdbg2-homopolymer-decompression. https://github.com/sebschmi/wtdbg2-homopolymer-decompression, 2024.
  36. Sebastian Schmidt. wtdbg2 YV. https://github.com/sebschmi/wtdbg2, 2024.
  37. Alexandru I Tomescu and Paul Medvedev. Safe and complete contig assembly through omnitigs. Journal of Computational Biology, 24(6):590-602, 2017.
  38. Andy B Yoo, Morris A Jette, and Mark Grondona. Slurm: Simple linux utility for resource management. In Workshop on job scheduling strategies for parallel processing, pages 44-60. Springer, 2003.