Genome Assembly, from Practice to Theory: Safe, Complete and Linear-Time

Authors: Massimo Cairo, Romeo Rizzi, Alexandru I. Tomescu, Elia C. Zirondelli



Author Details

Massimo Cairo
  • Department of Computer Science, University of Helsinki, Finland
Romeo Rizzi
  • Department of Computer Science, University of Verona, Italy
Alexandru I. Tomescu
  • Department of Computer Science, University of Helsinki, Finland
Elia C. Zirondelli
  • Department of Mathematics, University of Trento, Italy
  • Department of Computer Science, University of Verona, Italy

Acknowledgements

We thank Sebastian Schmidt for useful comments, including the observation that the bound on the total length of all maximal macrotigs can be improved to O(n) (from O(m) initially), and Shahbaz Khan and Bastien Cazaux for helpful discussions and comments.

Cite As

Massimo Cairo, Romeo Rizzi, Alexandru I. Tomescu, and Elia C. Zirondelli. Genome Assembly, from Practice to Theory: Safe, Complete and Linear-Time. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 43:1-43:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ICALP.2021.43

Abstract

Genome assembly asks to reconstruct an unknown string from many shorter substrings of it. Even though it is one of the key problems in Bioinformatics, it is generally lacking major theoretical advances. Its hardness stems both from practical issues (size and errors of real data), and from the fact that problem formulations inherently admit multiple solutions. Given these, at their core, most state-of-the-art assemblers are based on finding non-branching paths (unitigs) in an assembly graph. While such paths constitute only partial assemblies, they are likely to be correct. More precisely, if one defines a genome assembly solution as a closed arc-covering walk of the graph, then unitigs appear in all solutions, being thus safe partial solutions. Until recently, it was open what are all the safe walks of an assembly graph. Tomescu and Medvedev (RECOMB 2016) characterized all such safe walks (omnitigs), thus giving the first safe and complete genome assembly algorithm. Even though omnitig finding was later improved to quadratic time, it remained open whether the crucial linear-time feature of finding unitigs can be attained with omnitigs. We answer this question affirmatively, by describing a surprising O(m)-time algorithm to identify all maximal omnitigs of a graph with n nodes and m arcs, notwithstanding the existence of families of graphs with Θ(mn) total maximal omnitig size. This is based on the discovery of a family of walks (macrotigs) with the property that all the non-trivial omnitigs are univocal extensions of subwalks of a macrotig. This has two consequences: (1) A linear-time output-sensitive algorithm enumerating all maximal omnitigs. (2) A compact O(m) representation of all maximal omnitigs, which allows, e.g., for O(m)-time computation of various statistics on them. Our results close a long-standing theoretical question inspired by practical genome assemblers, originating with the use of unitigs in 1995. 
We envision our results to be at the core of a reverse transfer from theory to practical and complete genome assembly programs, as has been the case for other key Bioinformatics problems.
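
As an illustration of the unitig notion the abstract starts from, the following is a minimal Python sketch of the classic way to list maximal non-branching paths of a directed graph in time linear in the number of arcs. It is not the omnitig or macrotig algorithm of the paper. The function name `maximal_unitigs`, the arc-list input format, and the node-based definition used here (every internal node has exactly one incoming and one outgoing arc) are assumptions made for this example.

```python
def maximal_unitigs(arcs):
    """Return maximal non-branching paths of a directed graph as node lists.

    `arcs` is an iterable of (tail, head) pairs. A path is non-branching
    when every internal node has in-degree 1 and out-degree 1.
    (Hypothetical helper, written only to illustrate the unitig notion.)
    """
    out = {}      # node -> list of out-neighbours
    indeg = {}    # node -> number of incoming arcs
    nodes = {}    # insertion-ordered set of nodes
    for u, v in arcs:
        out.setdefault(u, []).append(v)
        indeg[v] = indeg.get(v, 0) + 1
        nodes[u] = None
        nodes[v] = None

    def one_in_one_out(v):
        return indeg.get(v, 0) == 1 and len(out.get(v, [])) == 1

    paths = []
    visited = set()  # 1-in-1-out nodes already placed inside some path

    # Start a path on every arc leaving a branching (or end) node and
    # extend it while the current endpoint is 1-in-1-out.
    for v in nodes:
        if one_in_one_out(v):
            continue
        for w in out.get(v, []):
            path = [v, w]
            while one_in_one_out(w):
                visited.add(w)
                w = out[w][0]
                path.append(w)
            paths.append(path)

    # Any 1-in-1-out node not reached above lies on an isolated cycle.
    for v in nodes:
        if one_in_one_out(v) and v not in visited:
            cycle = [v]
            visited.add(v)
            w = out[v][0]
            while w != v:
                visited.add(w)
                cycle.append(w)
                w = out[w][0]
            cycle.append(v)
            paths.append(cycle)
    return paths


# A small strongly connected example graph (made up for illustration).
example_arcs = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "a")]
example_unitigs = maximal_unitigs(example_arcs)
```

Each arc is traversed a constant number of times, which is the linear-time behaviour the abstract refers to for unitigs. Omnitigs generalize these paths to all safe walks and require the more involved machinery described in the paper.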

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Paths and connectivity problems
  • Theory of computation → Graph algorithms analysis
  • Applied computing → Computational biology
Keywords
  • Graph algorithm
  • strong connectivity
  • reachability under failures

References

  1. Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In Venkatesan Guruswami, editor, IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 59-78. IEEE Computer Society, 2015. URL: https://doi.org/10.1109/FOCS.2015.14.
  2. Nidia Obscura Acosta, Veli Mäkinen, and Alexandru I. Tomescu. A safe and complete algorithm for metagenomic assembly. Algorithms for Molecular Biology, 13(1):3:1-3:12, 2018. URL: https://doi.org/10.1186/s13015-018-0122-7.
  3. Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403-410, 1990.
  4. Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Rocco A. Servedio and Ronitt Rubinfeld, editors, Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 51-58. ACM, 2015. URL: https://doi.org/10.1145/2746539.2746612.
  5. Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? In Irit Dinur, editor, IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 457-466. IEEE Computer Society, 2016. URL: https://doi.org/10.1109/FOCS.2016.56.
  6. Djamal Belazzougui. Linear time construction of compressed text indices in compact space. In David B. Shmoys, editor, Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 148-193. ACM, 2014. URL: https://doi.org/10.1145/2591796.2591885.
  7. Djamal Belazzougui and Simon J. Puglisi. Range predecessor and Lempel-Ziv parsing. In Robert Krauthgamer, editor, Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 2053-2071. SIAM, 2016. URL: https://doi.org/10.1137/1.9781611974331.ch143.
  8. Sébastien Boisvert, François Laviolette, and Jacques Corbeil. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Journal of computational biology, 17(11):1519-1533, 2010.
  9. G. Bresler, M. Bresler, and D. Tse. Optimal Assembly for High Throughput Shotgun Sequencing. BMC Bioinformatics, 14(Suppl 5):S18, 2013.
  10. Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian S. Schmidt, Alexandru I. Tomescu, and Elia C. Zirondelli. Genome assembly, a universal theoretical framework: unifying and generalizing the safe and complete algorithms. CoRR, abs/2011.12635, 2020. URL: http://arxiv.org/abs/2011.12635.
  11. Massimo Cairo, Paul Medvedev, Nidia Obscura Acosta, Romeo Rizzi, and Alexandru I. Tomescu. An Optimal O(nm) Algorithm for Enumerating All Walks Common to All Closed Edge-covering Walks of a Graph. ACM Trans. Algorithms, 15(4):48:1-48:17, 2019. URL: https://doi.org/10.1145/3341731.
  12. Massimo Cairo, Romeo Rizzi, Alexandru I Tomescu, and Elia C Zirondelli. Genome assembly, from practice to theory: safe, complete and linear-time. arXiv preprint, 2020. URL: http://arxiv.org/abs/2002.10498.
  13. Katarína Cechlárová. Persistency in the assignment and transportation problems. Mat. Meth. OR, 47(2):243-254, 1998. URL: https://doi.org/10.1007/BF01194399.
  14. Kun-Mao Chao, Ross C. Hardison, and Webb Miller. Locating well-conserved regions within a pairwise alignment. CABIOS, 9(4):387-396, 1993. URL: https://doi.org/10.1093/bioinformatics/9.4.387.
  15. Rayan Chikhi and Paul Medvedev. Informed and automated k-mer size selection for genome assembly. Bioinformatics, 30(1):31-37, June 2013. URL: https://doi.org/10.1093/bioinformatics/btt310.
  16. Marie Costa. Persistency in maximum cardinality bipartite matchings. Oper. Res. Lett., 15(3):143-9, 1994. URL: https://doi.org/10.1016/0167-6377(94)90049-3.
  17. Bartłomiej Dudek and Paweł Gawrychowski. Computing quartet distance is equivalent to counting 4-cycles. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 733-743, New York, NY, USA, 2019. Association for Computing Machinery. URL: https://doi.org/10.1145/3313276.3316390.
  18. Richard Durbin, Sean R Eddy, Anders Krogh, and Graeme Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.
  19. David Eppstein. K-best enumeration. Bulletin of the EATCS, 115, 2015. URL: http://eatcs.org/beatcs/index.php/beatcs/article/view/322.
  20. David Eppstein. k-best enumeration. In Ming-Yang Kao, editor, Encyclopedia of Algorithms, pages 1003-1006. Springer, New York, NY, 2016. URL: https://doi.org/10.1007/978-1-4939-2864-4_733.
  21. Massimo Equi, Roberto Grossi, Veli Mäkinen, and Alexandru I. Tomescu. On the complexity of string matching for graphs. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, July 9-12, 2019, Patras, Greece, volume 132 of LIPIcs, pages 55:1-55:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. URL: https://doi.org/10.4230/LIPIcs.ICALP.2019.55.
  22. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA, pages 390-398. IEEE Computer Society, 2000. URL: https://doi.org/10.1109/SFCS.2000.892127.
  23. Paolo Ferragina, Igor Nitto, and Rossano Venturini. On the bit-complexity of Lempel-Ziv compression. In Claire Mathieu, editor, Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009, New York, NY, USA, January 4-6, 2009, pages 768-777. SIAM, 2009. URL: https://doi.org/10.1137/1.9781611973068.
  24. A Friemann and S Schmitz. A new approach for displaying identities and differences among aligned amino acid sequences. Comput Appl Biosci, 8(3):261-265, June 1992.
  25. Loukas Georgiadis, Giuseppe F Italiano, and Nikos Parotsidis. Strong connectivity in directed graphs under failures, with applications. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1880-1899. SIAM, 2017.
  26. Meigu Guan. Graphic programming using odd and even points. Chinese Math., 1:237-277, 1962.
  27. A. Guénoche. Can we recover a sequence, just knowing all its subsequences of given length? Computer Applications in the Biosciences, 8(6):569-574, 1992. URL: http://dblp.uni-trier.de/db/journals/bioinformatics/bioinformatics8.html#Guenoche92.
  28. P. L. Hammer, P. Hansen, and B. Simeone. Vertices belonging to all or to no maximum stable sets of a graph. SIAM Journal on Algebraic Discrete Methods, 3(4):511-522, 1982. URL: https://doi.org/10.1137/0603052.
  29. Iu, V. L. Florent'ev, A. A. Khorlin, K. R. Khrapko, and V. V. Shik. Determination of the nucleotide sequence of DNA using hybridization with oligonucleotides. A new method. Doklady Akademii nauk SSSR, 303(6):1508-1511, 1988. URL: http://view.ncbi.nlm.nih.gov/pubmed/3250844.
  30. Benjamin Grant Jackson. Parallel methods for short read assembly. PhD thesis, Iowa State University, 2009.
  31. Evgeny Kapun and Fedor Tsarev. De Bruijn superwalk with multiplicities problem is NP-hard. BMC Bioinformatics, 14(Suppl 5):S7, 2013.
  32. John D. Kececioglu and Eugene W. Myers. Combinatorial algorithms for DNA sequence assembly. Algorithmica, 13(1/2):7-51, 1995.
  33. John Dimitri Kececioglu. Exact and approximation algorithms for DNA sequence reconstruction. PhD thesis, University of Arizona, Tucson, AZ, USA, 1992.
  34. Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 756-767. ACM, 2019. URL: https://doi.org/10.1145/3313276.3316368.
  35. Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In Ilias Diakonikolas, David Kempe, and Monika Henzinger, editors, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 827-840. ACM, 2018. URL: https://doi.org/10.1145/3188745.3188814.
  36. Carl Kingsford, Michael C Schatz, and Mihai Pop. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics, 11(1):21, 2010.
  37. Ka-Kit Lam, Asif Khalak, and David Tse. Near-optimal assembly for shotgun sequencing with noisy reads. BMC Bioinform., 15(S-9):S4, 2014. URL: https://doi.org/10.1186/1471-2105-15-S9-S4.
  38. Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4):357, 2012.
  39. Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam. Megahit: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 31(10):1674-1676, 2015.
  40. Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754-1760, 2009.
  41. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, 2015. URL: https://doi.org/10.1017/CBO9781139940023.
  42. Paul Medvedev. Modeling biological problems in computer science: a case study in genome assembly. Briefings in bioinformatics, 20(4):1376-1383, 2019.
  43. Paul Medvedev and Michael Brudno. Maximum likelihood genome assembly. Journal of computational biology, 16(8):1101-1116, 2009.
  44. Paul Medvedev, Konstantinos Georgiou, Gene Myers, and Michael Brudno. Computability of models for sequence assembly. In WABI, pages 289-301, 2007.
  45. Eugene W. Myers. The fragment assembly string graph. In ECCB/JBI, page 85, 2005.
  46. Niranjan Nagarajan and Mihai Pop. Parametric complexity of sequence assembly: theory and applications to next generation sequencing. Journal of computational biology, 16(7):897-908, 2009.
  47. Niranjan Nagarajan and Mihai Pop. Sequence assembly demystified. Nature Reviews Genetics, 14(3):157-167, 2013.
  48. Giuseppe Narzisi, Bud Mishra, and Michael C Schatz. On algorithmic complexity of biomolecular sequence assembly problem. In Algorithms for Computational Biology, pages 183-195. Springer, 2014.
  49. Hannu Peltola, Hans Söderlund, Jorma Tarhio, and Esko Ukkonen. Algorithms for some string matching problems arising in molecular genetics. In IFIP Congress, pages 59-64, 1983.
  50. P. A. Pevzner. l-Tuple DNA sequencing: computer analysis. Journal of Biomolecular Structure & Dynamics, 7(1):63-73, 1989.
  51. Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences, 98(17):9748-9753, 2001.
  52. Jue Ruan and Heng Li. Fast and accurate long-read assembly with wtdbg2. Nature Methods, 17(2):155-158, 2020. URL: https://doi.org/10.1038/s41592-019-0669-3.
  53. Ilan Shomorony, Samuel H. Kim, Thomas A. Courtade, and David N. C. Tse. Information-optimal genome assembly via sparse read-overlap graphs. Bioinform., 32(17):494-502, 2016. URL: https://doi.org/10.1093/bioinformatics/btw450.
  54. Alexandru I. Tomescu and Paul Medvedev. Safe and Complete Contig Assembly Via Omnitigs. In Mona Singh, editor, Research in Computational Molecular Biology - 20th Annual Conference, RECOMB 2016, Santa Monica, CA, USA, April 17-21, 2016, Proceedings, volume 9649 of Lecture Notes in Computer Science, pages 152-163. Springer, 2016. URL: https://doi.org/10.1007/978-3-319-31957-5_11.
  55. Alexandru I. Tomescu and Paul Medvedev. Safe and complete contig assembly through omnitigs. Journal of Computational Biology, 24(6):590-602, 2017.
  56. Martin Vingron and Patrick Argos. Determination of reliable regions in protein sequence alignments. Prot. Engin., 3(7):565-569, 1990. URL: https://doi.org/10.1093/protein/3.7.565.
  57. Virginia Vassilevska Williams, Joshua R. Wang, Richard Ryan Williams, and Huacheng Yu. Finding four-node subgraphs in triangle time. In Piotr Indyk, editor, Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pages 1671-1680. SIAM, 2015. URL: https://doi.org/10.1137/1.9781611973730.111.
  58. Fan Wu, Su Zhao, Bin Yu, Yan-Mei Chen, Wen Wang, Zhi-Gang Song, Yi Hu, Zhao-Wu Tao, Jun-Hua Tian, Yuan-Yuan Pei, Ming-Li Yuan, Yu-Ling Zhang, Fa-Hui Dai, Yi Liu, Qi-Min Wang, Jiao-Jiao Zheng, Lin Xu, Edward C. Holmes, and Yong-Zhen Zhang. A new coronavirus associated with human respiratory disease in China. Nature, 579(7798):265-269, 2020. URL: https://doi.org/10.1038/s41586-020-2008-3.
  59. M Zuker. Suboptimal sequence alignment in molecular biology. Alignment with error analysis. J Mol Biol, 221(2):403-420, September 1991.