Optimal Omnitig Listing for Safe and Complete Contig Assembly

Cairo, Massimo; Medvedev, Paul; Obscura Acosta, Nidia; Rizzi, Romeo; Tomescu, Alexandru I.

doi:10.4230/LIPIcs.CPM.2017.29

Abstract

Genome assembly is the problem of reconstructing a genome sequence from a set of reads from a sequencing experiment. Typical formulations of the assembly problem admit many genomic reconstructions in practice, and actual genome assemblers usually output contigs, namely substrings that are promised to occur in the genome. To bridge theory and practice, Tomescu and Medvedev [RECOMB 2016] reformulated contig assembly as finding all substrings common to all genomic reconstructions. They also gave a characterization of those walks (omnitigs) that are common to all closed edge-covering walks of a (directed) graph, a typical notion of genomic reconstruction. Finally, they proposed an algorithm for listing all maximal omnitigs, which launches an exhaustive visit from every edge.

In this paper, we prove new insights about the structure of omnitigs and solve several open questions about them. We combine these to achieve an O(nm)-time algorithm for outputting all the maximal omnitigs of a graph (with n nodes and m edges). This is also optimal, as we show families of graphs whose total omnitig length is Ω(nm). We implement this algorithm and show that it is 9-12 times faster in practice than the algorithm of Tomescu and Medvedev [RECOMB 2016].
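
To make the notion of a closed edge-covering walk concrete, the following minimal Python sketch checks whether a node sequence is such a walk, and whether a candidate contig occurs as a subwalk of it. It is only an illustration of the definitions, not the O(nm) listing algorithm of the paper. The function names, the representation of a walk as a list of nodes, and the toy graph are all assumptions made for this example, and the graph is assumed to have no parallel edges.

```python
from typing import Hashable, Iterable, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def is_closed_edge_covering_walk(edges: Iterable[Edge],
                                 walk: Sequence[Hashable]) -> bool:
    """Check that `walk` (nodes v0, v1, ..., vk) is closed and covers every edge.

    Assumes a directed graph without parallel edges, so that a pair of
    consecutive nodes identifies an edge.
    """
    edge_set = set(edges)
    if len(walk) < 2 or walk[0] != walk[-1]:
        return False  # not closed
    used = set(zip(walk, walk[1:]))
    # Equality means: only real edges are traversed, and none is missed.
    return used == edge_set


def occurs_in_closed_walk(sub: Sequence[Hashable],
                          walk: Sequence[Hashable]) -> bool:
    """Check that `sub` is a contiguous subwalk of the closed walk `walk`,
    allowing the occurrence to wrap around the start of the walk."""
    cycle = list(walk[:-1])  # drop the repeated start node
    if not cycle:
        return False
    n, k = len(cycle), len(sub)
    # Unroll the cycle often enough that every start position can be read
    # for k consecutive nodes.
    unrolled = cycle * (k // n + 2)
    target = list(sub)
    return any(unrolled[i:i + k] == target for i in range(n))


# Hypothetical toy graph with two different closed edge-covering walks.
EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "a")]
WALK_1 = ["a", "b", "c", "a", "b", "c", "d", "a"]
WALK_2 = ["a", "b", "c", "d", "a", "b", "c", "a"]

if __name__ == "__main__":
    for w in (WALK_1, WALK_2):
        print(is_closed_edge_covering_walk(EDGES, w),
              occurs_in_closed_walk(["a", "b", "c"], w))
```

Finding the walk a, b, c in these two reconstructions does not by itself show that it is safe: a safe walk must occur in every closed edge-covering walk of the graph, which is what the omnitig characterization captures without enumerating reconstructions.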

References

Donatella Firmani, Giuseppe F. Italiano, Luigi Laura, Alessio Orlandi, and Federico Santaroni. Computing strong articulation points and strong bridges in large scale graphs. In Ralf Klasing, editor, Proceedings of the 11th International Symposium on Experimental Algorithms (SEA 2012), volume 7276 of LNCS, pages 195-207, Berlin, Heidelberg, 2012. Springer-Verlag. URL: http://dx.doi.org/10.1007/978-3-642-30850-5_18.
Ramana M. Idury and Michael S. Waterman. A new algorithm for DNA sequence assembly. J. Comput. Biol., 2(2):291-306, 1995. URL: http://dx.doi.org/10.1089/cmb.1995.2.291.
Giuseppe F. Italiano, Luigi Laura, and Federico Santaroni. Finding strong bridges and strong articulation points in linear time. Theor. Comput. Sci., 447:74-84, August 2012. URL: http://dx.doi.org/10.1016/j.tcs.2011.11.011.
Benjamin Grant Jackson. Parallel methods for short read assembly. PhD thesis, Iowa State University, 2009. URL: http://lib.dr.iastate.edu/etd/10704.
Evgeny Kapun and Fedor Tsarev. De Bruijn superwalk with multiplicities problem is NP-hard. BMC Bioinformatics, 14(S-5):S7, 2013. URL: http://dx.doi.org/10.1186/1471-2105-14-S5-S7.
John D. Kececioglu and Eugene W. Myers. Combinatorial algorithms for DNA sequence assembly. Algorithmica, 13(1/2):7-51, 1995. URL: http://dx.doi.org/10.1007/BF01188580.
Carl Kingsford, Michael C. Schatz, and Mihai Pop. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics, 11(1):21, 2010. URL: http://dx.doi.org/10.1186/1471-2105-11-21.
Yuri P. Lysov, Vladimir L. Florentiev, Alexandr A. Khorlin, Konstantin R. Khrapko, and Valentine V. Shik. Determination of the nucleotide sequence of DNA using hybridization with oligonucleotides. A new method. Dokl. Akad. Nauk SSSR, 303(6):1508-1511, 1988. URL: http://view.ncbi.nlm.nih.gov/pubmed/3250844.
Paul Medvedev and Michael Brudno. Maximum likelihood genome assembly. J. Comput. Biol., 16(8):1101-1116, 2009. URL: http://dx.doi.org/10.1089/cmb.2009.0047.
Paul Medvedev, Konstantinos Georgiou, Gene Myers, and Michael Brudno. Computability of models for sequence assembly. In Raffaele Giancarlo and Sridhar Hannenhalli, editors, Proceedings of the 7th International Workshop on Algorithms in Bioinformatics (WABI 2007), volume 4645 of LNCS, pages 289-301. Springer, 2007. URL: http://dx.doi.org/10.1007/978-3-540-74126-8_27.
Gene Myers. Efficient local alignment discovery amongst noisy long reads. In Daniel G. Brown and Burkhard Morgenstern, editors, Proceedings of the 14th International Workshop on Algorithms in Bioinformatics (WABI 2014), volume 8701 of LNCS, pages 52-67. Springer, 2014. URL: http://dx.doi.org/10.1007/978-3-662-44753-6_5.
Niranjan Nagarajan and Mihai Pop. Parametric complexity of sequence assembly: Theory and applications to next generation sequencing. J. Comput. Biol., 16(7):897-908, 2009. URL: http://dx.doi.org/10.1089/cmb.2009.0005.
Giuseppe Narzisi, Bud Mishra, and Michael C. Schatz. On algorithmic complexity of biomolecular sequence assembly problem. In Adrian-Horia Dediu, Carlos Martín-Vide, and Bianca Truthe, editors, Proceedings of the 1st International Conference on Algorithms for Computational Biology (AlCoB 2014), volume 8542 of LNCS, pages 183-195. Springer, 2014. URL: http://dx.doi.org/10.1007/978-3-319-07953-0_15.
Pavel A. Pevzner. L-Tuple DNA sequencing: computer analysis. J. Biomol. Struct. Dyn., 7(1):63-73, August 1989. URL: http://www.tandfonline.com/doi/abs/10.1080/07391102.1989.10507752.
Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. U.S.A., 98(17):9748-9753, 2001. URL: http://dx.doi.org/10.1073/PNAS.171285098.
Jared T. Simpson and Richard Durbin. Efficient de novo assembly of large genomes using compressed data structures. Genome Res., 22(3):549-556, 2012. URL: http://dx.doi.org/10.1101/GR.126953.111.
Alexandru I. Tomescu and Paul Medvedev. Safe and complete contig assembly via omnitigs. In Mona Singh, editor, Proceedings of the 20th Annual Conference on Research in Computational Molecular Biology (RECOMB 2016), volume 9649 of LNCS, pages 152-163. Springer, 2016. URL: http://dx.doi.org/10.1007/978-3-319-31957-5_11.
Michael S. Waterman. Introduction to Computational Biology: Maps, Sequences and Genomes, volume 1 of Chapman & Hall/CRC Interdisciplinary Statistics. CRC Press, 1995. URL: https://www.crcpress.com/9780412993916.
