Optimal Construction of Hierarchical Overlap Graphs

Khan, Shahbaz

doi:10.4230/LIPIcs.CPM.2021.17

Subject Classification

ACM Subject Classification

Mathematics of computing → Trees
Theory of computation → Data compression
Theory of computation → Pattern matching

Keywords

Hierarchical Overlap Graphs
String algorithms
Genome assembly

Abstract

Genome assembly is a fundamental problem in bioinformatics: given a set of overlapping substrings of a genome, the aim is to reconstruct the source genome. The classical approaches to this problem use assembly graphs, such as de Bruijn graphs or overlap graphs, which maintain partial information about such overlaps. For genome assembly algorithms, these graphs present a trade-off between the overlap information stored and scalability. The Hierarchical Overlap Graph (HOG) was therefore proposed to overcome the limitations of both approaches. For a given set P of n strings, the first algorithm to compute the HOG was given by Cazaux and Rivals [IPL20], requiring O(||P||+n²) time and superlinear space, where ||P|| is the sum of the lengths of the strings in P. This was improved by Park et al. [SPIRE20] to O(||P|| log n) time and O(||P||) space using segment trees, and further to O(||P||(log n)/(log log n)) time in the word RAM model. Both results left open the problem of computing the HOG in optimal O(||P||) time and space. In this paper, we achieve the desired optimal bounds with a simple algorithm that does not use any complex data structures. At its core, our solution improves the classical result [IPL92] for a special case of the All Pairs Suffix Prefix (APSP) problem from O(||P||+n²) time to optimal O(||P||) time, which may be of independent interest.
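
As a point of reference for the APSP problem named in the abstract, the following is a minimal brute-force sketch of what is being computed: for every ordered pair of distinct strings, the longest suffix of the first that is a prefix of the second. It is not the algorithm of the paper and does not reach the stated bounds. The function names (`longest_overlap`, `all_pairs_suffix_prefix`, `total_length`) and the example set are assumptions made for illustration, and the sketch assumes the strings in P are pairwise distinct.

```python
from itertools import permutations


def longest_overlap(u: str, v: str) -> str:
    """Return the longest suffix of u that is also a prefix of v ("" if none)."""
    # Try candidate overlap lengths from longest to shortest.
    for k in range(min(len(u), len(v)), 0, -1):
        if u[-k:] == v[:k]:
            return u[-k:]
    return ""


def all_pairs_suffix_prefix(strings):
    """Naive APSP: longest overlap for every ordered pair of distinct strings.

    This is a quadratic-pairs brute force, shown only to make the problem
    statement concrete; it is far slower than the bounds in the abstract.
    """
    return {(u, v): longest_overlap(u, v) for u, v in permutations(strings, 2)}


def total_length(strings):
    """||P||: the sum of the lengths of the strings in P."""
    return sum(len(s) for s in strings)


# Hypothetical example set P with n = 3 strings.
P = ["aabaa", "aacd", "cdb"]
overlaps = all_pairs_suffix_prefix(P)
```

In this example `overlaps[("aabaa", "aacd")]` is `"aa"` and `overlaps[("aacd", "cdb")]` is `"cd"`, while `total_length(P)` is 12.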

Cite As

Shahbaz Khan. Optimal Construction of Hierarchical Overlap Graphs. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 17:1-17:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/LIPIcs.CPM.2021.17

References

Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333-340, 1975.
Dmitry Antipov, Anton I. Korobeynikov, Jeffrey S. McLean, and Pavel A. Pevzner. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinform., 32(7):1009-1015, 2016.
Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son K. Pham, Andrey D. Prjibelski, Alex Pyshkin, Alexander Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, and Pavel A. Pevzner. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 19(5):455-477, 2012.
Djamal Belazzougui and Fabio Cunial. Fully-functional bidirectional Burrows-Wheeler indexes and infinite-order de Bruijn graphs. In Nadia Pisanti and Solon P. Pissis, editors, 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, June 18-20, 2019, Pisa, Italy, volume 128 of LIPIcs, pages 10:1-10:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
Djamal Belazzougui, Travis Gagie, Veli Mäkinen, Marco Previtali, and Simon J. Puglisi. Bidirectional variable-order de Bruijn graphs. Int. J. Found. Comput. Sci., 29(8):1279-1295, 2018.
Avrim Blum, Tao Jiang, Ming Li, John Tromp, and Mihalis Yannakakis. Linear approximation of shortest superstrings. J. ACM, 41(4):630-647, 1994.
Christina Boucher, Alexander Bowe, Travis Gagie, Simon J. Puglisi, and Kunihiko Sadakane. Variable-order de Bruijn graphs. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, 2015 Data Compression Conference, DCC 2015, Snowbird, UT, USA, April 7-9, 2015, pages 383-392. IEEE, 2015.
Rodrigo Cánovas, Bastien Cazaux, and Eric Rivals. The compressed overlap index. CoRR, abs/1707.05613, 2017. URL: http://arxiv.org/abs/1707.05613.
Bastien Cazaux, Rodrigo Cánovas, and Eric Rivals. Shortest DNA cyclic cover in compressed space. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, 2016 Data Compression Conference, DCC 2016, Snowbird, UT, USA, March 30 - April 1, 2016, pages 536-545. IEEE, 2016.
Bastien Cazaux and Eric Rivals. Hierarchical overlap graph. Inf. Process. Lett., 155, 2020.
Mark de Berg, Otfried Cheong, Marc J. van Kreveld, and Mark H. Overmars. Computational geometry: algorithms and applications, 3rd Edition. Springer, 2008.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
Dan Gusfield, Gad M. Landau, and Baruch Schieber. An efficient algorithm for the all pairs suffix-prefix problem. Inf. Process. Lett., 41(4):181-185, 1992.
Shahbaz Khan. Optimal construction of hierarchical overlap graphs. CoRR, abs/2102.02873, 2021. URL: http://arxiv.org/abs/2102.02873.
Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323-350, 1977.
Jihyuk Lim and Kunsoo Park. A fast algorithm for the all-pairs suffix-prefix problem. Theor. Comput. Sci., 698:14-24, 2017.
Eugene W. Myers. The fragment assembly string graph. Bioinformatics, 21(2):79-85, 2005. URL: https://doi.org/10.1093/bioinformatics/bti1114.
Sergey Nurk, Dmitry Meleshko, Anton I. Korobeynikov, and Pavel A. Pevzner. metaSPAdes: A new versatile de novo metagenomics assembler. In Mona Singh, editor, Research in Computational Molecular Biology - 20th Annual Conference, RECOMB 2016, Santa Monica, CA, USA, April 17-21, 2016, Proceedings, volume 9649 of Lecture Notes in Computer Science, page 258. Springer, 2016.
Sangsoo Park, Sung Gwan Park, Bastien Cazaux, Kunsoo Park, and Eric Rivals. A linear time algorithm for constructing hierarchical overlap graphs. CoRR, abs/2102.12824, 2021 (accepted for publishing at CPM 2021). URL: http://arxiv.org/abs/2102.12824.
Sung Gwan Park, Bastien Cazaux, Kunsoo Park, and Eric Rivals. Efficient construction of hierarchical overlap graphs. In Christina Boucher and Sharma V. Thankachan, editors, String Processing and Information Retrieval - 27th International Symposium, SPIRE 2020, Orlando, FL, USA, October 13-15, 2020, Proceedings, volume 12303 of Lecture Notes in Computer Science, pages 277-290. Springer, 2020.
Hannu Peltola, Hans Söderlund, Jorma Tarhio, and Esko Ukkonen. Algorithms for some string matching problems arising in molecular genetics. In IFIP Congress, pages 59-64, 1983.
P. A. Pevzner. l-Tuple DNA sequencing: computer analysis. Journal of Biomolecular Structure & Dynamics, 7(1):63-73, 1989.
Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America, 98(17):9748-9753, 2001.
Jared T. Simpson and Richard Durbin. Efficient construction of an assembly string graph using the FM-index. Bioinform., 26(12):367-373, 2010.
Z. Sweedyk. A 2½-approximation algorithm for shortest superstring. SIAM J. Comput., 29(3):954-986, 1999.
William H. A. Tustumi, Simon Gog, Guilherme P. Telles, and Felipe A. Louza. An improved algorithm for the all-pairs suffix-prefix problem. J. Discrete Algorithms, 37:34-43, 2016.
Esko Ukkonen. A linear-time algorithm for finding approximate shortest common superstrings. Algorithmica, 5(3):313-323, 1990.
Daniel R. Zerbino and Ewan Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18(5):821-829, May 2008. URL: https://doi.org/10.1101/gr.074492.107.
