Optimal Construction of Hierarchical Overlap Graphs

Author: Shahbaz Khan





Author Details

Shahbaz Khan
  • University of Helsinki, Finland

Acknowledgements

I would like to thank Alexandru I. Tomescu for helpful discussions, critical review, and insightful suggestions that helped me refine the paper. I would also like to thank Veli Mäkinen for pointing out the similarity with the classical result for the APSP problem.

Cite As

Shahbaz Khan. Optimal Construction of Hierarchical Overlap Graphs. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 17:1-17:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.CPM.2021.17

Abstract

Genome assembly is a fundamental problem in bioinformatics: given a set of overlapping substrings of a genome, the aim is to reconstruct the source genome. Classical approaches use assembly graphs, such as de Bruijn graphs or overlap graphs, which maintain partial information about these overlaps. For genome assembly algorithms, these graphs present a trade-off between the overlap information stored and scalability. The Hierarchical Overlap Graph (HOG) was proposed to overcome the limitations of both approaches. For a given set P of n strings, the first algorithm to compute the HOG was given by Cazaux and Rivals [IPL20], requiring O(||P||+n²) time and superlinear space, where ||P|| is the total length of the strings in P. Park et al. [SPIRE20] improved this to O(||P|| log n) time and O(||P||) space using segment trees, and further to O(||P||(log n)/(log log n)) time in the word RAM model. Both papers posed as an open problem the computation of the HOG in optimal O(||P||) time and space. In this paper, we achieve these optimal bounds with a simple algorithm that uses no complex data structures. At its core, our solution improves the classical result [IPL92] for a special case of the All Pairs Suffix Prefix (APSP) problem from O(||P||+n²) time to optimal O(||P||) time, which may be of independent interest.
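To make the APSP problem mentioned in the abstract concrete, the following is a minimal brute-force sketch, assuming the standard formulation: for every ordered pair of strings, find the longest suffix of the first that is also a prefix of the second. It is not the algorithm of this paper, and it makes no attempt to meet the O(||P||) bound. The function names `longest_overlap` and `all_pairs_suffix_prefix` and the sample reads are hypothetical, chosen only for illustration.

```python
from itertools import product


def longest_overlap(a: str, b: str) -> str:
    """Return the longest suffix of `a` that is also a prefix of `b`.

    Assumption for this sketch: when a == b, only proper overlaps
    (strictly shorter than the string itself) are considered.
    """
    max_len = min(len(a), len(b))
    if a == b:
        max_len -= 1
    # Try candidate lengths from longest to shortest; the first hit wins.
    for k in range(max_len, 0, -1):
        if a[-k:] == b[:k]:
            return a[-k:]
    return ""


def all_pairs_suffix_prefix(strings):
    """Brute-force APSP: map every ordered pair (a, b) to its longest overlap."""
    return {(a, b): longest_overlap(a, b) for a, b in product(strings, repeat=2)}


if __name__ == "__main__":
    reads = ["ACGT", "GTAC", "TACG"]  # hypothetical toy input
    for (a, b), ov in all_pairs_suffix_prefix(reads).items():
        print(f"{a} -> {b}: {ov!r}")
```

The nested comparison over all ordered pairs makes this at least quadratic in the number of strings, which is the kind of cost the bounds quoted in the abstract avoid.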

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Trees
  • Theory of computation → Data compression
  • Theory of computation → Pattern matching
Keywords
  • Hierarchical Overlap Graphs
  • String algorithms
  • Genome assembly


References

  1. Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333-340, 1975.
  2. Dmitry Antipov, Anton I. Korobeynikov, Jeffrey S. McLean, and Pavel A. Pevzner. hybridspades: an algorithm for hybrid assembly of short and long reads. Bioinform., 32(7):1009-1015, 2016.
  3. Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son K. Pham, Andrey D. Prjibelski, Alex Pyshkin, Alexander Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, and Pavel A. Pevzner. Spades: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 19(5):455-477, 2012.
  4. Djamal Belazzougui and Fabio Cunial. Fully-functional bidirectional burrows-wheeler indexes and infinite-order de bruijn graphs. In Nadia Pisanti and Solon P. Pissis, editors, 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, June 18-20, 2019, Pisa, Italy, volume 128 of LIPIcs, pages 10:1-10:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
  5. Djamal Belazzougui, Travis Gagie, Veli Mäkinen, Marco Previtali, and Simon J. Puglisi. Bidirectional variable-order de bruijn graphs. Int. J. Found. Comput. Sci., 29(8):1279-1295, 2018.
  6. Avrim Blum, Tao Jiang, Ming Li, John Tromp, and Mihalis Yannakakis. Linear approximation of shortest superstrings. J. ACM, 41(4):630-647, 1994.
  7. Christina Boucher, Alexander Bowe, Travis Gagie, Simon J. Puglisi, and Kunihiko Sadakane. Variable-order de bruijn graphs. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, 2015 Data Compression Conference, DCC 2015, Snowbird, UT, USA, April 7-9, 2015, pages 383-392. IEEE, 2015.
  8. Rodrigo Cánovas, Bastien Cazaux, and Eric Rivals. The compressed overlap index. CoRR, abs/1707.05613, 2017. URL: http://arxiv.org/abs/1707.05613.
  9. Bastien Cazaux, Rodrigo Cánovas, and Eric Rivals. Shortest DNA cyclic cover in compressed space. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, 2016 Data Compression Conference, DCC 2016, Snowbird, UT, USA, March 30 - April 1, 2016, pages 536-545. IEEE, 2016.
  10. Bastien Cazaux and Eric Rivals. Hierarchical overlap graph. Inf. Process. Lett., 155, 2020.
  11. Mark de Berg, Otfried Cheong, Marc J. van Kreveld, and Mark H. Overmars. Computational geometry: algorithms and applications, 3rd Edition. Springer, 2008.
  12. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
  13. Dan Gusfield, Gad M. Landau, and Baruch Schieber. An efficient algorithm for the all pairs suffix-prefix problem. Inf. Process. Lett., 41(4):181-185, 1992.
  14. Shahbaz Khan. Optimal construction of hierarchical overlap graphs. CoRR, abs/2102.02873, 2021. URL: http://arxiv.org/abs/2102.02873.
  15. Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323-350, 1977.
  16. Jihyuk Lim and Kunsoo Park. A fast algorithm for the all-pairs suffix-prefix problem. Theor. Comput. Sci., 698:14-24, 2017.
  17. Eugene W. Myers. The fragment assembly string graph. Bioinformatics, 21(2):79-85, 2005. URL: https://doi.org/10.1093/bioinformatics/bti1114.
  18. Sergey Nurk, Dmitry Meleshko, Anton I. Korobeynikov, and Pavel A. Pevzner. metaspades: A new versatile de novo metagenomics assembler. In Mona Singh, editor, Research in Computational Molecular Biology - 20th Annual Conference, RECOMB 2016, Santa Monica, CA, USA, April 17-21, 2016, Proceedings, volume 9649 of Lecture Notes in Computer Science, page 258. Springer, 2016.
  19. Sangsoo Park, Sung Gwan Park, Bastien Cazaux, Kunsoo Park, and Eric Rivals. A linear time algorithm for constructing hierarchical overlap graphs. CoRR, abs/2102.12824, 2021 (accepted for publishing at CPM 2021). URL: http://arxiv.org/abs/2102.12824.
  20. Sung Gwan Park, Bastien Cazaux, Kunsoo Park, and Eric Rivals. Efficient construction of hierarchical overlap graphs. In Christina Boucher and Sharma V. Thankachan, editors, String Processing and Information Retrieval - 27th International Symposium, SPIRE 2020, Orlando, FL, USA, October 13-15, 2020, Proceedings, volume 12303 of Lecture Notes in Computer Science, pages 277-290. Springer, 2020.
  21. Hannu Peltola, Hans Söderlund, Jorma Tarhio, and Esko Ukkonen. Algorithms for some string matching problems arising in molecular genetics. In IFIP Congress, pages 59-64, 1983.
  22. P. A. Pevzner. l-Tuple DNA sequencing: computer analysis. Journal of Biomolecular Structure & Dynamics, 7(1):63-73, 1989.
  23. Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An eulerian path approach to dna fragment assembly. Proceedings of the National Academy of Sciences of the United States of America, 98(17):9748-9753, 2001.
  24. Jared T. Simpson and Richard Durbin. Efficient construction of an assembly string graph using the fm-index. Bioinform., 26(12):367-373, 2010.
  25. Z. Sweedyk. A 2½-approximation algorithm for shortest superstring. SIAM J. Comput., 29(3):954-986, 1999.
  26. William H. A. Tustumi, Simon Gog, Guilherme P. Telles, and Felipe A. Louza. An improved algorithm for the all-pairs suffix-prefix problem. J. Discrete Algorithms, 37:34-43, 2016.
  27. Esko Ukkonen. A linear-time algorithm for finding approximate shortest common superstrings. Algorithmica, 5(3):313-323, 1990.
  28. Daniel R Zerbino and Ewan Birney. Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome research, 18(5):821-829, May 2008. URL: https://doi.org/10.1101/gr.074492.107.