Algorithm Engineering for All-Pairs Suffix-Prefix Matching

Authors Jihyuk Lim, Kunsoo Park



Cite As

Jihyuk Lim and Kunsoo Park. Algorithm Engineering for All-Pairs Suffix-Prefix Matching. In 16th International Symposium on Experimental Algorithms (SEA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 75, pp. 14:1-14:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)
https://doi.org/10.4230/LIPIcs.SEA.2017.14

Abstract

All-pairs suffix-prefix matching is an important part of DNA sequence assembly, where it is the most time-consuming step of the whole assembly. Although there are algorithms for all-pairs suffix-prefix matching that are optimal in asymptotic time complexity, they are slower than SOF and Readjoiner, the state-of-the-art algorithms used in practice. In this paper we present an algorithm for all-pairs suffix-prefix matching that uses a simple data structure for storing input strings and advanced algorithmic techniques for matching, which together lead to fast running time in practice. On average, our algorithm is 14 times faster than SOF and 18 times faster than Readjoiner on real and random datasets.
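
To make the problem statement concrete, the following is a minimal brute-force sketch of all-pairs suffix-prefix matching: for every ordered pair of input strings, it reports the length of the longest suffix of the first string that is also a prefix of the second. It is not the algorithm presented in the paper, nor that of SOF or Readjoiner; the function names, the sample reads, and the choice to skip pairs of a string with itself are assumptions made only for illustration.

```python
def longest_suffix_prefix(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    # Try candidate overlap lengths from longest to shortest.
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def all_pairs_suffix_prefix(reads: list[str]) -> dict[tuple[int, int], int]:
    """Brute-force overlaps for every ordered pair (i, j) with i != j.

    Skipping i == j is an assumption of this sketch.
    """
    return {
        (i, j): longest_suffix_prefix(a, b)
        for i, a in enumerate(reads)
        for j, b in enumerate(reads)
        if i != j
    }


# Hypothetical sample reads, chosen only for illustration.
reads = ["ACGTAC", "TACGGA", "GGACGT"]
overlaps = all_pairs_suffix_prefix(reads)
for (i, j), length in sorted(overlaps.items()):
    print(f"suffix of read {i} / prefix of read {j}: overlap {length}")
```

This baseline compares every pair directly, so its cost grows quadratically with the number of strings. That growth is why practical tools rely on index structures and faster matching techniques instead.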

Keywords
  • all-pairs suffix-prefix matching
  • algorithm engineering
  • DNA sequence assembly

