Fast matching statistics in small space

Authors Djamal Belazzougui, Fabio Cunial, Olgert Denas



PDF
Thumbnail PDF

File

LIPIcs.SEA.2018.17.pdf
  • Filesize: 0.5 MB
  • 14 pages

Document Identifiers

Author Details

Djamal Belazzougui
  • DTISI-CERIST, Algiers, Algeria.
Fabio Cunial
  • MPI-CBG and CSBD, Dresden, Germany.
Olgert Denas
  • Adobe Inc., San Jose, California, USA.

Cite As Get BibTex

Djamal Belazzougui, Fabio Cunial, and Olgert Denas. Fast matching statistics in small space. In 17th International Symposium on Experimental Algorithms (SEA 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 103, pp. 17:1-17:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018) https://doi.org/10.4230/LIPIcs.SEA.2018.17

Abstract

Computing the matching statistics of a string S with respect to a string T on an alphabet of size sigma is a fundamental primitive for a number of large-scale string analysis applications, including the comparison of entire genomes, for which space is a pressing issue. This paper takes from theory to practice an existing algorithm that uses just O(|T|log{sigma}) bits of space, and that computes a compact encoding of the matching statistics array in O(|S|log{sigma}) time. The techniques used to speed up the algorithm are of general interest, since they optimize queries on the existence of a Weiner link from a node of the suffix tree, and parent operations after unsuccessful Weiner links. Thus, they can be applied to other matching statistics algorithms, as well as to any suffix tree traversal that relies on such calls. Some of our optimizations yield a matching statistics implementation that is up to three times faster than a plain version of the algorithm, depending on the similarity between S and T. In genomic datasets of practical significance we achieve speedups of up to 1.8, but our fastest implementations take on average twice the time of an existing code based on the LCP array. The key advantage is that our implementations need between one half and one fifth of the competitor's memory, and they approach comparable running times when S and T are very similar.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures design and analysis
  • Theory of computation → Pattern matching
Keywords
  • Matching statistics
  • maximal repeat
  • Burrows-Wheeler transform
  • wavelet tree
  • suffix tree topology

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Alberto Apostolico, Concettina Guerra, Gad M. Landau, and Cinzia Pizzi. Sequence similarity measures based on bounded Hamming distance. Theoretical Computer Science, 638:76-90, 2016. Google Scholar
  2. Uwe Baier, Timo Beller, and Enno Ohlebusch. Space-efficient parallel construction of succinct representations of suffix tree topologies. Journal of Experimental Algorithmics (JEA), 22:1-1, 2017. Google Scholar
  3. Djamal Belazzougui and Fabio Cunial. Indexed matching statistics and shortest unique substrings. In International Symposium on String Processing and Information Retrieval, pages 179-190. Springer, 2014. Google Scholar
  4. Djamal Belazzougui and Fabio Cunial. A framework for space-efficient string kernels. Algorithmica, 79(3):857-883, 2017. Google Scholar
  5. Eyal Cohen and Benny Chor. Detecting phylogenetic signals in eukaryotic whole genome sequences. Journal of Computational Biology, 19(8):945-956, 2012. Google Scholar
  6. Martin Farach, Michiel Noordewier, Serap Savari, Larry Shepp, Abraham Wyner, and Jacob Ziv. On the entropy of DNA: algorithms and measurements based on memory and rapid convergence. In Proceedings of the sixth annual ACM-SIAM symposium on discrete algorithms, pages 48-57, 1995. Google Scholar
  7. Richard F Geary, Naila Rahman, Rajeev Raman, and Venkatesh Raman. A simple optimal representation for balanced parentheses. Theoretical Computer Science, 368(3):231-246, 2006. Google Scholar
  8. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: plug and play with succinct data structures. In 13th International Symposium on Experimental Algorithms, (SEA 2014), pages 326-337, 2014. Google Scholar
  9. R. González, G. Navarro, and H. Ferrada. Locally compressed suffix arrays. ACM Journal of Experimental Algorithmics, 19(1):article 1, 2014. Google Scholar
  10. Stefan Kurtz. Reducing the space requirement of suffix trees. Software-Practice and Experience, 29(13):1149-71, 1999. Google Scholar
  11. Chris-Andre Leimeister and Burkhard Morgenstern. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics, 30(14):2000-2008, 2014. Google Scholar
  12. Gonzalo Navarro and Kunihiko Sadakane. Fully functional static and dynamic succinct trees. ACM Transactions on Algorithms (TALG), 10(3):16, 2014. Google Scholar
  13. Enno Ohlebusch, Johannes Fischer, and Simon Gog. CST++. In International Symposium on String Processing and Information Retrieval, pages 322-333. Springer, 2010. Google Scholar
  14. Enno Ohlebusch and Simon Gog. A compressed enhanced suffix array supporting fast string matching. In International Symposium on String Processing and Information Retrieval, pages 51-62. Springer, 2009. Google Scholar
  15. Enno Ohlebusch, Simon Gog, and Adrian Kügel. Computing matching statistics and maximal exact matches on compressed full-text indexes. In SPIRE, pages 347-358, 2010. Google Scholar
  16. Daisuke Okanohara and Kunihiko Sadakane. A linear-time Burrows-Wheeler transform using induced sorting. In International Symposium on String Processing and Information Retrieval, pages 90-101. Springer, 2009. Google Scholar
  17. Nicolas Philippe, Mikaël Salson, Thérèse Commes, and Eric Rivals. CRAC: an integrated approach to the analysis of RNA-seq reads. Genome Biology, 14(3):R30, 2013. Google Scholar
  18. Cinzia Pizzi. Missmax: alignment-free sequence comparison with mismatches through filtering and heuristics. Algorithms for Molecular Biology, 11(1):6, 2016. Google Scholar
  19. Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory of Computing Systems, 41(4):589-607, 2007. Google Scholar
  20. Kunihiko Sadakane. The ultimate balanced parentheses. Technical report, Kyushu University, 2008. Google Scholar
  21. Kunihiko Sadakane and Gonzalo Navarro. Fully-functional succinct trees. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 134-149. SIAM, 2010. Google Scholar
  22. Choon Hui Teo and S.V.N. Vishwanathan. Fast and space efficient string kernels using suffix arrays. In Proceedings of the 23rd international conference on Machine learning, pages 929-936. ACM, 2006. Google Scholar
  23. Sharma V Thankachan, Alberto Apostolico, and Srinivas Aluru. A provably efficient algorithm for the k-mismatch average common substring problem. Journal of Computational Biology, 23(6):472-482, 2016. Google Scholar
  24. Sharma V. Thankachan, Sriram P. Chockalingam, Yongchao Liu, Ambujam Krishnan, and Srinivas Aluru. A greedy alignment-free distance estimator for phylogenetic inference. BMC bioinformatics, 18(8):238, 2017. Google Scholar
  25. Igor Ulitsky, David Burstein, Tamir Tuller, and Benny Chor. The average common substring approach to phylogenomic reconstruction. Journal of Computational Biology, 13(2):336-350, 2006. Google Scholar
  26. Peter Weiner. Linear pattern matching algorithms. In Switching and Automata Theory, 1973, pages 1-11. IEEE, 1973. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail