Fast matching statistics in small space

Belazzougui, Djamal; Cunial, Fabio; Denas, Olgert

doi:10.4230/LIPIcs.SEA.2018.17

File

Author Details

Djamal Belazzougui

DTISI-CERIST, Algiers, Algeria.

Fabio Cunial

MPI-CBG and CSBD, Dresden, Germany.

Olgert Denas

Adobe Inc., San Jose, California, USA.

Cite AsGet BibTex

Djamal Belazzougui, Fabio Cunial, and Olgert Denas. Fast matching statistics in small space. In 17th International Symposium on Experimental Algorithms (SEA 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 103, pp. 17:1-17:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.SEA.2018.17

Abstract

Computing the matching statistics of a string S with respect to a string T on an alphabet of size sigma is a fundamental primitive for a number of large-scale string analysis applications, including the comparison of entire genomes, for which space is a pressing issue. This paper takes from theory to practice an existing algorithm that uses just O(|T|log{sigma}) bits of space, and that computes a compact encoding of the matching statistics array in O(|S|log{sigma}) time. The techniques used to speed up the algorithm are of general interest, since they optimize queries on the existence of a Weiner link from a node of the suffix tree, and parent operations after unsuccessful Weiner links. Thus, they can be applied to other matching statistics algorithms, as well as to any suffix tree traversal that relies on such calls. Some of our optimizations yield a matching statistics implementation that is up to three times faster than a plain version of the algorithm, depending on the similarity between S and T. In genomic datasets of practical significance we achieve speedups of up to 1.8, but our fastest implementations take on average twice the time of an existing code based on the LCP array. The key advantage is that our implementations need between one half and one fifth of the competitor's memory, and they approach comparable running times when S and T are very similar.

Subject Classification

ACM Subject Classification

Theory of computation → Data structures design and analysis
Theory of computation → Pattern matching

Keywords

Matching statistics
maximal repeat
Burrows-Wheeler transform
wavelet tree
suffix tree topology

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

Alberto Apostolico, Concettina Guerra, Gad M. Landau, and Cinzia Pizzi. Sequence similarity measures based on bounded Hamming distance. Theoretical Computer Science, 638:76-90, 2016.
Uwe Baier, Timo Beller, and Enno Ohlebusch. Space-efficient parallel construction of succinct representations of suffix tree topologies. Journal of Experimental Algorithmics (JEA), 22:1-1, 2017.
Djamal Belazzougui and Fabio Cunial. Indexed matching statistics and shortest unique substrings. In International Symposium on String Processing and Information Retrieval, pages 179-190. Springer, 2014.
Djamal Belazzougui and Fabio Cunial. A framework for space-efficient string kernels. Algorithmica, 79(3):857-883, 2017.
Eyal Cohen and Benny Chor. Detecting phylogenetic signals in eukaryotic whole genome sequences. Journal of Computational Biology, 19(8):945-956, 2012.
Martin Farach, Michiel Noordewier, Serap Savari, Larry Shepp, Abraham Wyner, and Jacob Ziv. On the entropy of DNA: algorithms and measurements based on memory and rapid convergence. In Proceedings of the sixth annual ACM-SIAM symposium on discrete algorithms, pages 48-57, 1995.
Richard F Geary, Naila Rahman, Rajeev Raman, and Venkatesh Raman. A simple optimal representation for balanced parentheses. Theoretical Computer Science, 368(3):231-246, 2006.
Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: plug and play with succinct data structures. In 13th International Symposium on Experimental Algorithms, (SEA 2014), pages 326-337, 2014.
R. González, G. Navarro, and H. Ferrada. Locally compressed suffix arrays. ACM Journal of Experimental Algorithmics, 19(1):article 1, 2014.
Stefan Kurtz. Reducing the space requirement of suffix trees. Software-Practice and Experience, 29(13):1149-71, 1999.
Chris-Andre Leimeister and Burkhard Morgenstern. Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics, 30(14):2000-2008, 2014.
Gonzalo Navarro and Kunihiko Sadakane. Fully functional static and dynamic succinct trees. ACM Transactions on Algorithms (TALG), 10(3):16, 2014.
Enno Ohlebusch, Johannes Fischer, and Simon Gog. CST++. In International Symposium on String Processing and Information Retrieval, pages 322-333. Springer, 2010.
Enno Ohlebusch and Simon Gog. A compressed enhanced suffix array supporting fast string matching. In International Symposium on String Processing and Information Retrieval, pages 51-62. Springer, 2009.
Enno Ohlebusch, Simon Gog, and Adrian Kügel. Computing matching statistics and maximal exact matches on compressed full-text indexes. In SPIRE, pages 347-358, 2010.
Daisuke Okanohara and Kunihiko Sadakane. A linear-time Burrows-Wheeler transform using induced sorting. In International Symposium on String Processing and Information Retrieval, pages 90-101. Springer, 2009.
Nicolas Philippe, Mikaël Salson, Thérèse Commes, and Eric Rivals. CRAC: an integrated approach to the analysis of RNA-seq reads. Genome Biology, 14(3):R30, 2013.
Cinzia Pizzi. Missmax: alignment-free sequence comparison with mismatches through filtering and heuristics. Algorithms for Molecular Biology, 11(1):6, 2016.
Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory of Computing Systems, 41(4):589-607, 2007.
Kunihiko Sadakane. The ultimate balanced parentheses. Technical report, Kyushu University, 2008.
Kunihiko Sadakane and Gonzalo Navarro. Fully-functional succinct trees. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 134-149. SIAM, 2010.
Choon Hui Teo and S.V.N. Vishwanathan. Fast and space efficient string kernels using suffix arrays. In Proceedings of the 23rd international conference on Machine learning, pages 929-936. ACM, 2006.
Sharma V Thankachan, Alberto Apostolico, and Srinivas Aluru. A provably efficient algorithm for the k-mismatch average common substring problem. Journal of Computational Biology, 23(6):472-482, 2016.
Sharma V. Thankachan, Sriram P. Chockalingam, Yongchao Liu, Ambujam Krishnan, and Srinivas Aluru. A greedy alignment-free distance estimator for phylogenetic inference. BMC bioinformatics, 18(8):238, 2017.
Igor Ulitsky, David Burstein, Tamir Tuller, and Benny Chor. The average common substring approach to phylogenomic reconstruction. Journal of Computational Biology, 13(2):336-350, 2006.
Peter Weiner. Linear pattern matching algorithms. In Switching and Automata Theory, 1973, pages 1-11. IEEE, 1973.

Fast matching statistics in small space

Authors Djamal Belazzougui, Fabio Cunial, Olgert Denas

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message

Fast matching statistics in small space

Authors Djamal Belazzougui, Fabio Cunial, Olgert Denas

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Supplementary Materials

References

Thanks for your feedback!

Could not send message