Practical Performance of Space Efficient Data Structures for Longest Common Extensions

Authors Patrick Dinklage , Johannes Fischer, Alexander Herlez, Tomasz Kociumaka , Florian Kurpicz



PDF
Thumbnail PDF

File

LIPIcs.ESA.2020.39.pdf
  • Filesize: 0.68 MB
  • 20 pages

Document Identifiers

Author Details

Patrick Dinklage
  • Department of Computer Science, Technical University of Dortmund, Germany
Johannes Fischer
  • Department of Computer Science, Technical University of Dortmund, Germany
Alexander Herlez
  • Department of Computer Science, Technical University of Dortmund, Germany
Tomasz Kociumaka
  • Department of Computer Science, Bar-Ilan Unviersity, Ramat Gan, Israel
Florian Kurpicz
  • Department of Computer Science, Technical University of Dortmund, Germany

Acknowledgements

Part of this work was carried out during the Dagstuhl Seminar 19241, "25 Years of the Burrows - Wheeler Transform".

Cite AsGet BibTex

Patrick Dinklage, Johannes Fischer, Alexander Herlez, Tomasz Kociumaka, and Florian Kurpicz. Practical Performance of Space Efficient Data Structures for Longest Common Extensions. In 28th Annual European Symposium on Algorithms (ESA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 173, pp. 39:1-39:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.ESA.2020.39

Abstract

For a text T[1,n], a Longest Common Extension (LCE) query lce_T(i,j) asks for the length of the longest common prefix of the suffixes T[i,n] and T[j,n] identified by their starting positions 1 ≤ i,j ≤ n. A classic problem in stringology asks to preprocess a static text T[1,n] over an alphabet of size σ so that LCE queries can be efficiently answered on-line. Since its introduction in the 1980’s, this problem has found numerous applications: in suffix sorting, edit distance computation, approximate pattern matching, regularities finding, string mining, and many more. Text-book solutions offer O(n) preprocessing time and O(1) query time, but they employ memory-heavy data structures, such as suffix arrays, in practice several times bigger than the text itself. Very recently, more space efficient solutions using O(nlogσ) bits of total space or even only O(log n) bits of extra space have been proposed: string synchronizing sets [Kempa and Kociumaka, STOC'19, and Birenzwige et al., SODA'20] and in-place fingerprinting [Prezza, SODA'18]. The goal of this article is to present well-engineered implementations of these new solutions and study their practicality on a commonly agreed text corpus. We show that both perform extremely well in practice, with space consumption of only around 10% of the input size for string synchronizing sets (around 20% for highly repetitive texts), and essentially no extra space for fingerprinting. Interestingly, our experiments also show that both solutions become much faster than naive scanning even for finding common prefixes of moderate length, contradicting a common belief that sophisticated data structures for LCE queries are not competitive with naive approaches [Ilie and Tinta, SPIRE'09].

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • text indexing
  • longest common prefix
  • space efficient data structures

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Stephen Alstrup, Gerth Stølting Brodal, and Theis Rauhe. Pattern matching in dynamic texts. In 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 819-828. ACM/SIAM, 2000. URL: http://dl.acm.org/citation.cfm?id=338219.338645.
  2. Amihood Amir, Moshe Lewenstein, and Ely Porat. Faster algorithms for string matching with k mismatches. Journal of Algorithms, 50(2):257-275, 2004. URL: https://doi.org/10.1016/S0196-6774(03)00097-X.
  3. Johannes Bahne, Nico Bertram, Marvin Böcker, Jonas Bode, Johannes Fischer, Hermann Foot, Florian Grieskamp, Florian Kurpicz, Marvin Löbel, Oliver Magiera, Rosa Pink, David Piper, and Christopher Poeplau. Sacabench: Benchmarking suffix array construction. In 26th International Symposium on String Processing and Information Retrieval (SPIRE), volume 11811 of Lecture Notes in Computer Science, pages 407-416. Springer, 2019. URL: https://doi.org/10.1007/978-3-030-32686-9_29.
  4. Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The "runs" theorem. SIAM Journal on Computing, 46(5):1501-1514, 2017. URL: https://doi.org/10.1137/15M1011032.
  5. Hideo Bannai, Shunsuke Inenaga, and Dominik Köppl. Computing all distinct squares in linear time for integer alphabets. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 78 of LIPIcs, pages 22:1-22:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. URL: https://doi.org/10.4230/LIPIcs.CPM.2017.22.
  6. Philip Bille, Anders Roy Christiansen, Patrick Hagge Cording, and Inge Li Gørtz. Finger search in grammar-compressed strings. Theory of Computing Systems, 62(8):1715-1735, 2018. URL: https://doi.org/10.1007/s00224-017-9839-9.
  7. Philip Bille, Johannes Fischer, Inge Li Gørtz, Tsvi Kopelowitz, Benjamin Sach, and Hjalte Wedel Vildhøj. Sparse text indexing in small space. ACM Transactions on Algorithms, 12(3):39:1-39:19, 2016. URL: https://doi.org/10.1145/2836166.
  8. Philip Bille, Inge Li Gørtz, Patrick Hagge Cording, Benjamin Sach, Hjalte Wedel Vildhøj, and Søren Vind. Fingerprints in compressed strings. Journal of Computer and System Sciences, 86:171-180, 2017. URL: https://doi.org/10.1016/j.jcss.2017.01.002.
  9. Philip Bille, Inge Li Gørtz, Mathias Bæk Tejs Knudsen, Moshe Lewenstein, and Hjalte Wedel Vildhøj. Longest common extensions in sublinear space. In 26th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 9133 of Lecture Notes in Computer Science, pages 65-76. Springer, 2015. URL: https://doi.org/10.1007/978-3-319-19929-0_6.
  10. Philip Bille, Inge Li Gørtz, Benjamin Sach, and Hjalte Wedel Vildhøj. Time-space trade-offs for longest common extensions. Journal of Discrete Algorithms, 25:42-50, 2014. URL: https://doi.org/10.1016/j.jda.2013.06.003.
  11. Timo Bingmann, Andreas Eberle, and Peter Sanders. Engineering parallel string sorting. Algorithmica, 77(1):235-286, 2017. URL: https://doi.org/10.1007/s00453-015-0071-1.
  12. Or Birenzwige, Shay Golan, and Ely Porat. Locally consistent parsing for text indexing in small space. In 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 607-626. SIAM, 2020. URL: https://doi.org/10.1137/1.9781611975994.37.
  13. Maxime Crochemore, Roman Kolpakov, and Gregory Kucherov. Optimal bounds for computing α-gapped repeats. Information and Computation, 268, 2019. URL: https://doi.org/10.1016/j.ic.2019.104434.
  14. Roman Dementiev, Lutz Kettner, Jens Mehnert, and Peter Sanders. Engineering a sorted list data structure for 32 bit key. In Sixth Workshop on Algorithm Engineering and Experiments (ALENEX) and the First Workshop on Analytic Algorithmics and Combinatorics (ANALCO), pages 142-151. SIAM, 2004. Google Scholar
  15. Martin Farach-Colton, Paolo Ferragina, and S. Muthukrishnan. On the sorting-complexity of suffix tree construction. Journal of the ACM, 47(6):987-1011, 2000. URL: https://doi.org/10.1145/355541.355547.
  16. Héctor Ferrada and Gonzalo Navarro. Improved range minimum queries. Journal of Discrete Algorithms, 43:72-80, 2017. URL: https://doi.org/10.1016/j.jda.2016.09.002.
  17. Paolo Ferragina and Gonzalo Navarro. Pizza&Chili corpus: Compressed indexes and their testbeds. URL: http://pizzachili.dcc.uchile.cl/.
  18. Johannes Fischer and Volker Heun. Theoretical and practical improvements on the rmq-problem, with applications to LCA and LCE. In 17th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 4009 of Lecture Notes in Computer Science, pages 36-48. Springer, 2006. URL: https://doi.org/10.1007/11780441_5.
  19. Johannes Fischer, Tomohiro I, and Dominik Köppl. Deterministic sparse suffix sorting on rewritable texts. In 12th Latin American Symposium on Theoretical Informatics (LATIN), volume 9644 of Lecture Notes in Computer Science, pages 483-496. Springer, 2016. URL: https://doi.org/10.1007/978-3-662-49529-2_36.
  20. Johannes Fischer, Veli Mäkinen, and Gonzalo Navarro. Faster entropy-bounded compressed suffix trees. Theoretical Computer Science, 410(51):5354-5364, 2009. URL: https://doi.org/10.1016/j.tcs.2009.09.012.
  21. Johannes Fischer, Veli Mäkinen, and Niko Välimäki. Space efficient string mining under frequency constraints. In 8th IEEE International Conference on Data Mining (ICDM), pages 193-202. IEEE Computer Society, 2008. URL: https://doi.org/10.1109/ICDM.2008.32.
  22. Zvi Galil and Raffaele Giancarlo. Improved string matching with k mismatches. SIGACT News, 17(4):52-54, 1986. URL: https://doi.org/10.1145/8307.8309.
  23. Paweł Gawrychowski, Adam Karczmarz, Tomasz Kociumaka, Jakub Łącki, and Piotr Sankowski. Optimal dynamic strings. In 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1509-1528. SIAM, 2018. URL: https://doi.org/10.1137/1.9781611975031.99.
  24. Paweł Gawrychowski and Tomasz Kociumaka. Sparse suffix tree construction in optimal time and space. In 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 425-439. SIAM, 2017. URL: https://doi.org/10.1137/1.9781611974782.27.
  25. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In Joachim Gudmundsson and Jyrki Katajainen, editors, 13th International Symposium on Experimental Algorithms, SEA 2014, volume 8504 of LNCS, pages 326-337. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_28.
  26. Dan Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997. URL: https://doi.org/10.1017/cbo9780511574931.
  27. Tomohiro I. Longest common extensions with recompression. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 78 of LIPIcs, pages 18:1-18:15. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2017. URL: https://doi.org/10.4230/LIPIcs.CPM.2017.18.
  28. Tomohiro I, Wataru Matsubara, Kouji Shimohira, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda, Kazuyuki Narisawa, and Ayumi Shinohara. Detecting regularities on grammar-compressed strings. Information and Computation, 240:74-89, 2015. URL: https://doi.org/10.1016/j.ic.2014.09.009.
  29. Lucian Ilie and Liviu Tinta. Practical algorithms for the longest common extension problem. In 16th International Symposium on String Processing and Information Retrieval (SPIRE), volume 5721 of Lecture Notes in Computer Science, pages 302-309. Springer, 2009. URL: https://doi.org/10.1007/978-3-642-03784-9_30.
  30. Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. Journal of the ACM, 53(6):918-936, 2006. URL: https://doi.org/10.1145/1217856.1217858.
  31. Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987. URL: https://doi.org/10.1147/rd.312.0249.
  32. Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 756-767. ACM, 2019. URL: https://doi.org/10.1145/3313276.3316368.
  33. Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Internal pattern matching queries in a text and applications. In 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 532-551. SIAM, 2015. URL: https://doi.org/10.1137/1.9781611973730.36.
  34. Dmitry Kosolobov. Tight lower bounds for the longest common extension problem. Information Processing Letters, 125:26-29, 2017. URL: https://doi.org/10.1016/j.ipl.2017.05.003.
  35. Gad M. Landau, Eugene W. Myers, and Jeanette P. Schmidt. Incremental string comparison. SIAM Journal on Computing, 27(2):557-582, 1998. URL: https://doi.org/10.1137/S0097539794264810.
  36. Gad M. Landau and Uzi Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science, 43:239-249, 1986. URL: https://doi.org/10.1016/0304-3975(86)90178-7.
  37. Gad M. Landau and Uzi Vishkin. Fast string matching with k differences. Journal of Computer and System Sciences, 37(1):63-78, 1988. URL: https://doi.org/10.1016/0022-0000(88)90045-1.
  38. Yoshiaki Matsuoka, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Semi-dynamic compact index for short patterns and succinct van Emde Boas tree. In 26th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 9133 of Lecture Notes in Computer Science, pages 355-366. Springer, 2015. URL: https://doi.org/10.1007/978-3-319-19929-0_30.
  39. Kurt Mehlhorn, R. Sundar, and Christian Uhrig. Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica, 17(2):183-198, 1997. URL: https://doi.org/10.1007/BF02522825.
  40. Yuta Mori. sais: An implementation of the induced sorting algorithm. URL: https://sites.google.com/site/yuta256/.
  41. Gonzalo Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, 2016. URL: https://doi.org/10.1017/cbo9781316588284.
  42. Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Fully dynamic data structure for LCE queries in compressed space. In 41st International Symposium on Mathematical Foundations of Computer Science (MFCS), volume 58 of LIPIcs, pages 72:1-72:15. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2016. URL: https://doi.org/10.4230/LIPIcs.MFCS.2016.72.
  43. Cinzia Pizzi. Missmax: alignment-free sequence comparison with mismatches through filtering and heuristics. Algorithms for Molecular Biology, 11:6, 2016. URL: https://doi.org/10.1186/s13015-016-0072-x.
  44. Nicola Prezza. In-place sparse suffix sorting. In 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1496-1508. SIAM, 2018. URL: https://doi.org/10.1137/1.9781611975031.98.
  45. Wei Quan, Bo Liu, and Yadong Wang. SALT: a fast, memory-efficient and snp-aware short read alignment tool. In IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1774-1779. IEEE, 2019. URL: https://doi.org/10.1109/BIBM47256.2019.8983162.
  46. Süleyman Cenk Sahinalp and Uzi Vishkin. On a parallel-algorithms method for string matching problems. In Second Italian Conference on Algorithms and Complexity (CIAC), volume 778 of LNCS, pages 22-32. Springer, 1994. URL: https://doi.org/10.1007/3-540-57811-0_3.
  47. Avi Srivastava, Hirak Sarkar, Nitish Gupta, and Robert Patro. Rapmap: a rapid, sensitive and accurate tool for mapping rna-seq reads to transcriptomes. Bioinformatics, 32(12):192-200, 2016. URL: https://doi.org/10.1093/bioinformatics/btw277.
  48. Yuka Tanimura, Tomohiro I, Hideo Bannai, Shunsuke Inenaga, Simon J. Puglisi, and Masayuki Takeda. Deterministic sub-linear space LCE data structures with efficient construction. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 54 of LIPIcs, pages 1:1-1:10. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2016. URL: https://doi.org/10.4230/LIPIcs.CPM.2016.1.
  49. Peter van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters, 6(3):80-82, 1977. URL: https://doi.org/10.1016/0020-0190(77)90031-X.
  50. Chen Zhou, Hao Chi, Leheng Wang, You Li, Yan-Jie Wu, Yan Fu, Ruixiang Sun, and Si-Min He. Speeding up tandem mass spectrometry-based database searching by longest common prefix. BMC Bioinformatics, 11:577, 2010. URL: https://doi.org/10.1186/1471-2105-11-577.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail