Efficient Exact Online String Matching Through Linked Weak Factors

Authors: Matthew N. Palmer, Simone Faro, Stefano Scafiti






Author Details

Matthew N. Palmer
  • The British Computer Society, Swindon, United Kingdom
Simone Faro
  • Department of Mathematics and Computer Science, University of Catania, Italy
Stefano Scafiti
  • Department of Mathematics and Computer Science, University of Catania, Italy

Cite As

Matthew N. Palmer, Simone Faro, and Stefano Scafiti. Efficient Exact Online String Matching Through Linked Weak Factors. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 24:1-24:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.SEA.2024.24

Abstract

Online exact string matching is a fundamental problem in computer science: searching sequentially for a pattern in a large text without prior access to the entire text. It has diverse applications in data compression, data mining, text editing, and bioinformatics, among others, where efficient substring matching is crucial. Although the problem has been studied for years, recent decades have seen a heightened focus on experimental solutions that employ various techniques to achieve superior performance. Notably, approaches centered on weak factor recognition have emerged as leaders in experimental settings and are gaining increasing attention. This paper introduces Hash Chain, a novel algorithm founded on a robust weak factor recognition approach that links adjacent factors through hashing. Building on the efficacy of weak recognition techniques, the proposed algorithm incorporates innovative strategies for organizing data structures, together with optimizations to enhance performance. Despite its quadratic worst-case time complexity, the newly proposed algorithm demonstrates sublinear behavior in practice, outperforming currently known algorithms in the literature.
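To make the idea of linking adjacent factors through hashing more concrete, the following is a minimal illustrative sketch. It is not the authors' Hash Chain implementation. It assumes a simplified scheme in which each q-gram of the pattern is hashed into a small table, and each table entry also stores fingerprint bits for the q-grams that appear immediately to its left in the pattern. The names `qgram_hash`, `link_bit`, `build_table`, and `search`, as well as the table size, hash multiplier, and fingerprint width, are assumptions made for illustration only.

```python
TABLE_SIZE = 1 << 12   # number of hash buckets (illustrative choice)
PRESENT = 1            # bit 0: some pattern q-gram hashes to this bucket


def qgram_hash(s, start, q):
    """Hash the q-gram s[start:start+q] into a bucket index."""
    h = 0
    for ch in s[start:start + q]:
        h = (h * 131 + ord(ch)) % TABLE_SIZE
    return h


def link_bit(left_hash):
    """Fingerprint bit (positions 1..31) for the q-gram on the left."""
    return 1 << (1 + left_hash % 31)


def build_table(pattern, q):
    """Record every pattern q-gram and its link to the q-gram q positions earlier."""
    hashes = [qgram_hash(pattern, i, q) for i in range(len(pattern) - q + 1)]
    table = [0] * TABLE_SIZE
    for i, h in enumerate(hashes):
        table[h] |= PRESENT
        if i >= q:
            table[h] |= link_bit(hashes[i - q])
    return table


def search(pattern, text, q=3):
    """Return all start positions of pattern in text (assumed sketch)."""
    m, n = len(pattern), len(text)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    q = min(q, m)
    table = build_table(pattern, q)
    matches = []
    s = 0                                  # start of the current window
    while s + m <= n:
        k = s + m - q                      # start of the last q-gram in the window
        h = qgram_hash(text, k, q)
        if not table[h] & PRESENT:
            s = k + 1                      # this q-gram cannot occur in the pattern
            continue
        shift_to = None
        while k - q >= s:                  # follow the chain right to left
            g = qgram_hash(text, k - q, q)
            if not table[h] & link_bit(g):
                shift_to = k - q + 1       # this adjacent pair is not a pattern factor
                break
            k, h = k - q, g
        if shift_to is not None:
            s = shift_to
            continue
        if text[s:s + m] == pattern:       # weak recognition needs exact verification
            matches.append(s)
        s += 1
    return matches
```

For example, `search("abra", "abracadabra")` returns `[0, 7]`. Because the table can report false positives (hash and fingerprint collisions), the filter is only a weak recognizer, so every surviving window is verified by direct comparison. As in the abstract, this sketch has a quadratic worst case but can skip large parts of the text when a q-gram or a link is missing.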

Subject Classification

ACM Subject Classification
  • Theory of computation → Bloom filters and hashing
  • Theory of computation → Pattern matching
Keywords
  • String matching
  • text processing
  • weak recognition
  • hashing
  • experimental algorithms
  • design and analysis of algorithms

