Fast and Linear-Time String Matching Algorithms Based on the Distances of q-Gram Occurrences

Kobayashi, Satoshi; Hendrian, Diptarama; Yoshinaka, Ryo; Shinohara, Ayumi

doi:10.4230/LIPIcs.SEA.2020.13

Abstract

Given a text T of length n and a pattern P of length m, the string matching problem is to find all occurrences of P in T. In this study, we propose an algorithm that solves this problem in O((n + m)q) time by considering the distance between two adjacent occurrences of the same q-gram contained in P. We also propose a theoretical improvement of it that runs in O(n + m) time, though it is not necessarily faster in practice. We compare the execution times of our algorithms and existing ones on various kinds of real and artificial datasets, such as an English text, a genome sequence, and a Fibonacci string. The experimental results show that our algorithm is as fast as the state-of-the-art algorithms in many cases, particularly when a pattern frequently appears in a text.
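To make the two notions in the abstract concrete, the following is a minimal Python sketch. It is not the algorithm proposed in the paper: `find_occurrences` is only a naive baseline that states the string matching problem, and `qgram_distances` only computes the quantity the abstract refers to (the distance between adjacent occurrences of the same q-gram in P). Both function names are hypothetical and chosen for illustration; how the paper uses these distances during the search is not described on this page.

```python
from collections import defaultdict


def find_occurrences(text, pattern):
    """Naive baseline (not the paper's method): all start indices of pattern in text."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def qgram_distances(pattern, q):
    """Map each q-gram of pattern to the gaps between its adjacent occurrences.

    Illustrative only: this computes the distances named in the abstract,
    not the shift rule the authors derive from them.
    """
    positions = defaultdict(list)
    for i in range(len(pattern) - q + 1):
        positions[pattern[i:i + q]].append(i)
    # Consecutive start positions of the same q-gram give the distances.
    return {g: [b - a for a, b in zip(pos, pos[1:])] for g, pos in positions.items()}


if __name__ == "__main__":
    print(find_occurrences("abababa", "aba"))
    print(qgram_distances("abcabcab", 2))
```

A q-gram that occurs only once in the pattern maps to an empty list, since it has no adjacent occurrence to measure against.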

Cite As

Satoshi Kobayashi, Diptarama Hendrian, Ryo Yoshinaka, and Ayumi Shinohara. Fast and Linear-Time String Matching Algorithms Based on the Distances of q-Gram Occurrences. In 18th International Symposium on Experimental Algorithms (SEA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 160, pp. 13:1-13:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020) https://doi.org/10.4230/LIPIcs.SEA.2020.13

Author Details

Satoshi Kobayashi

Graduate School of Information Sciences, Tohoku University, Sendai, Japan

Diptarama Hendrian

Graduate School of Information Sciences, Tohoku University, Sendai, Japan

Ryo Yoshinaka

Graduate School of Information Sciences, Tohoku University, Sendai, Japan

Ayumi Shinohara

Graduate School of Information Sciences, Tohoku University, Sendai, Japan

Funding

Hendrian, Diptarama: Supported by JSPS KAKENHI Grant Number JP19K20208.

Supplementary Materials

The implementations of our algorithms are available at https://github.com/ushitora/distq.

References

Cyril Allauzen, Maxime Crochemore, and Mathieu Raffinot. Factor oracle: A new structure for pattern matching. In Jan Pavelka, Gerard Tel, and Miroslav Bartošek, editors, SOFSEM'99: Theory and Practice of Informatics, pages 295-310. Springer Berlin Heidelberg, 1999.
Ross Arnold and Tim Bell. A corpus for the evaluation of lossless compression algorithms. In Proceedings of DCC '97. Data Compression Conference, pages 201-210, 1997.
Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Communications of the ACM, 20(10):762-772, 1977. URL: https://doi.org/10.1145/359842.359859.
Domenico Cantone and Simone Faro. Searching for a substring with constant extra-space complexity. In Proceedings of Third International Conference on Fun with algorithms, pages 118-131, 2004.
Domenico Cantone and Simone Faro. Fast-search algorithms: New efficient variants of the Boyer-Moore pattern-matching algorithm. Journal of Automata, Languages and Combinatorics, 10:589-608, 2005. URL: https://doi.org/10.1007/3-540-44867-5_4.
Domenico Cantone and Simone Faro. Improved and self-tuned occurrence heuristics. Journal of Discrete Algorithms, 28:73-84, 2014. URL: https://doi.org/10.1016/j.jda.2014.07.006.
Domenico Cantone, Simone Faro, and Emanuele Giaquinta. A compact representation of nondeterministic (suffix) automata for the bit-parallel approach. Information and Computation, 213:3-12, 2012. Special Issue: Combinatorial Pattern Matching (CPM 2010). URL: https://doi.org/10.1016/j.ic.2011.03.006.
Domenico Cantone, Simone Faro, and Arianna Pavone. Linear and Efficient String Matching Algorithms Based on Weak Factor Recognition. Journal of Experimental Algorithmics, 24(1):1-20, 2019. URL: https://doi.org/10.1145/3301295.
Simone Faro. A very fast string matching algorithm based on condensed alphabets. In Riccardo Dondi, Guillaume Fertin, and Giancarlo Mauri, editors, Algorithmic Aspects in Information and Management - 11th International Conference, AAIM 2016, Bergamo, Italy, July 18-20, 2016, Proceedings, volume 9778 of Lecture Notes in Computer Science, pages 65-76. Springer, 2016. URL: https://doi.org/10.1007/978-3-319-41168-2_6.
Simone Faro and Thierry Lecroq. A fast suffix automata based algorithm for exact online string matching. In Nelma Moreira and Rogério Reis, editors, Implementation and Application of Automata, pages 149-158. Springer Berlin Heidelberg, 2012.
Simone Faro and Thierry Lecroq. A multiple sliding windows approach to speed up string matching algorithms. In Ralf Klasing, editor, Experimental Algorithms, pages 172-183. Springer Berlin Heidelberg, 2012.
Simone Faro and Thierry Lecroq. The exact online string matching problem. ACM Computing Surveys, 45(2):1-42, 2013. URL: https://doi.org/10.1145/2431211.2431212.
Simone Faro, Thierry Lecroq, Stefano Borzì, Simone Di Mauro, and Alessandro Maggio. The string matching algorithms research tool. In Jan Holub and Jan Žďárek, editors, Proceedings of the Prague Stringology Conference 2016, pages 99-113, Czech Technical University in Prague, Czech Republic, 2016.
Frantisek Franek, Christopher G. Jennings, and W.F. Smyth. A simple fast hybrid pattern-matching algorithm. Journal of Discrete Algorithms, 5(4):682-695, 2007. URL: https://doi.org/10.1016/J.JDA.2006.11.004.
Saqib I. Hakak, Amirrudin Kamsin, Palaiahnakote Shivakumara, Gulshan A. Gilkar, Wazir Z. Khan, and Muhammad Imran. Exact string matching algorithms: Survey, issues, and future research directions. IEEE Access, 7:69614-69637, 2019.
R. Nigel Horspool. Practical fast searching in strings. Software: Practice and Experience, 10(6):501-506, 1980. URL: https://doi.org/10.1002/spe.4380100608.
Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323-350, 1977. URL: https://doi.org/10.1137/0206024.
Satoshi Kobayashi, Diptarama Hendrian, Ryo Yoshinaka, and Ayumi Shinohara. An improvement of the Franek-Jennings-Smyth pattern matching algorithm. In Proceedings of the Prague Stringology Conference 2019, pages 56-68, 2019.
Thierry Lecroq. Fast exact string matching algorithms. Information Processing Letters, 102(6):229-235, 2007. URL: https://doi.org/10.1016/j.ipl.2007.01.002.
Gonzalo Navarro and Mathieu Raffinot. A bit-parallel approach to suffix automata: Fast extended string matching. In Martin Farach-Colton, editor, Combinatorial Pattern Matching, pages 14-33. Springer Berlin Heidelberg, 1998.
Daniel M. Sunday. A very fast substring search algorithm. Communications of the ACM, 33(8):132-142, 1990. URL: https://doi.org/10.1145/79173.79184.
Sun Wu and Udi Manber. A fast algorithm for multi-pattern searching. Technical Report TR-94-17, Department of Computer Science, Chung-Cheng University, 1994.
