The Great Textual Hoax: Boosting Sampled String Matching with Fake Samples

Faro, Simone; Marino, Francesco Pio; Moschetto, Andrea; Pavone, Arianna; Scardace, Antonio

doi:10.4230/LIPIcs.FUN.2024.13

File

LIPIcs.FUN.2024.13.pdf

Filesize: 0.79 MB
17 pages

Document Identifiers

DOI: 10.4230/LIPIcs.FUN.2024.13
URN: urn:nbn:de:0030-drops-199211

Author Details

Simone Faro

Department of Mathematics and Computer Science, University of Catania, Italy

Francesco Pio Marino

Department of Mathematics and Computer Science, University of Catania, Italy
Univ Rouen Normandie, INSA Rouen Normandie, Université Le Havre Normandie, Normandie Univ, LITIS UR 4108, CNRS NormaSTIC FR 3638, IRIB, Rouen, F-76000, France

Andrea Moschetto

Department of Mathematics and Computer Science, University of Catania, Italy

Arianna Pavone

Department of Mathematics and Computer Science, University of Palermo, Italy

Antonio Scardace

Department of Mathematics and Computer Science, University of Catania, Italy

Cite As

Simone Faro, Francesco Pio Marino, Andrea Moschetto, Arianna Pavone, and Antonio Scardace. The Great Textual Hoax: Boosting Sampled String Matching with Fake Samples. In 12th International Conference on Fun with Algorithms (FUN 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 291, pp. 13:1-13:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.FUN.2024.13

Abstract

Sampled String Matching is presented as an efficient solution to the string matching problem, aiming to ease the space constraints of indexed string matching while purportedly reducing search times for online solutions. Although the problem dates back to 1991, practical solutions have emerged only recently. These purportedly accelerate online searches by up to 35 times compared to conventional methods, using a partial index that occupies a mere 5% of the text size. This paper examines one of the latest and most effective text sampling techniques, character distance sampling, which samples the distances between occurrences of characters from a selected alphabet within the text. Specifically, we introduce fake samples while remaining honest! In other words, the study reveals that strategically introducing fake samples within the sampled sequence cuts the required index space by almost half without compromising the algorithm’s correctness. Additionally, since efficiency is everything, this approach in turn purportedly enhances the algorithm’s efficiency under specific conditions.
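
To make the idea of character distance sampling more concrete, the following is a minimal Python sketch of the basic scheme as the abstract describes it: record where characters of a chosen pivot set occur, keep the distances between consecutive occurrences, search the pattern's distance sequence inside the text's distance sequence, and verify each candidate against the original text. The function names, the `pivots` parameter, and the choice to store sample positions explicitly are assumptions made for illustration and are not taken from the paper. The fake-sample refinement is deliberately not shown, since the abstract does not describe its mechanics.

```python
def sample_positions(s, pivots):
    """Positions in s holding a pivot character (hypothetical helper)."""
    return [i for i, ch in enumerate(s) if ch in pivots]


def distances(positions):
    """Distances between consecutive sampled positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


def sampled_search(text, pattern, pivots):
    """Return all start positions of pattern in text.

    Illustrative only: a practical index would store the distance
    sequence compactly instead of the full list of positions.
    """
    m = len(pattern)
    t_pos = sample_positions(text, pivots)
    p_pos = sample_positions(pattern, pivots)

    if not p_pos:
        # The pattern contains no pivot character, so the sampled
        # sequence gives no filter; fall back to a plain scan.
        return [i for i in range(len(text) - m + 1)
                if text.startswith(pattern, i)]

    t_dist = distances(t_pos)
    p_dist = distances(p_pos)
    k = len(p_dist)

    hits = []
    for j in range(len(t_pos) - k):
        # Filter: the pattern's distances must match the text's here.
        if t_dist[j:j + k] == p_dist:
            # Align the first pivot of the pattern with the j-th sample.
            start = t_pos[j] - p_pos[0]
            # Verify: a distance match is only a candidate.
            if start >= 0 and text.startswith(pattern, start):
                hits.append(start)
    return hits


if __name__ == "__main__":
    print(sampled_search("abracadabra", "abra", {"a"}))
```

In this sketch the distance comparison acts only as a filter; correctness rests on the final verification against the original text, which is why a sampled index can be small without producing false matches.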

Subject Classification

ACM Subject Classification

Theory of computation → Pattern matching
Information systems → Information retrieval

Keywords

string matching
sampling

References

Alberto Apostolico. The myriad virtues of subword trees. In Alberto Apostolico and Zvi Galil, editors, Combinatorial Algorithms on Words, pages 85-96, Berlin, Heidelberg, 1985. Springer Berlin Heidelberg.
Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Commun. ACM, 20(10):762-772, October 1977. URL: https://doi.org/10.1145/359842.359859.
Francisco Claude Faust, Gonzalo Navarro, Hannu Peltola, Leena Salmela, and Jorma Tarhio. String matching with alphabet sampling. Journal of Discrete Algorithms, 11, December 2010. URL: https://doi.org/10.1016/j.jda.2010.09.004.
M. Crochemore. Speeding up two string-matching algorithms. Algorithmica, 12(4):247-267, 1994. URL: https://doi.org/10.1007/BF01185427.
Simone Faro and Thierry Lecroq. The exact online string matching problem: A review of the most recent results. ACM Comput. Surv., 45(2), March 2013. URL: https://doi.org/10.1145/2431211.2431212.
Simone Faro, Thierry Lecroq, Stefano Borzi, Simone Di Mauro, and Alessandro Maggio. The string matching algorithms research tool. In Proceedings of the Prague Stringology Conference 2016, pages 99-111. Department of Theoretical Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2016. URL: http://www.stringology.org/event/2016/p09.html.
Simone Faro and Francesco Pio Marino. Reducing time and space in indexed string matching by characters distance text sampling. In Jan Holub and Jan Zdárek, editors, Prague Stringology Conference 2020, Prague, Czech Republic, August 31 - September 2, 2020, pages 148-159. Czech Technical University in Prague, Faculty of Information Technology, Department of Theoretical Computer Science, 2020. URL: http://www.stringology.org/event/2020/p13.html.
Simone Faro, Francesco Pio Marino, and Arianna Pavone. Efficient online string matching based on characters distance text sampling. Algorithmica, 82(11):3390-3412, 2020. URL: https://doi.org/10.1007/S00453-020-00732-4.
Simone Faro, Francesco Pio Marino, and Arianna Pavone. Enhancing characters distance text sampling by condensed alphabets. In Claudio Sacerdoti Coen and Ivano Salvo, editors, Proceedings of the 22nd Italian Conference on Theoretical Computer Science, Bologna, Italy, September 13-15, 2021, volume 3072 of CEUR Workshop Proceedings, pages 1-15. CEUR-WS.org, 2021. URL: https://ceur-ws.org/Vol-3072/paper1.pdf.
Simone Faro, Francesco Pio Marino, and Arianna Pavone. Improved characters distance sampling for online and offline text searching. Theor. Comput. Sci., 946:113684, 2023. URL: https://doi.org/10.1016/J.TCS.2022.12.034.
Simone Faro, Francesco Pio Marino, Arianna Pavone, and Antonio Scardace. Towards an efficient text sampling approach for exact and approximate matching. In Jan Holub and Jan Zdárek, editors, Prague Stringology Conference 2021, Prague, Czech Republic, August 30-31, 2021, pages 75-89. Czech Technical University in Prague, Faculty of Information Technology, Department of Theoretical Computer Science, 2021. URL: http://www.stringology.org/event/2021/p07.html.
Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552-581, July 2005. URL: https://doi.org/10.1145/1082036.1082039.
R. Nigel Horspool. Practical fast searching in strings. Software: Practice and Experience, 10(6):501-506, 1980. URL: https://doi.org/10.1002/spe.4380100608.
Juha Kärkkäinen and Esko Ukkonen. Sparse suffix trees. In Jin-Yi Cai and Chak Kuen Wong, editors, Computing and Combinatorics, pages 219-230, Berlin, Heidelberg, 1996. Springer Berlin Heidelberg.
Jinil Kim, Peter Eades, Rudolf Fleischer, Seok-Hee Hong, Costas S. Iliopoulos, Kunsoo Park, Simon J. Puglisi, and Takeshi Tokuyama. Order-preserving matching. Theoretical Computer Science, 525:68-79, 2014. Advances in Stringology. URL: https://doi.org/10.1016/j.tcs.2013.10.006.
Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323-350, 1977. URL: https://doi.org/10.1137/0206024.
Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993. URL: https://doi.org/10.1137/0222058.
Paolo Ferragina and Gonzalo Navarro. Pizza&Chili. Available online: pizzachili.dcc.uchile.cl/, 2005.
Uzi Vishkin. Deterministic sampling–a new technique for fast pattern matching. SIAM Journal on Computing, 20(1):22-40, 1991. URL: https://doi.org/10.1137/0220002.
Andrew Chi-Chih Yao. The complexity of pattern matching for a random string. SIAM Journal on Computing, 8(3):368-387, 1979. URL: https://doi.org/10.1137/0208029.