Faster Sliding Window String Indexing in Streams

Authors Philip Bille , Paweł Gawrychowski , Inge Li Gørtz , Simon R. Tarnow




File

LIPIcs.CPM.2024.8.pdf
  • Filesize: 1.09 MB
  • 14 pages


Author Details

Philip Bille
  • Technical University of Denmark, Lyngby, Denmark
Paweł Gawrychowski
  • Institute of Computer Science, University of Wrocław, Poland
Inge Li Gørtz
  • Technical University of Denmark, Lyngby, Denmark
Simon R. Tarnow
  • Technical University of Denmark, Lyngby, Denmark

Cite As

Philip Bille, Paweł Gawrychowski, Inge Li Gørtz, and Simon R. Tarnow. Faster Sliding Window String Indexing in Streams. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 8:1-8:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.CPM.2024.8

Abstract

The classical string indexing problem asks to preprocess the input string S for efficient pattern matching queries. Bille, Fischer, Gørtz, Pedersen, and Stordalen [CPM 2023] generalized this to the streaming sliding window string indexing problem, where the input string S arrives as a stream, and we are asked to maintain an index of the last w characters, called the window. Further, at any point in time, a pattern P might appear, again given as a stream, and all occurrences of P in the current window must be output. We require that the time to process each character of the text or the pattern is bounded in the worst case. It appears that standard string indexing structures, such as suffix trees, do not provide an efficient solution in such a setting: to obtain a good worst-case bound, they necessarily need to work right-to-left, and we cannot reverse the pattern while keeping a worst-case guarantee on the time to process each of its characters. Nevertheless, it is possible to obtain a bound of 𝒪(log w) (with high probability) by maintaining a hierarchical structure of multiple suffix trees. We significantly improve this upper bound by designing a black-box reduction to maintaining a suffix tree under prepending characters to the current text. By plugging in the known results, this allows us to obtain a bound of 𝒪(log log w + log log σ) (with high probability), where σ is the size of the alphabet. Further, we introduce an even more general problem, called streaming dynamic window string indexing, where the goal is to maintain the current text under adding and deleting characters at either end, and design a similar black-box reduction.
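
To make the problem statement concrete, the following is a minimal naive baseline in Python, not the data structure from the paper. It only illustrates the interface described above: characters of S arrive one at a time, the last w characters form the window, and a query reports every occurrence of P in the current window. The class name NaiveSlidingWindowIndex and its methods are hypothetical names chosen for this sketch. Each query rescans the whole window, so it gives none of the worst-case per-character guarantees the abstract discusses.

```python
from collections import deque
from typing import List


class NaiveSlidingWindowIndex:
    """Illustrative baseline only: stores the last w characters and
    answers each pattern query by rescanning the whole window."""

    def __init__(self, w: int) -> None:
        if w <= 0:
            raise ValueError("window size w must be positive")
        self.w = w
        # A bounded deque drops the oldest character automatically
        # once more than w characters have been appended.
        self.window = deque(maxlen=w)

    def append(self, ch: str) -> None:
        """Process the next character of the text stream."""
        self.window.append(ch)

    def occurrences(self, pattern: str) -> List[int]:
        """Return the start positions (0-based, relative to the start of
        the current window) of all occurrences of pattern."""
        if not pattern:
            return []
        text = "".join(self.window)
        hits = []
        start = text.find(pattern)
        while start != -1:
            hits.append(start)
            # Advance by one so overlapping occurrences are also found.
            start = text.find(pattern, start + 1)
        return hits


if __name__ == "__main__":
    index = NaiveSlidingWindowIndex(5)
    for ch in "abracadabra":
        index.append(ch)
    # The window now holds the last five characters, "dabra".
    print(index.occurrences("a"))
    print(index.occurrences("abra"))
```

In this sketch the pattern is passed as a complete string for simplicity; in the streaming setting of the abstract the pattern also arrives character by character, which is one reason a plain rescan like this does not meet the required bounds.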

Subject Classification

ACM Subject Classification
  • Theory of computation → Streaming, sublinear and near linear time algorithms
Keywords
  • data structures
  • pattern matching
  • string indexing


References

  1. Amihood Amir, Gianni Franceschini, Roberto Grossi, Tsvi Kopelowitz, Moshe Lewenstein, and Noa Lewenstein. Managing Unbounded-Length Keys in Comparison-Driven Data Structures with Applications to Online Indexing. SIAM J. Comput., 43(4):1396-1416, 2014. URL: https://doi.org/10.1137/110836377.
  2. Amihood Amir, Tsvi Kopelowitz, Moshe Lewenstein, and Noa Lewenstein. Towards real-time suffix tree construction. In Proc. 12th SPIRE, volume 3772, pages 67-78. Springer, 2005. URL: https://doi.org/10.1007/11575832_9.
  3. Amihood Amir and Igor Nor. Real-time indexing over fixed finite alphabets. In Proc. 19th SODA, pages 1086-1095, 2008. URL: http://dl.acm.org/citation.cfm?id=1347082.1347201.
  4. Arne Andersson and Mikkel Thorup. Dynamic ordered sets with exponential search trees. J. ACM, 54(3):13, 2007.
  5. Michael A. Bender and Martín Farach-Colton. The LCA problem revisited. In Gaston H. Gonnet and Alfredo Viola, editors, LATIN 2000: Theoretical Informatics, pages 88-94, Berlin, Heidelberg, 2000. Springer Berlin Heidelberg.
  6. Philip Bille, Johannes Fischer, Inge Li Gørtz, Max Rishøj Pedersen, and Tord Joakim Stordalen. Sliding window string indexing in streams. In Proc. 34th CPM, pages 4:1-4:18, 2023. URL: https://doi.org/10.4230/LIPICS.CPM.2023.4.
  7. Dany Breslauer and Giuseppe F. Italiano. Near real-time suffix tree construction via the fringe marked ancestor problem. J. Discrete Algorithms, 18:32-48, 2013. URL: https://doi.org/10.1016/j.jda.2012.07.003.
  8. Andrej Brodnik and Matevz Jekovec. Sliding suffix tree. Algorithms, 11(8):118, 2018. URL: https://doi.org/10.3390/a11080118.
  9. Tyng-Ruey Chuang and Benjamin Goldberg. Real-time deques, multihead Turing machines, and purely functional programming. In Proc. 6th FPCA, pages 289-298, 1993.
  10. Martin Dietzfelbinger and Friedhelm Meyer auf der Heide. A new universal class of hash functions and dynamic hashing in real time. In Proc. 17th ICALP, pages 6-19, 1990.
  11. Martin Farach-Colton, Paolo Ferragina, and S. Muthukrishnan. On the sorting-complexity of suffix tree construction. J. ACM, 47(6):987-1011, 2000.
  12. Edward R. Fiala and Daniel H. Greene. Data compression with finite windows. Commun. ACM, 32(4):490-505, 1989. URL: https://doi.org/10.1145/63334.63341.
  13. N. J. Fine and H. S. Wilf. Uniqueness theorems for periodic functions. Proc. Amer. Math. Soc., 16(1):109-114, 1965.
  14. Johannes Fischer and Pawel Gawrychowski. Alphabet-dependent string searching with wexponential search trees. In Proc. 26th CPM, pages 160-171, 2015. URL: https://doi.org/10.1007/978-3-319-19929-0_14.
  15. Rob R. Hoogerwoord. A symmetric set of efficient list operations. J. Funct. Program., 2(4):505-513, 1992.
  16. Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323-350, 1977. URL: https://doi.org/10.1137/0206024.
  17. Tsvi Kopelowitz. On-line indexing for general alphabets via predecessor queries on subsets of an ordered list. In Proc. 53rd FOCS, pages 283-292, 2012. URL: https://doi.org/10.1109/FOCS.2012.79.
  18. S. Rao Kosaraju. Real-time pattern matching and quasi-real-time construction of suffix trees (preliminary version). In Proc. 26th STOC, pages 310-316, 1994. URL: https://doi.org/10.1145/195058.195170.
  19. Gregory Kucherov and Yakov Nekrich. Full-Fledged Real-Time Indexing for Constant Size Alphabets. Algorithmica, 79(2):387-400, 2017. URL: https://doi.org/10.1007/s00453-016-0199-7.
  20. N. Jesper Larsson. Structures of String Matching and Data Compression. PhD thesis, Lund University, Sweden, 1999. URL: http://lup.lub.lu.se/record/19255.
  21. Joong Chae Na, Alberto Apostolico, Costas S. Iliopoulos, and Kunsoo Park. Truncated suffix trees and their application to data compression. Theor. Comput. Sci., 304(1-3):87-101, 2003. URL: https://doi.org/10.1016/S0304-3975(03)00053-7.
  22. Martin Senft and Tomás Dvorák. Sliding CDAWG perfection. In Proc. 15th SPIRE, pages 109-120, 2008. URL: https://doi.org/10.1007/978-3-540-89097-3_12.
  23. Peter Weiner. Linear pattern matching algorithms. In Proc. 14th SWAT, pages 1-11, 1973.