Sliding Window String Indexing in Streams

Authors: Philip Bille, Johannes Fischer, Inge Li Gørtz, Max Rishøj Pedersen, Tord Joakim Stordalen




File

LIPIcs.CPM.2023.4.pdf
  • Filesize: 0.89 MB
  • 18 pages

Author Details

Philip Bille
  • DTU Compute, Technical University of Denmark, Lyngby, Denmark
Johannes Fischer
  • Department of Computer Science, Technische Universität Dortmund, Germany
Inge Li Gørtz
  • DTU Compute, Technical University of Denmark, Lyngby, Denmark
Max Rishøj Pedersen
  • DTU Compute, Technical University of Denmark, Lyngby, Denmark
Tord Joakim Stordalen
  • DTU Compute, Technical University of Denmark, Lyngby, Denmark

Acknowledgements

We thank the anonymous reviewers for their helpful comments.

Cite As

Philip Bille, Johannes Fischer, Inge Li Gørtz, Max Rishøj Pedersen, and Tord Joakim Stordalen. Sliding Window String Indexing in Streams. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 4:1-4:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.CPM.2023.4

Abstract

Given a string S over an alphabet Σ, the string indexing problem is to preprocess S to support efficient pattern matching queries: given a pattern string P, report all occurrences of P in S. In this paper we study the streaming sliding window string indexing problem. Here the string S arrives as a stream, one character at a time, and the goal is to maintain an index of the last w characters, called the window, for a specified parameter w. At any point in time a pattern matching query for a pattern P may arrive, also streamed one character at a time, and all occurrences of P within the current window must be returned. The problem naturally captures scenarios where we want to index the most recent data (i.e., the window) of a stream while supporting efficient pattern matching. Our main result is a simple O(w) space data structure that uses O(log w) time with high probability to process each character from both the input string S and any pattern string P. Reporting the occurrences of P uses additional constant time per reported occurrence. Compared to previous work in similar scenarios, this result is the first to achieve an efficient worst-case time per character from the input stream with high probability. We also consider a delayed variant of the problem, where a query may be answered at any point within the next δ characters that arrive from either stream. We present an O(w + δ) space data structure for this problem that improves the above time bounds to O(log(w/δ)). In particular, for a delay of δ = εw with constant ε > 0, we have log(w/δ) = log(1/ε) = O(1), so we obtain an O(w) space data structure with constant time processing per character. The key idea behind our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees.
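
To make the problem statement concrete, the following is a minimal brute-force sketch of the query semantics only. It is not the data structure from the paper: it uses no suffix trees, takes the whole pattern at once rather than as a stream, and spends time proportional to the window size on every query. The class name `NaiveWindowIndex` and its methods are hypothetical names chosen for illustration, and reporting 0-based start positions in the stream is an assumption.

```python
from collections import deque


class NaiveWindowIndex:
    """Brute-force reference for sliding window string indexing.

    Illustrative only: it keeps the last w characters and rescans
    them on every query, so it does not achieve the paper's bounds.
    """

    def __init__(self, w):
        if w <= 0:
            raise ValueError("window size w must be positive")
        self.w = w
        self.window = deque(maxlen=w)  # last w characters of S
        self.seen = 0                  # characters read from S so far

    def append(self, ch):
        """Consume one character of the stream S."""
        self.window.append(ch)  # deque drops the oldest character itself
        self.seen += 1

    def occurrences(self, pattern):
        """Return 0-based stream positions where pattern starts,
        counting only occurrences fully inside the current window."""
        if not pattern:
            return []
        text = "".join(self.window)
        offset = self.seen - len(text)  # stream position of text[0]
        hits = []
        i = text.find(pattern)
        while i != -1:
            hits.append(offset + i)
            i = text.find(pattern, i + 1)  # advance by 1 to keep overlaps
        return hits


index = NaiveWindowIndex(w=5)
for ch in "abracadabra":
    index.append(ch)

# The window now holds "dabra", i.e. stream positions 6..10.
print(index.occurrences("a"))     # [7, 10]
print(index.occurrences("abra"))  # [7]
print(index.occurrences("cad"))   # [] (starts before the window)
```

A baseline like this can serve as a correctness oracle when testing a faster index, since both should report the same set of positions for any stream and pattern.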

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
  • Theory of computation → Data structures design and analysis
Keywords
  • String indexing
  • pattern matching
  • sliding window
  • streaming
