String Indexing for Top-k Close Consecutive Occurrences

Authors Philip Bille , Inge Li Gørtz , Max Rishøj Pedersen , Eva Rotenberg , Teresa Anna Steiner



PDF
Thumbnail PDF

File

LIPIcs.FSTTCS.2020.14.pdf
  • Filesize: 0.52 MB
  • 17 pages

Document Identifiers

Author Details

Philip Bille
  • Technical University of Denmark, DTU Compute, Lyngby, Denmark
Inge Li Gørtz
  • Technical University of Denmark, DTU Compute, Lyngby, Denmark
Max Rishøj Pedersen
  • Technical University of Denmark, DTU Compute, Lyngby, Denmark
Eva Rotenberg
  • Technical University of Denmark, DTU Compute, Lyngby, Denmark
Teresa Anna Steiner
  • Technical University of Denmark, DTU Compute, Lyngby, Denmark

Acknowledgements

We thank anonymous reviewers of an earlier draft of this paper for their insightful comments and suggestions for improvement.

Cite AsGet BibTex

Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, Eva Rotenberg, and Teresa Anna Steiner. String Indexing for Top-k Close Consecutive Occurrences. In 40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 182, pp. 14:1-14:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.FSTTCS.2020.14

Abstract

The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string P, report all occurrences of P within S. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-k close consecutive occurrences problem (Sitcco). Here, a consecutive occurrence is a pair (i,j), i < j, such that P occurs at positions i and j in S and there is no occurrence of P between i and j, and their distance is defined as j-i. Given a pattern P and a parameter k, the goal is to report the top-k consecutive occurrences of P in S of minimal distance. The challenge is to compactly represent S while supporting queries in time close to the length of P and k. We give two time-space trade-offs for the problem. Let n be the length of S, m the length of P, and ε ∈ (0,1]. Our first result achieves O(nlog n) space and optimal query time of O(m+k), and our second result achieves linear space and query time O(m+k^{1+ε}). Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
  • Theory of computation → Data structures design and analysis
Keywords
  • String indexing
  • pattern matching
  • consecutive occurrences

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Stephen Alstrup, Jacob Holm, Kristian de Lichtenberg, and Mikkel Thorup. Minimizing diameters of dynamic trees. In Proc. 24th ICALP, pages 270-280, 1997. Google Scholar
  2. Stephen Alstrup, Jacob Holm, and Mikkel Thorup. Maintaining center and median in dynamic trees. In Proc. 7th SWAT, pages 46-56, 2000. Google Scholar
  3. Stephen Alstrup and Theis Rauhe. Improved labeling scheme for ancestor queries. In Proc. 13th SODA, pages 947-953, 2002. Google Scholar
  4. Johannes Bader, Simon Gog, and Matthias Petri. Practical variable length gap pattern matching. In Proc. 15th SEA, pages 1-16, 2016. Google Scholar
  5. Philip Bille and Inge Li Gørtz. The tree inclusion problem: In linear space and faster. ACM Trans. Algorithms, 7(3):1-47, 2011. Google Scholar
  6. Philip Bille and Inge Li Gørtz. Substring range reporting. Algorithmica, 69(2):384-396, 2014. Google Scholar
  7. Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards. Theory Comput. Syst., 55(1):41-60, 2014. Google Scholar
  8. Sudip Biswas, Arnab Ganguly, Rahul Shah, and Sharma V Thankachan. Ranked document retrieval for multiple patterns. Theor. Comput. Sci., 746:98-111, 2018. Google Scholar
  9. Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan. Time bounds for selection. J. Comput. Syst. Sci., 7(4):448–461, 1973. Google Scholar
  10. Gerth Stølting Brodal, Rolf Fagerberg, Mark Greve, and Alejandro López-Ortiz. Online sorted range reporting. In Proc. 30th ISAAC, pages 173-182, 2009. Google Scholar
  11. Manuel Cáceres, Simon J Puglisi, and Bella Zhukova. Fast indexes for gapped pattern matching. In Proc. 46th SOFSEM, pages 493-504, 2020. Google Scholar
  12. Timothy M Chan. Persistent predecessor search and orthogonal point location on the word ram. ACM Trans. Algorithms, 9(3):1-22, 2013. Google Scholar
  13. Hagai Cohen and Ely Porat. Range non-overlapping indexing. In Proc. 20th ISAAC, pages 1044-1053, 2009. Google Scholar
  14. James R. Driscoll, Neil Sarnak, Daniel Dominic Sleator, and Robert Endre Tarjan. Making data structures persistent. J. Comput. Syst. Sci., 38(1):86-124, 1989. Google Scholar
  15. Greg N Frederickson. Ambivalent data structures for dynamic 2-edge-connectivity and k smallest spanning trees. SIAM J. Comput., 26(2):484-538, 1997. Google Scholar
  16. Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with 0(1) worst case access time. J. ACM, 31(3):538-544, 1984. Google Scholar
  17. Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Succinct non-overlapping indexing. Algorithmica, 82(1):107-117, 2020. Google Scholar
  18. Wing-Kai Hon, Manish Patil, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Indexes for document retrieval with relevance. In Space-Efficient Data Structures, Streams, and Algorithms - Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday, pages 351-362, 2013. Google Scholar
  19. Wing-Kai Hon, Manish Patil, Rahul Shah, and Shih-Bin Wu. Efficient index for retrieving top-k most frequent documents. J. Discrete Algorithms, 8(4):402-417, 2010. Google Scholar
  20. Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Space-efficient frameworks for top-k string retrieval. J. ACM, 61(2):1-36, 2014. Announced at 50th FOCS. Google Scholar
  21. Wing-Kai Hon, Sharma V. Thankachan, Rahul Shah, and Jeffrey Scott Vitter. Faster compressed top-k document retrieval. In Proc. 23rd DCC, pages 341-350, 2013. Google Scholar
  22. Sahar Hooshmand, Paniz Abedin, M. Oguzhan Külekci, and Sharma V. Thankachan. Non-overlapping indexing - cache obliviously. In Proc. 29th CPM, pages 8:1-8:9, 2018. Google Scholar
  23. Costas S Iliopoulos and M Sohel Rahman. Indexing factors with gaps. Algorithmica, 55(1):60-70, 2009. Google Scholar
  24. Orgad Keller, Tsvi Kopelowitz, and Moshe Lewenstein. Range non-overlapping indexing and successive list indexing. In Proc. 11th WADS, pages 625-636, 2007. Google Scholar
  25. Moshe Lewenstein. Indexing with gaps. In Proc. 18th SPIRE, pages 135-143, 2011. Google Scholar
  26. J. Ian Munro, Gonzalo Navarro, Jesper Sindahl Nielsen, Rahul Shah, and Sharma V. Thankachan. Top-k term-proximity in succinct space. Algorithmica, 78(2):379-393, 2017. Announced at 25th ISAAC. Google Scholar
  27. J. Ian Munro, Gonzalo Navarro, Rahul Shah, and Sharma V. Thankachan. Ranked document selection. Theor. Comput. Sci., 812:149-159, 2020. Google Scholar
  28. Gonzalo Navarro. Spaces, trees, and colors: The algorithmic landscape of document retrieval on sequences. ACM Comput. Surv., 46(4):1-47, 2014. Google Scholar
  29. Gonzalo Navarro and Yakov Nekrich. Time-optimal top-k document retrieval. SIAM J. Comput., 46(1):80-113, 2017. Announced at 23rd SODA. Google Scholar
  30. Gonzalo Navarro and Sharma V. Thankachan. New space/time tradeoffs for top-k document retrieval on sequences. Theor. Comput. Sci., 542:83-97, 2014. Announced at 20th SPIRE. Google Scholar
  31. Gonzalo Navarro and Sharma V. Thankachan. Reporting consecutive substring occurrences under bounded gap constraints. In Proc. 26th CPM, pages 367-373, 2015. Google Scholar
  32. Rahul Shah, Cheng Sheng, Sharma V. Thankachan, and Jeffrey Scott Vitter. Top-k document retrieval in external memory. In Proc. 21st ESA, pages 803-814, 2013. Google Scholar
  33. Daniel D. Sleator and Robert Endre Tarjan. A data structure for dynamic trees. J. Comput. Syst. Sci., 26(3):362-391, 1983. Google Scholar
  34. Dekel Tsur. Top-k document retrieval in optimal space. Inf. Process. Lett., 113(12):440-443, 2013. Google Scholar
  35. Peter Weiner. Linear pattern matching algorithms. In Proc. 14th FOCS, pages 1-11, 1973. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail