Gapped Indexing for Consecutive Occurrences

Authors Philip Bille , Inge Li Gørtz , Max Rishøj Pedersen , Teresa Anna Steiner



PDF
Thumbnail PDF

File

LIPIcs.CPM.2021.10.pdf
  • Filesize: 0.98 MB
  • 19 pages

Document Identifiers

Author Details

Philip Bille
  • Technical University of Denmark, DTU Compute, Lyngby, Denmark
Inge Li Gørtz
  • Technical University of Denmark, DTU Compute, Lyngby, Denmark
Max Rishøj Pedersen
  • Technical University of Denmark, DTU Compute, Lyngby, Denmark
Teresa Anna Steiner
  • Technical University of Denmark, DTU Compute, Lyngby, Denmark

Cite AsGet BibTex

Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, and Teresa Anna Steiner. Gapped Indexing for Consecutive Occurrences. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 10:1-10:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.CPM.2021.10

Abstract

The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns P₁ and P₂ and a gap range [α, β] we can quickly find the consecutive occurrences of P₁ and P₂ with distance in [α, β], i.e., pairs of subsequent occurrences with distance within the range. We present data structures that use Õ(n) space and query time Õ(|P₁|+|P₂|+n^{2/3}) for existence and counting and Õ(|P₁|+|P₂|+n^{2/3}occ^{1/3}) for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using Õ(n) space must use Ω̃(|P₁| + |P₂| + √n) query time. To obtain our results we develop new techniques and ideas of independent interest including a new suffix tree decomposition and hardness of a variant of the set intersection problem.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures design and analysis
  • Theory of computation → Pattern matching
Keywords
  • String indexing
  • two patterns
  • consecutive occurrences
  • conditional lower bound

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Stephen Alstrup, Jacob Holm, Kristian de Lichtenberg, and Mikkel Thorup. Minimizing diameters of dynamic trees. In Proc. 24th ICALP, pages 270-280, 1997. Google Scholar
  2. Stephen Alstrup, Jacob Holm, and Mikkel Thorup. Maintaining center and median in dynamic trees. In Proc. 7th SWAT, pages 46-56, 2000. Google Scholar
  3. Stephen Alstrup and Theis Rauhe. Improved labeling scheme for ancestor queries. In Proc. 13th SODA, pages 947-953, 2002. Google Scholar
  4. Amihood Amir, Timothy M. Chan, Moshe Lewenstein, and Noa Lewenstein. On hardness of jumbled indexing. In Proc. 41st ICALP, pages 114-125, 2014. Google Scholar
  5. Amihood Amir, Tsvi Kopelowitz, Avivit Levy, Seth Pettie, Ely Porat, and B. Riva Shalom. Mind the gap: Essentially optimal algorithms for online dictionary matching with one gap. In Proc. 27th ISAAC, pages 12:1-12:12, 2016. Google Scholar
  6. Johannes Bader, Simon Gog, and Matthias Petri. Practical variable length gap pattern matching. In Proc. 15th SEA, pages 1-16, 2016. Google Scholar
  7. Philip Bille and Inge Li Gørtz. The tree inclusion problem: In linear space and faster. ACM Trans. Algorithms, 7(3):1-47, 2011. Google Scholar
  8. Philip Bille and Inge Li Gørtz. Substring range reporting. Algorithmica, 69(2):384-396, 2014. Google Scholar
  9. Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, Eva Rotenberg, and Teresa Anna Steiner. String Indexing for Top-k Close Consecutive Occurrences. In Proc. 40th FSTTCS, volume 182, pages 14:1-14:17, 2020. Google Scholar
  10. Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards. Theory Comput. Syst., 55(1):41-60, 2014. Google Scholar
  11. Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and David Kofoed Wind. String matching with variable length gaps. Theoret. Comput. Sci., 443, 2012. Announced at SPIRE 2010. Google Scholar
  12. Sudip Biswas, Arnab Ganguly, Rahul Shah, and Sharma V Thankachan. Ranked document retrieval for multiple patterns. Theor. Comput. Sci., 746:98-111, 2018. Google Scholar
  13. P Bucher and A Bairoch. A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In Proc. 2nd ISMB, pages 53-61, 1994. Google Scholar
  14. Manuel Cáceres, Simon J Puglisi, and Bella Zhukova. Fast indexes for gapped pattern matching. In Proc. 46th SOFSEM, pages 493-504, 2020. Google Scholar
  15. Hagai Cohen and Ely Porat. Fast set intersection and two-patterns matching. Theor. Comput. Sci., 411(40-42):3795-3800, 2010. Google Scholar
  16. Paolo Ferragina, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Two-dimensional substring indexing. J. Comput. Syst. Sci., 66(4):763-774, 2003. Google Scholar
  17. Greg N Frederickson. Ambivalent data structures for dynamic 2-edge-connectivity and k smallest spanning trees. SIAM J. Comput., 26(2):484-538, 1997. Google Scholar
  18. Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with o(1) worst case access time. J. ACM, 31(3):538-544, 1984. Google Scholar
  19. Kimmo Fredriksson and Szymon Grabowski. Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance. Inf. Retr., 11(4):335-357, 2008. Google Scholar
  20. Isaac Goldstein, Tsvi Kopelowitz, Moshe Lewenstein, and Ely Porat. Conditional Lower Bounds for Space/Time Tradeoffs. In Proc. 15th WADS, pages 421-436. Springer, 2017. Google Scholar
  21. Tuukka Haapasalo, Panu Silvasti, Seppo Sippu, and Eljas Soisalon-Soininen. Online dictionary matching with variable-length gaps. In Proc. 10th SEA, pages 76-87, 2011. Google Scholar
  22. K Hofmann, P Bucher, L Falquet, and A Bairoch. The PROSITE database, its status in 1999. Nucleic Acids Res, 27(1):215-219, 1999. Google Scholar
  23. Wing-Kai Hon, Manish Patil, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Indexes for document retrieval with relevance. In Space-Efficient Data Structures, Streams, and Algorithms - Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday, pages 351-362, 2013. Google Scholar
  24. Wing-Kai Hon, Manish Patil, Rahul Shah, and Shih-Bin Wu. Efficient index for retrieving top-k most frequent documents. J. Discrete Algorithms, 8(4):402-417, 2010. Google Scholar
  25. Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Space-efficient frameworks for top-k string retrieval. J. ACM, 61(2):1-36, 2014. Announced at 50th FOCS. Google Scholar
  26. Wing-Kai Hon, Sharma V. Thankachan, Rahul Shah, and Jeffrey Scott Vitter. Faster compressed top-k document retrieval. In Proc. 23rd DCC, pages 341-350, 2013. Google Scholar
  27. Costas S Iliopoulos and M Sohel Rahman. Indexing factors with gaps. Algorithmica, 55(1):60-70, 2009. Google Scholar
  28. Orgad Keller, Tsvi Kopelowitz, and Moshe Lewenstein. Range non-overlapping indexing and successive list indexing. In Proc. 11th WADS, pages 625-636, 2007. Google Scholar
  29. Tsvi Kopelowitz, Seth Pettie, and Ely Porat. Higher lower bounds from the 3sum conjecture. In Proc. 27th SODA, pages 1272-1287, 2016. Google Scholar
  30. Kasper Green Larsen, J Ian Munro, Jesper Sindahl Nielsen, and Sharma V Thankachan. On hardness of several string indexing problems. Theoret. Comput. Sci., 582:74-82, 2015. Google Scholar
  31. Moshe Lewenstein. Indexing with gaps. In Proc. 18th SPIRE, pages 135-143, 2011. Google Scholar
  32. Gerhard Mehldau and Gene Myers. A system for pattern matching applications on biosequences. Bioinformatics, 9(3):299-314, 1993. Google Scholar
  33. J. Ian Munro, Gonzalo Navarro, Jesper Sindahl Nielsen, Rahul Shah, and Sharma V. Thankachan. Top-k term-proximity in succinct space. Algorithmica, 78(2):379-393, 2017. Announced at 25th ISAAC. Google Scholar
  34. J. Ian Munro, Gonzalo Navarro, Rahul Shah, and Sharma V. Thankachan. Ranked document selection. Theor. Comput. Sci., 812:149-159, 2020. Google Scholar
  35. Eugene W. Myers. Approximate matching of network expressions with spacers. J. Comput. Bio., 3(1):33-51, 1992. Google Scholar
  36. Gonzalo Navarro. Spaces, trees, and colors: The algorithmic landscape of document retrieval on sequences. ACM Comput. Surv., 46(4):1-47, 2014. Google Scholar
  37. Gonzalo Navarro and Yakov Nekrich. Time-optimal top-k document retrieval. SIAM J. Comput., 46(1):80-113, 2017. Announced at 23rd SODA. Google Scholar
  38. Gonzalo Navarro and Mathieu Raffinot. Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Bio., 10(6):903-923, 2003. Google Scholar
  39. Gonzalo Navarro and Sharma V. Thankachan. New space/time tradeoffs for top-k document retrieval on sequences. Theor. Comput. Sci., 542:83-97, 2014. Announced at 20th SPIRE. Google Scholar
  40. Gonzalo Navarro and Sharma V. Thankachan. Reporting consecutive substring occurrences under bounded gap constraints. Theor. Comput. Sci., 638:108-111, 2016. Announced at 26th CPM. Google Scholar
  41. Yakov Nekrich and Gonzalo Navarro. Sorted range reporting. In Proc 13th SWAT, pages 271-282, 2012. Google Scholar
  42. Rahul Shah, Cheng Sheng, Sharma V. Thankachan, and Jeffrey Scott Vitter. Top-k document retrieval in external memory. In Proc. 21st ESA, pages 803-814, 2013. Google Scholar
  43. Dekel Tsur. Top-k document retrieval in optimal space. Inf. Process. Lett., 113(12):440-443, 2013. Google Scholar
  44. Peter Weiner. Linear pattern matching algorithms. In Proc. 14th FOCS, pages 1-11, 1973. Google Scholar
  45. Gelin Zhou. Two-dimensional range successor in optimal time and almost linear space. Inf. Process. Lett., 116(2):171-174, 2016. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail