Compressed Indexing for Consecutive Occurrences

Authors Paweł Gawrychowski, Garance Gourdel, Tatiana Starikovskaya, Teresa Anna Steiner

Author Details

Paweł Gawrychowski
  • Institute of Computer Science, University of Wrocław, Poland
Garance Gourdel
  • DI/ENS, PSL Research University, IRISA Inria Rennes, France
Tatiana Starikovskaya
  • DI/ENS, PSL Research University, Paris, France
Teresa Anna Steiner
  • DTU Compute, Technical University of Denmark, Lyngby, Denmark

Cite As

Paweł Gawrychowski, Garance Gourdel, Tatiana Starikovskaya, and Teresa Anna Steiner. Compressed Indexing for Consecutive Occurrences. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 12:1-12:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.CPM.2023.12

Abstract

The fundamental question considered in algorithms on strings is that of indexing, that is, preprocessing a given string for specific queries. By now we have a number of efficient solutions for this problem when the queries ask for an exact occurrence of a given pattern P. However, practical applications motivate the necessity of considering more complex queries, for example concerning near occurrences of two patterns. Recently, Bille et al. [CPM 2021] introduced a variant of such queries, called gapped consecutive occurrences, in which a query consists of two patterns P₁ and P₂ and a range [a,b], and one must find all consecutive occurrences (q₁,q₂) of P₁ and P₂ such that q₂-q₁ ∈ [a,b]. By their results, we cannot hope for a very efficient indexing structure for such queries, even if a = 0 is fixed (although at the same time they provided a non-trivial upper bound). Motivated by this, we focus on a text given as a straight-line program (SLP) and design an index taking space polynomial in the size of the grammar that answers such queries in time optimal up to polylog factors.
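
To make the query concrete, here is a naive, uncompressed baseline in Python. It is only an illustration of the query semantics, not the SLP-based index from the paper. The abstract does not spell out what "consecutive" means, so the sketch assumes that (q₁,q₂) is a consecutive occurrence when q₁ < q₂, q₁ is an occurrence of P₁, q₂ is an occurrence of P₂, and no occurrence of either pattern starts strictly between them. The function names and 0-based positions are choices made for this example.

```python
def occurrences(text, pattern):
    """Return all 0-based start positions of pattern in text (overlaps included)."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def gapped_consecutive_occurrences(text, p1, p2, a, b):
    """Naively list consecutive occurrences (q1, q2) of p1 and p2 with a <= q2 - q1 <= b.

    Assumption (not stated in the abstract): a pair is consecutive when q1 < q2
    and no occurrence of p1 or p2 starts strictly between q1 and q2.
    """
    occ1 = set(occurrences(text, p1))
    occ2 = set(occurrences(text, p2))
    # Sorted start positions of all occurrences of either pattern.
    events = sorted(occ1 | occ2)
    result = []
    # Neighbouring events have nothing between them, so each neighbouring pair
    # (occurrence of p1, occurrence of p2) is a consecutive occurrence.
    for q1, q2 in zip(events, events[1:]):
        if q1 in occ1 and q2 in occ2 and a <= q2 - q1 <= b:
            result.append((q1, q2))
    return result


if __name__ == "__main__":
    # Occurrences of "a": 0, 3, 4; occurrences of "b": 2, 5.
    print(gapped_consecutive_occurrences("aXbaab", "a", "b", 0, 10))
```

This baseline scans the whole uncompressed text for every query, which is exactly the cost an index is meant to avoid; it is useful mainly as a reference for checking the output of a faster structure.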

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Theory of computation → Pattern matching
Keywords
  • Compressed indexing
  • two patterns
  • consecutive occurrences

References

  1. Amir Abboud, Arturs Backurs, Karl Bringmann, and Marvin Künnemann. Fine-grained complexity of analyzing compressed data: Quantifying improvements over decompress-and-solve. In Proc. 58th FOCS, pages 192-203, 2017.
  2. Amir Abboud, Arturs Backurs, Karl Bringmann, and Marvin Künnemann. Impossibility results for grammar-compressed linear algebra. In Proc. 34th NeurIPS, pages 8810-8823, 2020.
  3. Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Fast prefix search in little space, with applications. In Proc. 18th ESA, pages 427-438, 2010.
  4. Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, and Teresa Anna Steiner. Gapped indexing for consecutive occurrences. In Proc. 32nd CPM, pages 10:1-10:19, 2021.
  5. Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar-compressed strings and trees. SIAM J. Comput., 44(3):513-539, 2015.
  6. Timothy M. Chan. Persistent predecessor search and orthogonal point location on the word RAM. ACM Trans. Algorithms, 9(3):22:1-22:22, 2013.
  7. Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, April Rasala, Amit Sahai, and Abhi Shelat. Approximating the smallest grammar: Kolmogorov complexity in natural models. In Proc. 34th STOC, pages 792-801, 2002.
  8. Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms, 17(1):8:1-8:39, 2021.
  9. Francisco Claude and Gonzalo Navarro. Improved grammar-based compressed indexes. In Proc. 19th SPIRE, pages 180-192, 2012.
  10. Francisco Claude, Gonzalo Navarro, and Alejandro Pacheco. Grammar-compressed indexes with logarithmic search time. J. Comput. Syst. Sci., 118:53-74, 2021.
  11. Maxime Crochemore. Constant-space string-matching. In Proc. 8th FSTTCS, pages 80-87, 1988.
  12. Diego Díaz-Domínguez, Gonzalo Navarro, and Alejandro Pacheco. An LMS-based grammar self-index with local consistency properties. In Proc. 28th SPIRE, 2021.
  13. Paolo Ferragina and Rossano Venturini. Indexing compressed text. In Encyclopedia of Database Systems (2nd ed.). Springer, 2018.
  14. Nathan J. Fine and Herbert S. Wilf. Uniqueness theorems for periodic functions. Proc. Am. Math. Soc., 16(1):109-114, 1965.
  15. Johannes Fischer, Travis Gagie, Tsvi Kopelowitz, Moshe Lewenstein, Veli Mäkinen, Leena Salmela, and Niko Välimäki. Forbidden patterns. In Proc. 10th LATIN, pages 327-337, 2012.
  16. Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi. LZ77-based self-indexing with faster pattern matching. In Proc. 11th LATIN, pages 731-742, 2014.
  17. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in BWT-runs bounded space. In Proc. 29th SODA, pages 1459-1477, 2018.
  18. Pawel Gawrychowski, Adam Karczmarz, Tomasz Kociumaka, Jakub Lacki, and Piotr Sankowski. Optimal dynamic strings. In Proc. 29th SODA, pages 1509-1528, 2018.
  19. Daniel Gibney and Sharma V. Thankachan. Text indexing for regular expression matching. Algorithms, 14(5):133, 2021.
  20. Leszek Gąsieniec, Roman M. Kolpakov, Igor Potapov, and Paul Sant. Real-time traversal in grammar-based compressed files. In Proc. 15th DCC, page 458, 2005.
  21. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
  22. Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. String retrieval for multi-pattern queries. In Proc. 17th SPIRE, pages 55-66, 2010.
  23. Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Document listing for queries with excluded pattern. In Proc. 23rd CPM, pages 185-195, 2012.
  24. Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249-260, 1987.
  25. John C. Kieffer and En-Hui Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory, 46(3):737-754, 2000.
  26. Tsvi Kopelowitz and Robert Krauthgamer. Color-distance oracles and snippets. In Proc. 27th CPM, pages 24:1-24:10, 2016.
  27. Tsvi Kopelowitz, Seth Pettie, and Ely Porat. Higher lower bounds from the 3SUM conjecture. In Proc. 27th SODA, pages 1272-1287, 2016.
  28. Kasper Green Larsen, J. Ian Munro, Jesper Sindahl Nielsen, and Sharma V. Thankachan. On hardness of several string indexing problems. Theor. Comput. Sci., 582:74-82, 2015.
  29. Moshe Lewenstein. Orthogonal range searching for text indexing. In Space-Efficient Data Structures, Streams, and Algorithms, pages 267-302, 2013.
  30. Veli Mäkinen and Gonzalo Navarro. Compressed text indexing. In Encyclopedia of Algorithms, pages 394-397. Springer New York, 2016.
  31. Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935-948, 1993.
  32. S. Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc. 13th SODA, pages 657-666, 2002.
  33. Gonzalo Navarro and Sharma V. Thankachan. Reporting consecutive substring occurrences under bounded gap constraints. Theor. Comput. Sci., 638:108-111, 2016.
  34. Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Fully dynamic data structure for LCE queries in compressed space. In Proc. 41st MFCS, volume 58, pages 72:1-72:15, 2016.
  35. Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci., 302(1-3):211-222, 2003.
  36. Peter Weiner. Linear pattern matching algorithms. In Proc. 14th SWAT, pages 1-11, 1973.