Compressed Indexing for Consecutive Occurrences

Gawrychowski, Paweł; Gourdel, Garance; Starikovskaya, Tatiana; Steiner, Teresa Anna

doi:10.4230/LIPIcs.CPM.2023.12

Abstract

The fundamental question considered in algorithms on strings is that of indexing, that is, preprocessing a given string for specific queries. By now we have a number of efficient solutions for this problem when the queries ask for an exact occurrence of a given pattern P. However, practical applications motivate the necessity of considering more complex queries, for example concerning near occurrences of two patterns. Recently, Bille et al. [CPM 2021] introduced a variant of such queries, called gapped consecutive occurrences, in which a query consists of two patterns P₁ and P₂ and a range [a,b], and one must find all consecutive occurrences (q₁,q₂) of P₁ and P₂ such that q₂-q₁ ∈ [a,b]. By their results, we cannot hope for a very efficient indexing structure for such queries, even if a = 0 is fixed (although at the same time they provided a non-trivial upper bound). Motivated by this, we focus on a text given as a straight-line program (SLP) and design an index taking space polynomial in the size of the grammar that answers such queries in time optimal up to polylog factors.

Amir Abboud, Arturs Backurs, Karl Bringmann, and Marvin Künnemann. Fine-grained complexity of analyzing compressed data: Quantifying improvements over decompress-and-solve. In Proc. 58th FOCS, pages 192-203, 2017.
Amir Abboud, Arturs Backurs, Karl Bringmann, and Marvin Künnemann. Impossibility results for grammar-compressed linear algebra. In Proc. 34th NeurIPS, pages 8810-8823, 2020.
Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Fast prefix search in little space, with applications. In Proc. 18th ESA, pages 427-438, 2010.
Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, and Teresa Anna Steiner. Gapped indexing for consecutive occurrences. In Proc. 32nd CPM, pages 10:1-10:19, 2021.
Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar-compressed strings and trees. SIAM J. Comput., 44(3):513-539, 2015.
Timothy M. Chan. Persistent predecessor search and orthogonal point location on the word RAM. ACM Trans. Algorithms, 9(3):22:1-22:22, 2013.
Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, April Rasala, Amit Sahai, and Abhi Shelat. Approximating the smallest grammar: Kolmogorov complexity in natural models. In Proc. 34th STOC, pages 792-801, 2002.
Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms, 17(1):8:1-8:39, 2021.
Francisco Claude and Gonzalo Navarro. Improved grammar-based compressed indexes. In Proc. 19th SPIRE, pages 180-192, 2012.
Francisco Claude, Gonzalo Navarro, and Alejandro Pacheco. Grammar-compressed indexes with logarithmic search time. J. Comput. Syst. Sci., 118:53-74, 2021.
Maxime Crochemore. Constant-space string-matching. In Proc. 8th FSTTCS, pages 80-87, 1988.
Diego Díaz-Domínguez, Gonzalo Navarro, and Alejandro Pacheco. An LMS-based grammar self-index with local consistency properties. In Proc. 28th SPIRE, 2021.
Paolo Ferragina and Rossano Venturini. Indexing compressed text. In Encyclopedia of Database Systems (2nd ed.). Springer, 2018.
Nathan J. Fine and Herbert S. Wilf. Uniqueness theorems for periodic functions. Proc. Am. Math. Soc., 16(1):109-114, 1965.
Johannes Fischer, Travis Gagie, Tsvi Kopelowitz, Moshe Lewenstein, Veli Mäkinen, Leena Salmela, and Niko Välimäki. Forbidden patterns. In Proc. 10th LATIN, pages 327-337, 2012.
Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi. LZ77-based self-indexing with faster pattern matching. In Proc. 11th LATIN, pages 731-742, 2014.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in BWT-runs bounded space. In Proc. 29th SODA, pages 1459-1477, 2018.
Pawel Gawrychowski, Adam Karczmarz, Tomasz Kociumaka, Jakub Lacki, and Piotr Sankowski. Optimal dynamic strings. In Proc. 29th SODA, pages 1509-1528, 2018.
Daniel Gibney and Sharma V. Thankachan. Text indexing for regular expression matching. Algorithms, 14(5):133, 2021.
Leszek Gąsieniec, Roman M. Kolpakov, Igor Potapov, and Paul Sant. Real-time traversal in grammar-based compressed files. In Proc. 15th DCC, page 458, 2005.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. String retrieval for multi-pattern queries. In Proc. 17th SPIRE, pages 55-66, 2010.
Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Document listing for queries with excluded pattern. In Proc. 23rd CPM, pages 185-195, 2012.
Richard M Karp and Michael O Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev, 31(2):249-260, 1987.
John C. Kieffer and En-Hui Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory, 46(3):737-754, 2000.
Tsvi Kopelowitz and Robert Krauthgamer. Color-distance oracles and snippets. In Proc. 27th CPM, pages 24:1-24:10, 2016.
Tsvi Kopelowitz, Seth Pettie, and Ely Porat. Higher lower bounds from the 3SUM conjecture. In Proc. 27th SODA, pages 1272-1287, 2016.
Kasper Green Larsen, J. Ian Munro, Jesper Sindahl Nielsen, and Sharma V. Thankachan. On hardness of several string indexing problems. Theor. Comput. Sci., 582:74-82, 2015.
Moshe Lewenstein. Orthogonal range searching for text indexing. In Space-Efficient Data Structures, Streams, and Algorithms, pages 267-302, 2013.
Veli Mäkinen and Gonzalo Navarro. Compressed text indexing. In Encyclopedia of Algorithms, pages 394-397. Springer New York, 2016.
Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935-948, 1993.
S. Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc. 13th SODA, pages 657-666, 2002.
Gonzalo Navarro and Sharma V. Thankachan. Reporting consecutive substring occurrences under bounded gap constraints. Theor. Comput. Sci., 638:108-111, 2016.
Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Fully dynamic data structure for LCE queries in compressed space. In Proc. 41st MFCS, volume 58, pages 72:1-72:15, 2016.
Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci., 302(1-3):211-222, 2003.
Peter Weiner. Linear pattern matching algorithms. In Proc. 14th SWAT, pages 1-11, 1973.

Compressed Indexing for Consecutive Occurrences

Authors Paweł Gawrychowski, Garance Gourdel, Tatiana Starikovskaya, Teresa Anna Steiner

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

Compressed Indexing for Consecutive Occurrences

Authors Paweł Gawrychowski, Garance Gourdel, Tatiana Starikovskaya, Teresa Anna Steiner

File

Document Identifiers

Related Versions

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

References

Thanks for your feedback!

Could not send message