String Indexing with Compressed Patterns

Authors Philip Bille , Inge Li Gørtz , Teresa Anna Steiner



PDF
Thumbnail PDF

File

LIPIcs.STACS.2020.10.pdf
  • Filesize: 0.5 MB
  • 13 pages

Document Identifiers

Author Details

Philip Bille
  • Technical University of Denmark, DTU Compute, Denmark
Inge Li Gørtz
  • Technical University of Denmark, DTU Compute, Denmark
Teresa Anna Steiner
  • Technical University of Denmark, DTU Compute, Denmark

Cite As Get BibTex

Philip Bille, Inge Li Gørtz, and Teresa Anna Steiner. String Indexing with Compressed Patterns. In 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 154, pp. 10:1-10:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020) https://doi.org/10.4230/LIPIcs.STACS.2020.10

Abstract

Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv 1977 (LZ77) compression scheme. Along the way we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern.

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
Keywords
  • string indexing
  • compression
  • pattern matching

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Stephen Alstrup, Thore Husfeldt, and Theis Rauhe. Marked ancestor problems. In Proc. 39th FOCS, pages 534-543, 1998. Google Scholar
  2. Djamal Belazzougui and Gonzalo Navarro. Alphabet-independent compressed text indexing. ACM Trans. Algorithms, 10(4):23, 2014. Google Scholar
  3. Philip Bille, Mikko Berggren Ettienne, Inge Li Gørtz, and Hjalte Wedel Vildhøj. Time-space trade-offs for lempel-Ziv compressed indexing. Theoret. Comput. Sci., 713:66-77, 2018. Google Scholar
  4. Philip Bille, Inge Li Gørtz, Mathias Bæk Tejs Knudsen, Moshe Lewenstein, and Hjalte Wedel Vildhøj. Longest common extensions in sublinear space. In Proc. 26th CPM, pages 65-76, 2015. Google Scholar
  5. Francisco Claude and Gonzalo Navarro. Improved grammar-based compressed indexes. In Proc. 19th SPIRE, pages 180-192, 2012. Google Scholar
  6. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proc. 41st FOCS, pages 390-398, 2000. Google Scholar
  7. Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index. In Proc. 12th SODA, pages 269-278, 2001. Google Scholar
  8. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552-581, 2005. Google Scholar
  9. Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms, 3(2):20, 2007. Google Scholar
  10. Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with 0(1) worst case access time. J. ACM, 31(3):538-544, 1984. Google Scholar
  11. Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J Puglisi. LZ77-based self-indexing with faster pattern matching. In Proc. 11th LATIN, pages 731-742, 2014. Google Scholar
  12. Travis Gagie and Simon J Puglisi. Searching and indexing genomic databases via kernelization. Front. Bioeng. Biotechnol., 3:12, 2015. Google Scholar
  13. Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proc. 14th SODA, pages 841-850, 2003. Google Scholar
  14. Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. When indexing equals compression: Experiments with compressing suffix arrays and applications. In Proc. 15th SODA, pages 636-645, 2004. Google Scholar
  15. Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378-407, 2005. Google Scholar
  16. Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13(2):338-355, 1984. Google Scholar
  17. Juha Kärkkäinen and Erkki Sutinen. Lempel-Ziv index for q-grams. Algorithmica, 21(1):137-154, 1998. Google Scholar
  18. Juha Kärkkäinen and Esko Ukkonen. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proc. 3rd WSP, pages 141-155, 1996. Google Scholar
  19. Richard M Karp and Michael O Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev, 31(2):249-260, 1987. Google Scholar
  20. Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theoret. Comp. Sci., 483:115-133, 2013. Google Scholar
  21. Veli Mäkinen. Compact suffix array. In Proc. 11th CPM, pages 305-319, 2000. Google Scholar
  22. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J. Comput. Bio., 17(3):281-308, 2010. Google Scholar
  23. Shirou Maruyama, Masaya Nakahara, Naoya Kishiue, and Hiroshi Sakamoto. ESP-index: A compressed index based on edit-sensitive parsing. J. Discrete Algorithms, 18:100-112, 2013. Google Scholar
  24. Gonzalo Navarro. Indexing highly repetitive collections. In Proc. 23rd IWOCA, pages 274-279, 2012. Google Scholar
  25. Gonzalo Navarro. Compact data structures: A practical approach. Cambridge University Press, 2016. Google Scholar
  26. Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1):2, 2007. Google Scholar
  27. James A Storer and Thomas G Szymanski. Data compression via textual substitution. J. ACM, 29(4):928-951, 1982. Google Scholar
  28. Peter Weiner. Linear pattern matching algorithms. In Proc. 14th FOCS, pages 1-11, 1973. Google Scholar
  29. Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory, 23(3):337-343, 1977. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail