String Indexing with Compressed Patterns

Bille, Philip; Gørtz, Inge Li; Steiner, Teresa Anna

doi:10.4230/LIPIcs.STACS.2020.10

File

LIPIcs.STACS.2020.10.pdf

Filesize: 0.5 MB
13 pages

Document Identifiers

DOI: 10.4230/LIPIcs.STACS.2020.10
URN: urn:nbn:de:0030-drops-118716

Author Details

Philip Bille

Technical University of Denmark, DTU Compute, Denmark

Inge Li Gørtz

Technical University of Denmark, DTU Compute, Denmark

Teresa Anna Steiner

Technical University of Denmark, DTU Compute, Denmark

Cite As

Philip Bille, Inge Li Gørtz, and Teresa Anna Steiner. String Indexing with Compressed Patterns. In 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 154, pp. 10:1-10:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.STACS.2020.10

Abstract

Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv 1977 (LZ77) compression scheme. Along the way we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern.
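To make "the compressed size of the pattern" concrete, the following is a minimal sketch of a textbook LZ77-style parse, in which each phrase copies an earlier substring and then adds one explicit character. It is only an illustration: the abstract does not specify which LZ77 variant the paper uses, the function names `lz77_parse` and `lz77_decode` are hypothetical, and the brute-force search here is quadratic. It is not the paper's data structure and does not process queries in compressed form.

```python
def lz77_parse(s):
    """Greedy LZ77-style parse into (source, length, next_char) phrases.

    Assumption: self-referential variant, where a copy may overlap the
    position being encoded. Brute-force search, for illustration only.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best_len, best_src = 0, 0
        for j in range(i):  # candidate start of an earlier occurrence
            length = 0
            # Stop one character early so every phrase ends in an explicit char.
            while i + length < n - 1 and s[j + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, j
        phrases.append((best_src, best_len, s[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the string; copies run char by char so overlaps work."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)


pattern = "abababab"
phrases = lz77_parse(pattern)
print(len(pattern), len(phrases))
```

For a repetitive pattern such as the one above, the number of phrases is much smaller than the pattern length, which is why a query time expressed in the number of phrases can beat decompressing the pattern first.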

Subject Classification

ACM Subject Classification

Theory of computation → Design and analysis of algorithms

Keywords

string indexing
compression
pattern matching

