Time-Space Trade-Offs for Lempel-Ziv Compressed Indexing

Bille, Philip; Ettienne, Mikko Berggren; Gørtz, Inge Li; Vildhøj, Hjalte Wedel

doi:10.4230/LIPIcs.CPM.2017.16

Abstract

Given a string S, the compressed indexing problem is to preprocess S into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of S while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. Let n, and z denote the size of the input string, and the compressed LZ77 string, respectively. We obtain the following time-space trade-offs. Given a pattern string P of length m, we can solve the problem in

(i) O(m + occ lglg n) time using O(z lg(n/z) lglg z) space, or
(ii) O(m(1 + lg^e z / lg(n/z)) + occ(lglg n + lg^e z)) time using O(z lg(n/z)) space, for any 0 < e < 1

In particular, (i) improves the leading term in the query time of the previous best solution from O(m lg m) to O(m) at the cost of increasing the space by a factor lglg z. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of O(m(1+lg^e z / lg(n/z))). However, for any polynomial compression ratio, i.e., z = O(n^{1-d}), for constant d > 0, this becomes O(m). Our index also supports extraction of any substring of length l in O(l + lg(n/z)) time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.

Arne Andersson and Stefan Nilsson. A new efficient radix sort. In Shafi Goldwasser, editor, Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS 1994), pages 714-721. IEEE Computer Society, 1994. URL: http://dx.doi.org/10.1109/SFCS.1994.365721.
Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Fast prefix search in little space, with applications. In Mark de Berg and Ulrich Meyer, editors, Proceedings of the 18th Annual European Symposium on Algorithms (ESA 2010), volume 6346 of LNCS, pages 427-438. Springer, 2010. URL: http://dx.doi.org/10.1007/978-3-642-15775-2_37.
Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, and Mathieu Raffinot. Composite repetition-aware data structures. In Ferdinando Cicalese, Ely Porat, and Ugo Vaccaro, editors, Proceedings of the 26th Annual Symposium on Combinatorial Pattern Matching (CPM 2015), volume 9133 of LNCS, pages 26-39. Springer, 2015. URL: http://dx.doi.org/10.1007/978-3-319-19929-0_3.
Djamal Belazzougui, Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Alberto Ordóñez Pereira, Simon J. Puglisi, and Yasuo Tabei. Queries on LZ-bounded encodings. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, Proceedings of the 2015 Data Compression Conference (DCC 2015), pages 83-92. IEEE, 2015. URL: http://dx.doi.org/10.1109/DCC.2015.69.
Philip Bille, Inge Li Gørtz, Benjamin Sach, and Hjalte Wedel Vildhøj. Time-space trade-offs for longest common extensions. In Juha Kärkkäinen and Jens Stoye, editors, Proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching (CPM 2012), volume 7354 of LNCS. Springer, 2012. URL: http://dx.doi.org/10.1007/978-3-642-31265-6_24.
Timothy M. Chan, Kasper Green Larsen, and Mihai Pǎtraşcu. Orthogonal range searching on the RAM, revisited. In Ferran Hurtado and Marc J. van Kreveld, editors, Proceedings of the 27th ACM Symposium on Computational Geometry (SocG 2011), pages 1-10. ACM, 2011. URL: http://dx.doi.org/10.1145/1998196.1998198.
Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. IEEE Trans. Inf. Theory, 51(7):2554-2576, 2005. URL: http://dx.doi.org/10.1109/TIT.2005.850116.
Francisco Claude, Antonio Fariña, Miguel A. Martínez-Prieto, and Gonzalo Navarro. Universal indexes for highly repetitive document collections. Inf. Syst., 61:1-23, 2016. URL: http://dx.doi.org/10.1016/j.is.2016.04.002.
Francisco Claude and Gonzalo Navarro. Improved grammar-based compressed indexes. In Liliana Calderón-Benavides, Cristina N. González-Caro, Edgar Chávez, and Nivio Ziviani, editors, Proceedings of the 19th International Symposium on String Processing and Information Retrieval (SPIRE 2012), volume 7608 of LNCS, pages 180-192. Springer, 2012. URL: http://dx.doi.org/10.1007/978-3-642-34109-0_19.
Martin Farach. Optimal suffix tree construction with large alphabets. In Anna Karlin, editor, Proceedings of the 38th Annual Symposium on Foundations of Computer Science (FOCS 1997), pages 137-143, Washington, DC, USA, 1997. IEEE Computer Society. URL: http://dx.doi.org/10.1109/SFCS.1997.646102.
Martin Farach and Mikkel Thorup. String matching in Lempel-Ziv compressed strings. Algorithmica, 20(4):388-404, 1998. URL: http://dx.doi.org/10.1007/PL00009202.
Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Avrim Blum, editor, Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS 2000), pages 390-398. IEEE Computer Society, 2000. URL: http://dx.doi.org/10.1109/SFCS.2000.892127.
Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index. In S. Rao Kosaraju, editor, Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2001), pages 269-278. ACM/SIAM, 2001. URL: http://dl.acm.org/citation.cfm?id=365411.365458.
Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552-581, 2005. URL: http://dx.doi.org/10.1145/1082036.1082039.
Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms, 3(2), 2007. URL: http://dx.doi.org/10.1145/1240233.1240243.
Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538-544, 1984. URL: http://dx.doi.org/10.1145/828.1884.
Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi. A faster grammar-based self-index. In Adrian-Horia Dediu and Carlos Martín-Vide, editors, Proceedings of the 6th International Conference on Language and Automata Theory and Applications (LATA 2012), volume 7183 of LNCS, pages 240-251. Springer, 2012. URL: http://dx.doi.org/10.1007/978-3-642-28332-1_21.
Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi. LZ77-based self-indexing with faster pattern matching. In Alberto Pardo and Alfredo Viola, editors, Proceedings of the 11th Latin American Symposium on Theoretical Informatics (LATIN 2014), volume 8392 of LNCS, pages 731-742. Springer, 2014. URL: http://dx.doi.org/10.1007/978-3-642-54423-1_63.
Travis Gagie and Simon J. Puglisi. Searching and indexing genomic databases via kernelization. Front. Bioeng. Biotechnol., 3:12, 2015. URL: http://dx.doi.org/10.3389/FBIOE.2015.00012.
Jean-Loup Gailly and Mark Adler. GNU zip, 1992. URL: http://www.gzip.org/.
Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Martin Farach-Colton, editor, Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2003), pages 841-850. ACM/SIAM, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644250.
Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. When indexing equals compression: Experiments with compressing suffix arrays and applications. In J. Ian Munro, editor, Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2004), pages 636-645. SIAM, 2004. URL: http://dl.acm.org/citation.cfm?id=982792.982888.
Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In F. Frances Yao and Eugene M. Luks, editors, Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (STOC 2000), pages 397-406. ACM, 2000. URL: http://dx.doi.org/10.1145/335305.335351.
Juha Kärkkäinen and Erkki Sutinen. Lempel-Ziv index for q-grams. Algorithmica, 21(1):137-154, 1998. URL: http://dx.doi.org/10.1007/PL00009205.
Juha Kärkkäinen and Esko Ukkonen. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Nivio Ziviani, Ricardo Baeza-Yates, and Katia Guimarães, editors, Proceedings of the 3rd South American Workshop on String Processing (WSP 1996), pages 141-155. Carleton University Press, 1996.
Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249-260, 1987. URL: http://dx.doi.org/10.1147/rd.312.0249.
Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theor. Comput. Sci., 483:115-133, 2013. URL: http://dx.doi.org/10.1016/j.tcs.2012.02.006.
Moshe Lewenstein. Orthogonal range searching for text indexing. In Andrej Brodnik, Alejandro López-Ortiz, Venkatesh Raman, and Alfredo Viola, editors, Space-Efficient Data Structures, Streams, and Algorithms: Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday, volume 8066 of LNCS, pages 267-302. Springer, 2013. URL: http://dx.doi.org/10.1007/978-3-642-40273-9_18.
Veli Mäkinen. Compact suffix array. In Raffaele Giancarlo and David Sankoff, editors, Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (CPM 2000), volume 1848 of LNCS, pages 305-319. Springer, 2000. URL: http://dx.doi.org/10.1007/3-540-45123-4_26.
Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol., 17(3):281-308, 2010. URL: http://dx.doi.org/10.1089/cmb.2009.0169.
Donald R. Morrison. Patricia - practical algorithm to retrieve information coded in alphanumeric. J. ACM, 15(4):514-534, October 1968. URL: http://dx.doi.org/10.1145/321479.321481.
Gonzalo Navarro. Indexing highly repetitive collections. In S. Arumugam and W. F. Smyth, editors, Proceedings of the 23rd International Workshop on Combinatorial Algorithms (IWOCA 2012), volume 7643 of LNCS, pages 274-279. Springer, 2012. URL: http://dx.doi.org/10.1007/978-3-642-35926-2_29.
Gonzalo Navarro. Compact Data Structures: A practical approach. Cambridge University Press, 2016. URL: http://dx.doi.org/10.1017/CBO9781316588284.
Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1), April 2007. URL: http://dx.doi.org/10.1145/1216370.1216372.
Benny Porat and Ely Porat. Exact and approximate pattern matching in the streaming model. In Daniel A. Spielman, editor, Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2009), pages 315-323. IEEE Computer Society, 2009. URL: http://dx.doi.org/10.1109/FOCS.2009.11.
Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci., 302(1-3):211-222, 2003. URL: http://dx.doi.org/10.1016/S0304-3975(02)00777-6.
Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337-343, 1977. URL: http://dx.doi.org/10.1109/TIT.1977.1055714.

Time-Space Trade-Offs for Lempel-Ziv Compressed Indexing

Authors Philip Bille, Mikko Berggren Ettienne, Inge Li Gørtz, Hjalte Wedel Vildhøj

File

Document Identifiers

Author Details

Cite As Get BibTex

Abstract

Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message