BAT-LZ out of hell

Lipták, Zsuzsanna; Masillo, Francesco; Navarro, Gonzalo

doi:10.4230/LIPIcs.CPM.2024.21

Abstract

Despite consistently yielding the best compression on repetitive text collections, the Lempel-Ziv parsing has resisted all attempts at offering relevant guarantees on the cost to access an arbitrary symbol. This makes it less attractive for use on compressed self-indexes and other compressed data structures. In this paper we introduce a variant we call BAT-LZ (for Bounded Access Time Lempel-Ziv) where the access cost is bounded by a parameter given at compression time. We design and implement a linear-space algorithm that, in time O(n log³ n), obtains a BAT-LZ parse of a text of length n by greedily maximizing each next phrase length. The algorithm builds on a new linear-space data structure that solves 5-sided orthogonal range queries in rank space, allowing updates to the coordinate where the one-sided queries are supported, in O(log³ n) time for both queries and updates. This time can be reduced to O(log² n) if O(n log n) space is used.
We design a second algorithm that chooses the sources for the phrases in a clever way, using an enhanced suffix tree, albeit no longer guaranteeing longest possible phrases. This algorithm is much slower in theory, but in practice it is comparable to the greedy parser, while achieving significantly superior compression. We then combine the two algorithms, resulting in a parser that always chooses the longest possible phrases, and the best sources for those. Our experimentation shows that, on most repetitive texts, our algorithms reach an access cost close to log₂ n on texts of length n, while incurring almost no loss in the compression ratio when compared with classical LZ-compression. Several open challenges are discussed at the end of the paper.
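
To make the idea of a greedy parse with bounded access cost concrete, here is a minimal brute-force sketch in Python. It is not the algorithm of the paper: it takes roughly quadratic time or worse and uses none of the range-query or suffix-tree machinery. The cost model is an assumption based on the abstract: each phrase copies an earlier substring and ends with one explicit character, an explicit character has access cost 0, and a copied character costs one more than the character it is copied from. The names `greedy_bat_lz`, `decode`, and the parameter `c` are hypothetical.

```python
def greedy_bat_lz(text, c):
    """Brute-force greedy parse; every position gets access cost <= c.

    Assumed model: a phrase is (source, length, explicit_char). Explicit
    characters cost 0; a copied character costs 1 + cost of its source.
    """
    n = len(text)
    cost = [0] * n
    phrases = []
    i = 0
    while i < n:
        best_len, best_src, best_tmp = 0, -1, []
        for j in range(i):
            tmp = []  # tentative costs of positions i, i+1, ... if copied from j
            k = 0
            # Stop one short of the end so an explicit character always remains.
            while i + k < n - 1 and text[j + k] == text[i + k]:
                src = j + k
                # The source may overlap the phrase being built.
                src_cost = cost[src] if src < i else tmp[src - i]
                if src_cost + 1 > c:
                    break  # copying this character would exceed the bound
                tmp.append(src_cost + 1)
                k += 1
            if k > best_len:
                best_len, best_src, best_tmp = k, j, tmp
        cost[i:i + best_len] = best_tmp
        cost[i + best_len] = 0  # the explicit character closing the phrase
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases, cost


def decode(phrases):
    """Rebuild the text, copying one character at a time to allow overlaps."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)
```

With `c = 0` no copying is allowed, so every phrase is a single explicit character; larger values of `c` allow longer copy chains and therefore fewer phrases, at the price of a higher worst-case access cost.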

