BAT-LZ out of hell

Authors: Zsuzsanna Lipták, Francesco Masillo, Gonzalo Navarro




File

LIPIcs.CPM.2024.21.pdf
  • Filesize: 0.9 MB
  • 17 pages

Author Details

Zsuzsanna Lipták
  • Dipartimento di Informatica, University of Verona, Italy
Francesco Masillo
  • Dipartimento di Informatica, University of Verona, Italy
Gonzalo Navarro
  • Center for Biotechnology and Bioengineering (CeBiB), Department of Computer Science, University of Chile, Chile

Cite As

Zsuzsanna Lipták, Francesco Masillo, and Gonzalo Navarro. BAT-LZ out of hell. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 21:1-21:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.CPM.2024.21

Abstract

Despite consistently yielding the best compression on repetitive text collections, the Lempel-Ziv parsing has resisted all attempts at offering relevant guarantees on the cost to access an arbitrary symbol. This makes it less attractive for use on compressed self-indexes and other compressed data structures. In this paper we introduce a variant we call BAT-LZ (for Bounded Access Time Lempel-Ziv) where the access cost is bounded by a parameter given at compression time. We design and implement a linear-space algorithm that, in time O(n log³ n), obtains a BAT-LZ parse of a text of length n by greedily maximizing each next phrase length. The algorithm builds on a new linear-space data structure that solves 5-sided orthogonal range queries in rank space, allowing updates to the coordinate where the one-sided queries are supported, in O(log³ n) time for both queries and updates. This time can be reduced to O(log² n) if O(n log n) space is used. We design a second algorithm that chooses the sources for the phrases in a clever way, using an enhanced suffix tree, albeit no longer guaranteeing longest possible phrases. This algorithm is much slower in theory, but in practice it is comparable to the greedy parser, while achieving significantly superior compression. We then combine the two algorithms, resulting in a parser that always chooses the longest possible phrases, and the best sources for those. Our experimentation shows that, on most repetitive texts, our algorithms reach an access cost close to log₂ n on texts of length n, while incurring almost no loss in the compression ratio when compared with classical LZ-compression. Several open challenges are discussed at the end of the paper.
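To make the notion of a bounded access cost concrete, here is a minimal brute-force sketch in Python. It is not the algorithm from the paper: it runs in polynomial time rather than O(n log³ n), and it rests on assumptions that the abstract does not state. Specifically, it assumes that each phrase is a triple (source, length, explicit character), that a copied position costs one more than the position it copies from, that explicit characters cost 0, and that a source may overlap the phrase being formed. The names `greedy_bat_lz`, `decode`, and `cost` are hypothetical.

```python
def greedy_bat_lz(text, c):
    """Naive greedy parse whose per-position access cost never exceeds c.

    Returns (phrases, cost). Each phrase is (source, length, char); source is
    -1 when length is 0. cost[j] is the number of copy hops needed to reach
    an explicit character starting from position j (assumed cost model).
    """
    n = len(text)
    cost = [0] * n
    phrases = []
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        for s in range(i):
            length = 0
            # Extend while characters match, the source cost is below c, and
            # at least one position remains for the explicit character.
            while (i + length < n - 1
                   and text[s + length] == text[i + length]
                   and cost[s + length] < c):
                # Tentative cost, needed when the source overlaps the phrase.
                cost[i + length] = cost[s + length] + 1
                length += 1
            if length > best_len:
                best_len, best_src = length, s
        # Fix the final costs for the chosen source, left to right.
        for k in range(best_len):
            cost[i + k] = cost[best_src + k] + 1
        end = i + best_len
        cost[end] = 0  # explicit character
        phrases.append((best_src, best_len, text[end]))
        i = end + 1
    return phrases, cost


def decode(phrases):
    """Rebuild the text from (source, length, char) phrases."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])  # char-by-char copy handles overlaps
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    sample = "abracadabra_abracadabra_abracadabra"
    parse, costs = greedy_bat_lz(sample, 2)
    print(len(parse), max(costs), decode(parse) == sample)
```

Under these assumptions, a smaller bound `c` forces the parser to cut phrases earlier or pick other sources, which is the compression-versus-access-time trade-off the abstract describes. With `c = 0` no copying is possible, so every phrase is a single explicit character.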

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures design and analysis
Keywords
  • Lempel-Ziv parsing
  • data compression
  • compressed data structures
  • repetitive text collections
