Fast and Space-Efficient Construction of AVL Grammars from the LZ77 Parsing

Authors Dominik Kempa , Ben Langmead



PDF
Thumbnail PDF

File

LIPIcs.ESA.2021.56.pdf
  • Filesize: 0.79 MB
  • 14 pages

Document Identifiers

Author Details

Dominik Kempa
  • Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Ben Langmead
  • Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA

Acknowledgements

We thank Simon J. Puglisi for providing us with the kernel data set at short notice.

Cite As Get BibTex

Dominik Kempa and Ben Langmead. Fast and Space-Efficient Construction of AVL Grammars from the LZ77 Parsing. In 29th Annual European Symposium on Algorithms (ESA 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 204, pp. 56:1-56:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/LIPIcs.ESA.2021.56

Abstract

Grammar compression is, next to Lempel-Ziv (LZ77) and run-length Burrows-Wheeler transform (RLBWT), one of the most flexible approaches to representing and processing highly compressible strings. The main idea is to represent a text as a context-free grammar whose language is precisely the input string. This is called a straight-line grammar (SLG). An AVL grammar, proposed by Rytter [Theor. Comput. Sci., 2003] is a type of SLG that additionally satisfies the AVL property: the heights of parse trees for children of every nonterminal differ by at most one. In contrast to other SLG constructions, AVL grammars can be constructed from the LZ77 parsing in compressed time: 𝒪(z log n) where z is the size of the LZ77 parsing and n is the length of the input text. Despite these advantages, AVL grammars are thought to be too large to be practical.
We present a new technique for rapidly constructing a small AVL grammar from an LZ77 or LZ77-like parse. Our algorithm produces grammars that are always at least five times smaller than those produced by the original algorithm, and usually not more than double the size of grammars produced by the practical Re-Pair compressor [Larsson and Moffat, Proc. IEEE, 2000]. Our algorithm also achieves low peak RAM usage. By combining this algorithm with recent advances in approximating the LZ77 parsing, we show that our method has the potential to construct a run-length BWT in about one third of the time and peak RAM required by other approaches. Overall, we show that AVL grammars are surprisingly practical, opening the door to much faster construction of key compressed data structures.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Theory of computation → Pattern matching
Keywords
  • grammar compression
  • straight-line program
  • SLP
  • AVL grammar
  • Lempel-Ziv compression
  • LZ77
  • dictionary compression

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Djamal Belazzougui, Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Alberto Ordóñez Pereira, Simon J. Puglisi, and Yasuo Tabei. Queries on LZ-bounded encodings. In DCC, pages 83-92. IEEE, 2015. URL: https://doi.org/10.1109/DCC.2015.69.
  2. Timo Bingmann, Johannes Fischer, and Vitaly Osipov. Inducing suffix and LCP arrays in external memory. In ALENEX, pages 88-102. SIAM, 2013. URL: https://doi.org/10.1137/1.9781611972931.8.
  3. Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, and Massimiliano Rossi. PHONI: Streamed matching statistics with multi-genome references. In DCC, pages 193-202. IEEE, 2021. URL: https://doi.org/10.1109/DCC50243.2021.00027.
  4. Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms Mol. Biol., 14(1):13:1-13:15, 2019. URL: https://doi.org/10.1186/s13015-019-0148-5.
  5. Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, 1994. Google Scholar
  6. Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. Trans. Inf. Theory, 51(7):2554-2576, 2005. URL: https://doi.org/10.1109/TIT.2005.850116.
  7. Francisco Claude and Gonzalo Navarro. Self-indexed grammar-based compression. Fundam. Informaticae, 111(3):313-337, 2011. URL: https://doi.org/10.3233/FI-2011-565.
  8. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68-74, 2015. URL: https://doi.org/10.1038/nature15393.
  9. Martin Dietzfelbinger, Joseph Gil, Yossi Matias, and Nicholas Pippenger. Polynomial hash functions are reliable. In ICALP, pages 235-246, 1992. URL: https://doi.org/10.1007/3-540-55719-9_77.
  10. Paolo Ferragina, Travis Gagie, and Giovanni Manzini. Lightweight data indexing and compression in external memory. Algorithmica, 63(3):707-730, 2012. Google Scholar
  11. Isamu Furuya, Takuya Takagi, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Takuya Kida. MR-RePair: Grammar compression based on maximal repeats. In DCC, pages 508-517. IEEE, 2019. URL: https://doi.org/10.1109/DCC.2019.00059.
  12. Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi. LZ77-based self-indexing with faster pattern matching. In LATIN, pages 731-742. Springer, 2014. URL: https://doi.org/10.1007/978-3-642-54423-1_63.
  13. Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi. A faster grammar-based self-index. In LATA, pages 240-251. Springer, 2012. URL: https://doi.org/10.1007/978-3-642-28332-1_21.
  14. Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, and Yoshimasa Takabatake. Rpair: Rescaling RePair with Rsync. In SPIRE, pages 35-44, 2019. URL: https://doi.org/10.1007/978-3-030-32686-9_3.
  15. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. On the approximation ratio of Lempel-Ziv parsing. In LATIN, pages 490-503. Springer, 2018. URL: https://doi.org/10.1007/978-3-319-77404-6_36.
  16. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):1-54, April 2020. URL: https://doi.org/10.1145/3375890.
  17. Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Lempel-Ziv parsing in external memory. In DCC, pages 153-162. IEEE, 2014. URL: https://doi.org/10.1109/DCC.2014.78.
  18. Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Parallel external memory suffix sorting. In CPM, pages 329-342. Springer, 2015. URL: https://doi.org/10.1007/978-3-319-19929-0_28.
  19. Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249-260, 1987. URL: https://doi.org/10.1147/rd.312.0249.
  20. Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler Transform conjecture. In FOCS, pages 1002-1013. IEEE, 2020. Full version: https://arxiv.org/abs/1910.10631. URL: https://doi.org/10.1109/FOCS46700.2020.00097.
  21. Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: String attractors. In STOC, pages 827-840. ACM, 2018. URL: https://doi.org/10.1145/3188745.3188814.
  22. Donald E. Knuth. The Art of Computing, Vol. III, 2nd Ed. Addison-Wesley, 1998. Google Scholar
  23. Dominik Köppl, Tomohiro I, Isamu Furuya, Yoshimasa Takabatake, Kensuke Sakai, and Keisuke Goto. Re-Pair in small space. Algorithms, 14(1):5, 2021. URL: https://doi.org/10.3390/a14010005.
  24. Dmitry Kosolobov, Daniel Valenzuela, Gonzalo Navarro, and Simon J. Puglisi. Lempel-Ziv-like parsing in small space. Algorithmica, 82(11):3195-3215, 2020. URL: https://doi.org/10.1007/s00453-020-00722-6.
  25. N. Jesper Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722-1732, 2000. Google Scholar
  26. Gonzalo Navarro. Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Comput. Surv., 54(2):29:1-29:31, 2021. URL: https://doi.org/10.1145/3434399.
  27. Gonzalo Navarro. Indexing highly repetitive string collections, part II: Compressed indexes. ACM Comput. Surv., 54(2):26:1-26:32, 2021. URL: https://doi.org/10.1145/3432999.
  28. Tatsuya Ohno, Kensuke Sakai, Yoshimasa Takabatake, Tomohiro I, and Hiroshi Sakamoto. A faster implementation of online RLBWT and its application to LZ77 parsing. J. Discrete Alg., 52-53:18-28, 2018. URL: https://doi.org/10.1016/j.jda.2018.11.002.
  29. Alberto Policriti and Nicola Prezza. LZ77 computation based on the run-length encoded BWT. Algorithmica, 80(7):1986-2011, 2018. URL: https://doi.org/10.1007/s00453-017-0327-z.
  30. Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci., 302(1-3):211-222, 2003. URL: https://doi.org/10.1016/S0304-3975(02)00777-6.
  31. Kensuke Sakai, Tatsuya Ohno, Keisuke Goto, Yoshimasa Takabatake, Tomohiro I, and Hiroshi Sakamoto. RePair in compressed space and time. In DCC, pages 518-527. IEEE, 2019. URL: https://doi.org/10.1109/DCC.2019.00060.
  32. Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. Trans. Inf. Theory, 23(3):337-343, 1977. URL: https://doi.org/10.1109/TIT.1977.1055714.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail