A Space-Optimal Grammar Compression

Authors Yoshimasa Takabatake, Tomohiro I, Hiroshi Sakamoto





Cite As

Yoshimasa Takabatake, Tomohiro I, and Hiroshi Sakamoto. A Space-Optimal Grammar Compression. In 25th Annual European Symposium on Algorithms (ESA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 87, pp. 67:1-67:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.ESA.2017.67

Abstract

A grammar compression is a context-free grammar (CFG) that deterministically derives a single string. For an input string of length N over an alphabet of size sigma, the smallest CFG is O(log N)-approximable in the offline setting and O(log N log^* N)-approximable in the online setting. In addition, an information-theoretic lower bound for representing a CFG in Chomsky normal form with n variables is log (n!/n^sigma) + n + o(n) bits. Although there is an online grammar compression algorithm that directly computes the succinct encoding of its output CFG with an O(log N log^* N) approximation guarantee, the problem of optimizing its working space has remained open. We propose a fully-online algorithm whose working space, measured in bits, is asymptotically equal to this lower bound and whose compression time is O(N log log n). In addition, we propose several techniques to boost grammar compression and demonstrate their efficiency through computational experiments.
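
As an illustration of the two notions the abstract relies on, the following minimal sketch shows a toy CFG in Chomsky normal form that derives exactly one string, together with the leading terms of the lower bound. It is not the authors' algorithm or encoding: the names `expand` and `lower_bound_bits`, the example rules, and the choice of base-2 logarithms are assumptions made for illustration, and the o(n) term is simply dropped.

```python
import math


def expand(rules, symbol):
    """Derive the unique string generated from `symbol`.

    `rules` maps each variable to a pair of symbols (Chomsky normal form).
    Any symbol that is not a key of `rules` is treated as a terminal.
    """
    out = []
    stack = [symbol]
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            # Push right first so that left is expanded first.
            stack.append(right)
            stack.append(left)
        else:
            out.append(s)
    return "".join(out)


def lower_bound_bits(n, sigma):
    """Leading terms log2(n! / n^sigma) + n of the bound (o(n) omitted).

    Assumption: logarithms are base 2, as is usual for bit counts.
    """
    log2_factorial = math.lgamma(n + 1) / math.log(2)
    return log2_factorial - sigma * math.log2(n) + n


# Hypothetical toy grammar with n = 3 variables over the alphabet {a, b}.
rules = {
    "X1": ("a", "b"),
    "X2": ("X1", "X1"),
    "X3": ("X2", "X1"),
}
text = expand(rules, "X3")  # derives "ababab"
```

For realistic inputs, `lower_bound_bits(n, sigma)` is dominated by the log n! term, that is, roughly n log n bits for n variables.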

Subject Classification

Keywords
  • Grammar compression
  • fully-online algorithm
  • succinct data structure

