Computing NP-Hard Repetitiveness Measures via MAX-SAT

Bannai, Hideo; Goto, Keisuke; Ishihata, Masakazu; Kanda, Shunsuke; Köppl, Dominik; Nishimoto, Takaaki

doi:10.4230/LIPIcs.ESA.2022.12

Abstract

Repetitiveness measures reveal profound characteristics of datasets, and give rise to compressed data structures and algorithms working in compressed space. Alas, the computation of some of these measures is NP-hard, and straight-forward computation is infeasible for datasets of even small sizes. Three such measures are the smallest size of a string attractor, the smallest size of a bidirectional macro scheme, and the smallest size of a straight-line program. While a vast variety of implementations for heuristically computing approximations exist, exact computation of these measures has received little to no attention. In this paper, we present MAX-SAT formulations that provide the first non-trivial implementations for exact computation of smallest string attractors, smallest bidirectional macro schemes, and smallest straight-line programs. Computational experiments show that our implementations work for texts of length up to a few hundred for straight-line programs and bidirectional macro schemes, and texts even over a million for string attractors.

Tooru Akagi, Mitsuru Funakoshi, and Shunsuke Inenaga. Sensitivity of string compressors and repetitiveness measures. CoRR, abs/2107.08615, 2021. URL: http://arxiv.org/abs/2107.08615.
Hideo Bannai, Mitsuru Funakoshi, Tomohiro I, Dominik Köppl, Takuya Mieno, and Takaaki Nishimoto. A separation of γ and b via Thue-Morse words. In Proc. SPIRE, volume 12944, pages 167-178, 2021. URL: https://doi.org/10.1007/978-3-030-86692-1_14.
Armin Biere, Marijn Heule, and Hans van Maaren. Handbook of satisfiability, volume 185. IOS press, 2009.
Philip Bille, Travis Gagie, Inge Li Gørtz, and Nicola Prezza. A separation between RLSLPs and LZ77. J. Discrete Algorithms, 50:36-39, 2018. URL: https://doi.org/10.1016/j.jda.2018.09.002.
Anselm Blumer, J. Blumer, David Haussler, Ross M. McConnell, and Andrzej Ehrenfeucht. Complete inverted files for efficient text retrieval and analysis. J. ACM, 34(3):578-595, 1987. URL: https://doi.org/10.1145/28869.28873.
Katrin Casel, Henning Fernau, Serge Gaspers, Benjamin Gras, and Markus L. Schmid. On the complexity of the smallest grammar problem over fixed alphabets. Theory Comput. Syst., 65(2):344-409, 2021. URL: https://doi.org/10.1007/s00224-020-10013-w.
Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms, 17(1):8:1-8:39, 2021. URL: https://doi.org/10.1145/3426473.
Maxime Crochemore and Lucian Ilie. Computing longest previous factor in linear time and applications. Inf. Process. Lett., 106(2):75-80, 2008. URL: https://doi.org/10.1016/j.ipl.2007.10.006.
Patrick Dinklage, Jonas Ellert, Johannes Fischer, Dominik Köppl, and Manuel Penschuck. Bidirectional text compression in external memory. In Proc. ESA, pages 41:1-41:16, 2019. URL: https://doi.org/10.4230/LIPIcs.ESA.2019.41.
Patrick Dinklage, Johannes Fischer, Dominik Köppl, Marvin Löbel, and Kunihiko Sadakane. Compression with the tudocomp framework. In Proc. SEA, volume 75 of LIPIcs, pages 13:1-13:22, 2017.
Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. LZD factorization: Simple and practical online grammar compression with variable-to-fixed encoding. In Proc. CPM, volume 9133, pages 219-230, 2015. URL: https://doi.org/10.1007/978-3-319-19929-0_19.
OEIS Foundation Inc. Maximum, over all binary strings w of length n, of the size of the smallest string attractor for w, entry A339391 in the on-line encyclopedia of integer sequences. Accessed: 2022-04-13. URL: https://oeis.org/A339391.
Marek Karpinski, Wojciech Rytter, and Ayumi Shinohara. An efficient pattern-matching algorithm for strings with short descriptions. Nord. J. Comput., 4(2):172-186, 1997.
Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. In Sandy Irani, editor, Proc. FOCS, pages 1002-1013. IEEE, 2020. URL: https://doi.org/10.1109/FOCS46700.2020.00097.
Dominik Kempa, Alberto Policriti, Nicola Prezza, and Eva Rotenberg. String attractors: Verification and optimization. In Proc. ESA, pages 52:1-52:13, 2018. URL: https://doi.org/10.4230/LIPIcs.ESA.2018.52.
Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In Proc. STOC, pages 827-840. ACM, 2018. URL: https://doi.org/10.1145/3188745.3188814.
Dominik Kempa and Barna Saha. An upper bound and linear-space queries on the lz-end parsing. In Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA 2022, Virtual Conference / Alexandria, VA, USA, January 9 - 12, 2022, pages 2847-2866. SIAM, 2022. URL: https://doi.org/10.1137/1.9781611977073.111.
Donald E. Knuth. The Art of Computer Programming, Volume 4, Fascicle 6: Satisfiability. Addison-Wesley Professional, 1st edition, 2015.
Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Towards a definitive measure of repetitiveness. In Proc. LATIN, pages 207-219, 2020.
Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theor. Comput. Sci., 483:115-133, 2013. URL: https://doi.org/10.1016/j.tcs.2012.02.006.
Kanaru Kutsukake, Takuya Matsumoto, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. On repetitiveness measures of Thue-Morse words. In Proc. SPIRE, pages 213-220, 2020. URL: https://doi.org/10.1007/978-3-030-59212-7_15.
N. Jesper Larsson and Alistair Moffat. Offline dictionary-based compression. In Proc. DCC, pages 296-305, 1999. URL: https://doi.org/10.1109/DCC.1999.755679.
Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Transactions on information theory, 22(1):75-81, 1976.
Ming Li and Paul M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications, 4th Edition. Texts in Computer Science. Springer, 2019. URL: https://doi.org/10.1007/978-3-030-11298-1.
Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. Nord. J. Comput., 12(1):40-66, 2005.
Sabrina Mantaci, Antonio Restivo, Giuseppe Romana, Giovanna Rosone, and Marinella Sciortino. A combinatorial view on string attractors. Theor. Comput. Sci., 850:236-248, 2021. URL: https://doi.org/10.1016/j.tcs.2020.11.006.
Sabrina Mantaci, Antonio Restivo, and Marinella Sciortino. Burrows-Wheeler transform and sturmian words. Inf. Process. Lett., 86(5):241-246, 2003. URL: https://doi.org/10.1016/S0020-0190(02)00512-4.
Takuya Mieno, Shunsuke Inenaga, and Takashi Horiyama. Repair grammars are the smallest grammars for fibonacci words. CoRR, abs/2202.08447, 2022. URL: http://arxiv.org/abs/2202.08447.
Kazuyuki Narisawa, Hideharu Hiratsuka, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Efficient computation of substring equivalence classes with suffix arrays. Algorithmica, 79(2):291-318, 2017. URL: https://doi.org/10.1007/s00453-016-0178-z.
Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv., 54(2):29:1-29:31, 2021. URL: https://doi.org/10.1145/3434399.
Gonzalo Navarro. Indexing highly repetitive string collections, part II: compressed indexes. ACM Comput. Surv., 54(2):26:1-26:32, 2021. URL: https://doi.org/10.1145/3432999.
Gonzalo Navarro, Carlos Ochoa, and Nicola Prezza. On the approximation ratio of ordered parsings. IEEE Transactions on Information Theory, 67(2):1008-1026, 2020.
Gonzalo Navarro and Nicola Prezza. Universal compressed text indexing. Theor. Comput. Sci., 762:41-50, 2019. URL: https://doi.org/10.1016/j.tcs.2018.09.007.
Craig G. Nevill-Manning and Ian H. Witten. Identifying hierarchical structure in sequences: A linear-time algorithm. J. Artif. Intell. Res., 7:67-82, 1997. URL: https://doi.org/10.1613/jair.374.
Takaaki Nishimoto and Yasuo Tabei. LZRR: LZ77 parsing with right reference. Information and Computation, page 104859, 2021.
Luís M. S. Russo, Ana Sofia D. Correia, Gonzalo Navarro, and Alexandre P. Francisco. Approximating optimal bidirectional macro schemes. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, Data Compression Conference, DCC 2020, Snowbird, UT, USA, March 24-27, 2020, pages 153-162. IEEE, 2020. URL: https://doi.org/10.1109/DCC47342.2020.00023.
Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci., 302(1-3):211-222, 2003. URL: https://doi.org/10.1016/S0304-3975(02)00777-6.
Hiroshi Sakamoto, Takuya Kida, and Shinichi Shimozono. A space-saving linear-time algorithm for grammar-based compression. In Proc. SPIRE, volume 3246, pages 218-229, 2004. URL: https://doi.org/10.1007/978-3-540-30213-1_33.
Hiroshi Sakamoto, Shinichi Shimozono, Ayumi Shinohara, and Masayuki Takeda. On the minimization problem of text compression scheme by a reduced grammar transform. Technical Report 195, Department of Informatics, 2001. URL: https://catalog.lib.kyushu-u.ac.jp/opac_download_md/3045/trcs195.pdf.
Luke Schaeffer and Jeffrey Shallit. String attractors for automatic sequences. CoRR, abs/2012.06840, 2020. URL: http://arxiv.org/abs/2012.06840.
Carsten Sinz. Towards an optimal CNF encoding of boolean cardinality constraints. In Peter van Beek, editor, Principles and Practice of Constraint Programming - CP 2005, 11th International Conference, CP 2005, Sitges, Spain, October 1-5, 2005, Proceedings, volume 3709 of Lecture Notes in Computer Science, pages 827-831. Springer, 2005. URL: https://doi.org/10.1007/11564751_73.
James A Storer and Thomas G Szymanski. Data compression via textual substitution. Journal of the ACM (JACM), 29(4):928-951, 1982.
Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Information Theory, 23(3):337-343, 1977.
Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory, 24(5):530-536, 1978. URL: https://doi.org/10.1109/TIT.1978.1055934.

Computing NP-Hard Repetitiveness Measures via MAX-SAT

Authors Hideo Bannai , Keisuke Goto , Masakazu Ishihata, Shunsuke Kanda , Dominik Köppl , Takaaki Nishimoto

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

Computing NP-Hard Repetitiveness Measures via MAX-SAT

Authors Hideo Bannai , Keisuke Goto , Masakazu Ishihata, Shunsuke Kanda , Dominik Köppl , Takaaki Nishimoto

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

Supplementary Materials

References

Thanks for your feedback!

Could not send message