String Attractors: Verification and Optimization

Kempa, Dominik; Policriti, Alberto; Prezza, Nicola; Rotenberg, Eva

doi:10.4230/LIPIcs.ESA.2018.52

File

Author Details

Dominik Kempa

Department of Computer Science, University of Helsinki, Finland

Alberto Policriti

Department of Computer Science, University of Udine, Italy

Nicola Prezza

Department of Computer Science, University of Pisa, Italy

Eva Rotenberg

DTU Compute, Technical University of Denmark, Denmark

Cite As Get BibTex

Dominik Kempa, Alberto Policriti, Nicola Prezza, and Eva Rotenberg. String Attractors: Verification and Optimization. In 26th Annual European Symposium on Algorithms (ESA 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 112, pp. 52:1-52:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018) https://doi.org/10.4230/LIPIcs.ESA.2018.52

Abstract

String attractors [STOC 2018] are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. A set Gamma subseteq [1..n] is a k-attractor for a string S in Sigma^n if and only if every distinct substring of S of length at most k has an occurrence crossing at least one of the positions in Gamma. Finding the smallest k-attractor is NP-hard for k >= 3, but polylogarithmic approximations can be found using reductions from dictionary compressors. It is easy to reduce the k-attractor problem to a set-cover instance where the string's positions are interpreted as sets of substrings. The main result of this paper is a much more powerful reduction based on the truncated suffix tree. Our new characterization of the problem leads to more efficient algorithms for string attractors: we show how to check the validity and minimality of a k-attractor in near-optimal time and how to quickly compute exact solutions. For example, we prove that a minimum 3-attractor can be found in O(n) time when |Sigma| in O(sqrt[3+epsilon]{log n}) for some constant epsilon > 0, despite the problem being NP-hard for large Sigma.

Subject Classification

ACM Subject Classification

Theory of computation → Data compression

Keywords

Dictionary compression
String attractors
Set cover

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

Djamal Belazzougui. Linear time construction of compressed text indices in compact space. In Annual Symposium on Theory of Computing (STOC), pages 148-193. ACM, 2014.
Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, and Veli Mäkinen. Linear-time string indexing and analysis in small space. arXiv preprint 1609.06378, 2016.
Anselm Blumer, Janet Blumer, David Haussler, Ross McConnell, and Andrzej Ehrenfeucht. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM, 34(3):578-595, 1987.
Michael Burrows and David J. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
David Clark. Compact Pat trees. PhD thesis, University of Waterloo, 1998.
Maxime Crochemore and Renaud Vérin. Direct construction of compact directed acyclic word graphs. In Combinatorial Pattern Matching (CPM), pages 116-129. Springer, 1997.
Martin Farach. Optimal suffix tree construction with large alphabets. In Annual Symposium on Foundations of Computer Science (FOCS), pages 137-143. IEEE, 1997.
Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005.
Johannes Fischer and Volker Heun. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM Journal on Computing, 40(2):465-492, 2011.
Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Annual Symposium on Discrete Algorithms (SODA), pages 841-850. SIAM, 2003.
Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378-407, 2005.
Prosenjit Gupta, Ravi Janardan, and Michiel Smid. Further results on generalized intersection searching problems: counting, reporting, and dynamization. Journal of Algorithms, 19(2):282-317, 1995.
Guy Joseph Jacobson. Succinct static data structures. PhD thesis, Carnegie Mellon University, 1988.
Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. Journal of the ACM, 53(6):918-936, 2006.
Dominik Kempa, Alberto Policriti, Nicola Prezza, and Eva Rotenberg. String attractors: Verification and optimization. arXiv, 2018. URL: http://arxiv.org/abs/1803.01695.
Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: String attractors. In Annual Symposium on Theory of Computing (STOC), pages 827-840. ACM, 2018.
John C. Kieffer and En-Hui Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737-754, 2000.
Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993.
Silvio Micali and Vijay V. Vazirani. An O(√|V|⋅ |E|) algorithm for finding maximum matching in general graphs. In Annual Symposium on Foundations of Computer Science (SFCS), pages 17-27. IEEE, 1980.
J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Space-efficient construction of compressed indexes in deterministic linear time. In Annual Symposium on Discrete Algorithms (SODA), pages 408-424. SIAM, 2017.
Gonzalo Navarro. Wavelet trees for all. Journal of Discrete Algorithms, 25:2-20, 2014.
Gonzalo Navarro. Compact data structures: A practical approach. Cambridge University Press, 2016.
James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. Journal of the ACM, 29(4):928-951, 1982.
Peter Weiner. Linear pattern matching algorithms. In Annual Symposium on Switching and Automata Theory (SWAT/FOCS), pages 1-11. IEEE, 1973.

String Attractors: Verification and Optimization

Authors Dominik Kempa , Alberto Policriti , Nicola Prezza , Eva Rotenberg

File

Document Identifiers

Author Details

Cite As Get BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message

String Attractors: Verification and Optimization

Authors Dominik Kempa , Alberto Policriti , Nicola Prezza , Eva Rotenberg

File

Document Identifiers

Author Details

Cite As Get BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Related Versions

References

Thanks for your feedback!

Could not send message