String Attractors: Verification and Optimization

Authors Dominik Kempa , Alberto Policriti , Nicola Prezza , Eva Rotenberg



PDF
Thumbnail PDF

File

LIPIcs.ESA.2018.52.pdf
  • Filesize: 445 kB
  • 13 pages

Document Identifiers

Author Details

Dominik Kempa
  • Department of Computer Science, University of Helsinki, Finland
Alberto Policriti
  • Department of Computer Science, University of Udine, Italy
Nicola Prezza
  • Department of Computer Science, University of Pisa, Italy
Eva Rotenberg
  • DTU Compute, Technical University of Denmark, Denmark

Cite As Get BibTex

Dominik Kempa, Alberto Policriti, Nicola Prezza, and Eva Rotenberg. String Attractors: Verification and Optimization. In 26th Annual European Symposium on Algorithms (ESA 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 112, pp. 52:1-52:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018) https://doi.org/10.4230/LIPIcs.ESA.2018.52

Abstract

String attractors [STOC 2018] are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. A set Gamma subseteq [1..n] is a k-attractor for a string S in Sigma^n if and only if every distinct substring of S of length at most k has an occurrence crossing at least one of the positions in Gamma. Finding the smallest k-attractor is NP-hard for k >= 3, but polylogarithmic approximations can be found using reductions from dictionary compressors. It is easy to reduce the k-attractor problem to a set-cover instance where the string's positions are interpreted as sets of substrings. The main result of this paper is a much more powerful reduction based on the truncated suffix tree. Our new characterization of the problem leads to more efficient algorithms for string attractors: we show how to check the validity and minimality of a k-attractor in near-optimal time and how to quickly compute exact solutions. For example, we prove that a minimum 3-attractor can be found in O(n) time when |Sigma| in O(sqrt[3+epsilon]{log n}) for some constant epsilon > 0, despite the problem being NP-hard for large Sigma.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
Keywords
  • Dictionary compression
  • String attractors
  • Set cover

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Djamal Belazzougui. Linear time construction of compressed text indices in compact space. In Annual Symposium on Theory of Computing (STOC), pages 148-193. ACM, 2014. Google Scholar
  2. Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, and Veli Mäkinen. Linear-time string indexing and analysis in small space. arXiv preprint 1609.06378, 2016. Google Scholar
  3. Anselm Blumer, Janet Blumer, David Haussler, Ross McConnell, and Andrzej Ehrenfeucht. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM, 34(3):578-595, 1987. Google Scholar
  4. Michael Burrows and David J. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994. Google Scholar
  5. David Clark. Compact Pat trees. PhD thesis, University of Waterloo, 1998. Google Scholar
  6. Maxime Crochemore and Renaud Vérin. Direct construction of compact directed acyclic word graphs. In Combinatorial Pattern Matching (CPM), pages 116-129. Springer, 1997. Google Scholar
  7. Martin Farach. Optimal suffix tree construction with large alphabets. In Annual Symposium on Foundations of Computer Science (FOCS), pages 137-143. IEEE, 1997. Google Scholar
  8. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005. Google Scholar
  9. Johannes Fischer and Volker Heun. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM Journal on Computing, 40(2):465-492, 2011. Google Scholar
  10. Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Annual Symposium on Discrete Algorithms (SODA), pages 841-850. SIAM, 2003. Google Scholar
  11. Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378-407, 2005. Google Scholar
  12. Prosenjit Gupta, Ravi Janardan, and Michiel Smid. Further results on generalized intersection searching problems: counting, reporting, and dynamization. Journal of Algorithms, 19(2):282-317, 1995. Google Scholar
  13. Guy Joseph Jacobson. Succinct static data structures. PhD thesis, Carnegie Mellon University, 1988. Google Scholar
  14. Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. Journal of the ACM, 53(6):918-936, 2006. Google Scholar
  15. Dominik Kempa, Alberto Policriti, Nicola Prezza, and Eva Rotenberg. String attractors: Verification and optimization. arXiv, 2018. URL: http://arxiv.org/abs/1803.01695.
  16. Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: String attractors. In Annual Symposium on Theory of Computing (STOC), pages 827-840. ACM, 2018. Google Scholar
  17. John C. Kieffer and En-Hui Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737-754, 2000. Google Scholar
  18. Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993. Google Scholar
  19. Silvio Micali and Vijay V. Vazirani. An O(√|V|⋅ |E|) algorithm for finding maximum matching in general graphs. In Annual Symposium on Foundations of Computer Science (SFCS), pages 17-27. IEEE, 1980. Google Scholar
  20. J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Space-efficient construction of compressed indexes in deterministic linear time. In Annual Symposium on Discrete Algorithms (SODA), pages 408-424. SIAM, 2017. Google Scholar
  21. Gonzalo Navarro. Wavelet trees for all. Journal of Discrete Algorithms, 25:2-20, 2014. Google Scholar
  22. Gonzalo Navarro. Compact data structures: A practical approach. Cambridge University Press, 2016. Google Scholar
  23. James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. Journal of the ACM, 29(4):928-951, 1982. Google Scholar
  24. Peter Weiner. Linear pattern matching algorithms. In Annual Symposium on Switching and Automata Theory (SWAT/FOCS), pages 1-11. IEEE, 1973. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail