Substring Complexity in Sublinear Space

Bernardini, Giulia; Fici, Gabriele; Gawrychowski, Paweł; Pissis, Solon P.

doi:10.4230/LIPIcs.ISAAC.2023.12

Abstract

Shannon’s entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel–Ziv parse or the number r of equal-letter runs of the Burrows–Wheeler transform. A more recent one is the size γ of a smallest string attractor. Let T be a string of length n. A string attractor of T is a set of positions of T capturing the occurrences of all the substrings of T. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ is NP-hard.

Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function S_T(k) counting the number of distinct substrings of length k of T, also known as the substring complexity of T. This new measure is defined as δ = sup{S_T(k)/k : k ≥ 1} and lower bounds all the relevant ad hoc measures previously considered. In particular, δ ≤ γ always holds and δ can be computed in 𝒪(n) time using Θ(n) working space. Kociumaka et al. showed that one can construct an 𝒪(δ log(n/δ))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T.

Given that for highly compressible strings, δ is significantly smaller than n, it is natural to pose the following question: Can we compute δ efficiently using sublinear working space? It is straightforward to show that in the comparison model, any algorithm computing δ using 𝒪(b) space requires Ω(n^{2-o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We thus wanted to investigate whether we can indeed match this lower bound. We address this algorithmic challenge by showing the following bounds to compute δ:

- 𝒪((n³ log b)/b²) time using 𝒪(b) space, for any b ∈ [1,n], in the comparison model.
- 𝒪̃(n²/b) time using 𝒪̃(b) space, for any b ∈ [√n,n], in the word RAM model.

This gives an 𝒪̃(n^{1+ε})-time and 𝒪̃(n^{1-ε})-space algorithm to compute δ, for any 0 < ε ≤ 1/2. Let us remark that our algorithms compute S_T(k), for all k, within the same complexities.
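To make the definitions of S_T(k) and δ concrete, here is a minimal Python sketch that evaluates them directly from the definition. It is a naive baseline written for illustration only: it is not one of the sublinear-space algorithms announced in the abstract, and it may store up to Θ(n²) characters in the worst case. The function names `substring_complexity` and `delta` are hypothetical and chosen for this example. Since S_T(k) = 0 for k > n, the supremum is attained for some k in [1, n], so the sketch only scans that range.

```python
def substring_complexity(text: str, k: int) -> int:
    """Return S_T(k): the number of distinct length-k substrings of text.

    Naive approach: collect every length-k window in a set.
    """
    if k < 1 or k > len(text):
        return 0
    return len({text[i:i + k] for i in range(len(text) - k + 1)})


def delta(text: str) -> float:
    """Return delta = max over k in [1, n] of S_T(k) / k (0.0 for the empty string)."""
    n = len(text)
    return max(
        (substring_complexity(text, k) / k for k in range(1, n + 1)),
        default=0.0,
    )


if __name__ == "__main__":
    for t in ["aaaa", "abab", "banana", "abcd"]:
        print(t, [substring_complexity(t, k) for k in range(1, len(t) + 1)], delta(t))
```

For example, for T = abab the distinct substrings of lengths 1, 2, 3, 4 number 2, 2, 2, 1, so the ratios S_T(k)/k are 2, 1, 2/3, 1/4 and δ = 2.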

References

Amihood Amir, Gad M. Landau, Moshe Lewenstein, and Dina Sokol. Dynamic text and static pattern matching. ACM Trans. Algorithms, 3(2):19, 2007. URL: https://doi.org/10.1145/1240233.1240242.
Alberto Apostolico, Maxime Crochemore, Martin Farach-Colton, Zvi Galil, and S. Muthukrishnan. 40 years of suffix trees. Commun. ACM, 59(4):66-73, 2016. URL: https://doi.org/10.1145/2810036.
Lorraine A. K. Ayad, Golnaz Badkobeh, Gabriele Fici, Alice Héliou, and Solon P. Pissis. Constructing antidictionaries in output-sensitive space. In 29th Data Compression Conference (DCC), pages 538-547, 2019. URL: https://doi.org/10.1109/DCC.2019.00062.
Paul Beame. A general sequential time-space tradeoff for finding unique elements. SIAM J. Comput., 20(2):270-277, 1991. URL: https://doi.org/10.1137/0220017.
Paul Beame, Raphaël Clifford, and Widad Machmouchi. Element distinctness, frequency moments, and sliding windows. In 54th Symposium on Foundations of Computer Science (FOCS), pages 290-299, 2013. URL: https://doi.org/10.1109/FOCS.2013.39.
Djamal Belazzougui. Linear time construction of compressed text indices in compact space. In 46th Symposium on Theory of Computing (STOC), pages 148-193, 2014. URL: https://doi.org/10.1145/2591796.2591885.
Michael A. Bender and Martin Farach-Colton. The LCA problem revisited. In 4th Latin American Symposium (LATIN), pages 88-94, 2000. URL: https://doi.org/10.1007/10719839_9.
Giulia Bernardini, Pawel Gawrychowski, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Even faster elastic-degenerate string matching via fast matrix multiplication. In 46th International Colloquium on Automata, Languages, and Programming (ICALP), volume 132 of LIPIcs, pages 21:1-21:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. URL: https://doi.org/10.4230/LIPIcs.ICALP.2019.21.
Giulia Bernardini, Pawel Gawrychowski, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Elastic-degenerate string matching via fast matrix multiplication. SIAM J. Comput., 51(3):549-576, 2022. URL: https://doi.org/10.1137/20m1368033.
Or Birenzwige, Shay Golan, and Ely Porat. Locally consistent parsing for text indexing in small space. In 31st Symposium on Discrete Algorithms (SODA), pages 607-626. SIAM, 2020. URL: https://doi.org/10.1137/1.9781611975994.37.
Allan Borodin and Stephen A. Cook. A time-space tradeoff for sorting on a general sequential model of computation. SIAM J. Comput., 11(2):287-297, 1982. URL: https://doi.org/10.1137/0211022.
Dany Breslauer and Zvi Galil. Real-time streaming string-matching. ACM Trans. Algorithms, 10(4):22:1-22:12, 2014. URL: https://doi.org/10.1145/2635814.
Dany Breslauer, Roberto Grossi, and Filippo Mignosi. Simple real-time constant-space string matching. Theoret. Comput. Sci., 483:2-9, 2013. URL: https://doi.org/10.1016/j.tcs.2012.11.040.
Arturo Carpi and Aldo de Luca. Words and special factors. Theoret. Comput. Sci., 259(1-2):145-182, 2001. URL: https://doi.org/10.1016/S0304-3975(99)00334-5.
Timothy M. Chan, Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, and Ely Porat. Approximating text-to-pattern Hamming distances. In 52nd Symposium on Theory of Computing (STOC), pages 643-656, 2020. URL: https://doi.org/10.1145/3357713.3384266.
Panagiotis Charalampopoulos, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Linear-time algorithm for long LCF with k mismatches. In 29th Symposium on Combinatorial Pattern Matching (CPM), pages 23:1-23:16, 2018. URL: https://doi.org/10.4230/LIPIcs.CPM.2018.23.
Bernard Chazelle. A functional approach to data structures and its use in multidimensional searching. SIAM J. Comput., 17(3):427-462, 1988. URL: https://doi.org/10.1137/0217026.
Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms, 17(1):8:1-8:39, 2021. URL: https://doi.org/10.1145/3426473.
Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana Starikovskaya. Dictionary matching in a stream. In 23rd Annual European Symposium on Algorithms (ESA), pages 361-372, 2015. URL: https://doi.org/10.1007/978-3-662-48350-3_31.
Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana Starikovskaya. The k-mismatch problem revisited. In 27th Symposium on Discrete Algorithms (SODA), pages 2039-2052, 2016. URL: https://doi.org/10.1137/1.9781611974331.ch142.
Raphaël Clifford, Tomasz Kociumaka, and Ely Porat. The streaming k-mismatch problem. In 30th Symposium on Discrete Algorithms (SODA), pages 1106-1125, 2019. URL: https://doi.org/10.1137/1.9781611975482.68.
Raphaël Clifford and Tatiana Starikovskaya. Approximate Hamming distance in a stream. In 43rd International Colloquium on Automata, Languages, and Programming (ICALP), pages 20:1-20:14, 2016. URL: https://doi.org/10.4230/LIPIcs.ICALP.2016.20.
Richard Cole, Tsvi Kopelowitz, and Moshe Lewenstein. Suffix trays and suffix trists: Structures for faster text indexing. Algorithmica, 72(2):450-466, 2015. URL: https://doi.org/10.1007/s00453-013-9860-6.
Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms on strings. Cambridge University Press, 2007.
Maxime Crochemore, Lucian Ilie, Costas S. Iliopoulos, Marcin Kubica, Wojciech Rytter, and Tomasz Walen. Computing the longest previous factor. Eur. J. Comb., 34(1):15-26, 2013. URL: https://doi.org/10.1016/j.ejc.2012.07.011.
Maxime Crochemore and Dominique Perrin. Two-way string matching. J. ACM, 38(3):651-675, 1991. URL: https://doi.org/10.1145/116825.116845.
Aldo de Luca. On the combinatorics of finite words. Theoret. Comput. Sci., 218(1):13-39, 1999. URL: https://doi.org/10.1016/S0304-3975(98)00248-5.
Jean Pierre Duval. Factorizing words over an ordered alphabet. Journal of Algorithms, 4(4):363-381, 1983.
Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Symposium on Foundations of Computer Science (FOCS), pages 137-143, 1997. URL: https://doi.org/10.1109/SFCS.1997.646102.
Martin Farach and S. Muthukrishnan. Perfect hashing for strings: Formalization and algorithms. In 7th Symposium on Combinatorial Pattern Matching (CPM), volume 1075 of Lecture Notes in Computer Science, pages 130-140. Springer, 1996. URL: https://doi.org/10.1007/3-540-61258-0_11.
Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552-581, 2005. URL: https://doi.org/10.1145/1082036.1082039.
Gabriele Fici, Filippo Mignosi, Antonio Restivo, and Marinella Sciortino. Word assembly through minimal forbidden words. Theoret. Comput. Sci., 359(1-3):214-230, 2006. URL: https://doi.org/10.1016/j.tcs.2006.03.006.
Nathan J. Fine and Herbert S. Wilf. Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society, 16(1):109-114, 1965. URL: https://doi.org/10.2307/2034009.
Johannes Fischer, Travis Gagie, Pawel Gawrychowski, and Tomasz Kociumaka. Approximating LZ77 via small-space multiple-pattern matching. In 23rd European Symposium on Algorithms (ESA), pages 533-544, 2015. URL: https://doi.org/10.1007/978-3-662-48350-3_45.
Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1-2:54, 2020. URL: https://doi.org/10.1145/3375890.
Zvi Galil and Joel I. Seiferas. Time-space-optimal string matching. J. Comput. Syst. Sci., 26(3):280-294, 1983. URL: https://doi.org/10.1016/0022-0000(83)90002-8.
Pawel Gawrychowski and Tatiana Starikovskaya. Streaming dictionary matching with mismatches. In 30th Symposium on Combinatorial Pattern Matching (CPM), pages 21:1-21:15, 2019. URL: https://doi.org/10.4230/LIPIcs.CPM.2019.21.
Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, and Ely Porat. The streaming k-mismatch problem: Tradeoffs between space and total time. In 31st Symposium on Combinatorial Pattern Matching (CPM), pages 15:1-15:15, 2020. URL: https://doi.org/10.4230/LIPIcs.CPM.2020.15.
Shay Golan, Tsvi Kopelowitz, and Ely Porat. Towards optimal approximate streaming pattern matching by matching multiple patterns in multiple streams. In 45th International Colloquium on Automata, Languages, and Programming (ICALP), pages 65:1-65:16, 2018. URL: https://doi.org/10.4230/LIPIcs.ICALP.2018.65.
Shay Golan, Tsvi Kopelowitz, and Ely Porat. Streaming pattern matching with d wildcards. Algorithmica, 81(5):1988-2015, 2019. URL: https://doi.org/10.1007/s00453-018-0521-7.
Shay Golan and Ely Porat. Real-time streaming multi-pattern search for constant alphabet. In 25th Annual European Symposium on Algorithms (ESA), pages 41:1-41:15, 2017. URL: https://doi.org/10.4230/LIPIcs.ESA.2017.41.
Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378-407, 2005. URL: https://doi.org/10.1137/S0097539702402354.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. URL: https://doi.org/10.1017/cbo9780511574931.
Wing-Kai Hon, Kunihiko Sadakane, and Wing-Kin Sung. Breaking a time-and-space barrier in constructing full-text indices. SIAM J. Comput., 38(6):2162-2178, 2009. URL: https://doi.org/10.1137/070685373.
Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. J. ACM, 53(6):918-936, 2006. URL: https://doi.org/10.1145/1217856.1217858.
Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In 51st Symposium on Theory of Computing (STOC), pages 756-767, 2019. URL: https://doi.org/10.1145/3313276.3316368.
Dominik Kempa and Tomasz Kociumaka. Breaking the O(n)-barrier in the construction of compressed suffix arrays and suffix trees. In 34th Symposium on Discrete Algorithms (SODA), pages 5122-5202. SIAM, 2023. URL: https://doi.org/10.1137/1.9781611977554.ch187.
Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. CoRR, abs/2308.03635, 2023. URL: https://doi.org/10.48550/arXiv.2308.03635.
Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In 50th Symposium on Theory of Computing (STOC), pages 827-840, 2018. URL: https://doi.org/10.1145/3188745.3188814.
John C. Kieffer and En-Hui Yang. Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory, 46(3):737-754, 2000. URL: https://doi.org/10.1109/18.841160.
Tomasz Kociumaka, Gonzalo Navarro, and Francisco Olivares. Near-optimal search time in δ-optimal space. In 15th Latin American Symposium (LATIN), volume 13568 of Lecture Notes in Computer Science, pages 88-103. Springer, 2022. URL: https://doi.org/10.1007/978-3-031-20624-5_6.
Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Towards a definitive measure of repetitiveness. In 14th Latin American Symposium (LATIN), volume 12118 of Lecture Notes in Computer Science, pages 207-219. Springer, 2020. URL: https://doi.org/10.1007/978-3-030-61792-9_17.
Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Trans. Inf. Theory, 69(4):2074-2092, 2023. URL: https://doi.org/10.1109/TIT.2022.3224382.
Tomasz Kociumaka, Tatiana Starikovskaya, and Hjalte Wedel Vildhøj. Sublinear space algorithms for the longest common substring problem. In 22nd European Symposium on Algorithms (ESA), pages 605-617, 2014. URL: https://doi.org/10.1007/978-3-662-44777-2_50.
Andrei N. Kolmogorov. Three approaches to the quantitative definition of information. International Journal of Computer Mathematics, 2(1-4):157-168, 1968. URL: https://doi.org/10.1080/00207166808803030.
Dmitry Kosolobov, Daniel Valenzuela, Gonzalo Navarro, and Simon J. Puglisi. Lempel-Ziv-like parsing in small space. Algorithmica, 82(11):3195-3215, 2020. URL: https://doi.org/10.1007/s00453-020-00722-6.
Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theoret. Comput. Sci., 483:115-133, 2013. URL: https://doi.org/10.1016/j.tcs.2012.02.006.
Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol., 17(3):281-308, 2010. URL: https://doi.org/10.1089/cmb.2009.0169.
Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935-948, 1993. URL: https://doi.org/10.1137/0222058.
J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Space-efficient construction of compressed indexes in deterministic linear time. In 28th Symposium on Discrete Algorithms (SODA), pages 408-424, 2017. URL: https://doi.org/10.1137/1.9781611974782.26.
Gonzalo Navarro. Compact Data Structures - A Practical Approach. Cambridge University Press, 2016. URL: http://www.cambridge.org/de/academic/subjects/computer-science/algorithmics-complexity-computer-algebra-and-computational-g/compact-data-structures-practical-approach?format=HB.
Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv., 54(2):29:1-29:31, 2022. URL: https://doi.org/10.1145/3434399.
Stav Ben Nun, Shay Golan, Tomasz Kociumaka, and Matan Kraus. Time-space tradeoffs for finding a long common substring. In 31st Symposium on Combinatorial Pattern Matching (CPM), pages 5:1-5:14, 2020. URL: https://doi.org/10.4230/LIPIcs.CPM.2020.5.
Benny Porat and Ely Porat. Exact and approximate pattern matching in the streaming model. In 50th Symposium on Foundations of Computer Science (FOCS), pages 315-323, 2009. URL: https://doi.org/10.1109/FOCS.2009.11.
Jakub Radoszewski and Tatiana Starikovskaya. Streaming k-mismatch with error correcting and applications. Inf. Comput., 271:104513, 2020. URL: https://doi.org/10.1016/j.ic.2019.104513.
Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam D. Smith. Sublinear algorithms for approximating string compressibility. Algorithmica, 65(3):685-709, 2013. URL: https://doi.org/10.1007/s00453-012-9618-6.
Tatiana Starikovskaya and Hjalte Wedel Vildhøj. Time-space trade-offs for the longest common substring problem. In 24th Symposium on Combinatorial Pattern Matching (CPM), pages 223-234, 2013. URL: https://doi.org/10.1007/978-3-642-38905-4_22.
Peter Weiner. Linear pattern matching algorithms. In 14th Symposium on Switching and Automata Theory, pages 1-11, 1973. URL: https://doi.org/10.1109/SWAT.1973.13.
Zong-Da Wu, Tao Jiang, and Wu-Jie Su. Efficient computation of shortest absent words in a genomic sequence. Inf. Process. Lett., 110(14-15):596-601, 2010. URL: https://doi.org/10.1016/j.ipl.2010.05.008.
Andrew Chi-Chih Yao. Near-optimal time-space tradeoff for element distinctness. SIAM J. Comput., 23(5):966-975, 1994. URL: https://doi.org/10.1137/S0097539788148959.
Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337-343, 1977. URL: https://doi.org/10.1109/TIT.1977.1055714.
