Time-Optimal Construction of String Synchronizing Sets

Ellert, Jonas; Kociumaka, Tomasz

doi:10.4230/LIPIcs.STACS.2026.36

Abstract

A powerful design principle behind many modern string algorithms is local consistency: breaking the symmetry between string positions based on their small contexts so that matching fragments are handled consistently. Among the most influential instantiations of this principle are string synchronizing sets [Kempa & Kociumaka; STOC 2019]. A τ-synchronizing set of a string of length n is a set of O(n/τ) string positions, chosen using their length-2τ contexts, such that (outside of highly periodic regions) every block of τ consecutive positions contains at least one element of the set. Synchronizing sets have found dozens of applications in diverse settings, from quantum and dynamic algorithms to fully compressed computation. In the classic word RAM model, particularly for strings over small alphabets, they enabled faster solutions to core problems in data compression, text indexing, and string similarity.
In this work, we show that any string T ∈ [0 .. σ)ⁿ can be preprocessed in O(n log σ / log n) time so that, for any given integer τ ∈ [1 .. n], a τ-synchronizing set of T can be constructed in O((n log τ)/(τ log n)) time. Both bounds are optimal in the word RAM model with machine word size w = Θ(log n), matching the information-theoretic minimum for the input and output sizes, respectively. Previously, constructing a τ-synchronizing set required O(n/τ) time after an O(n)-time preprocessing [Kociumaka, Radoszewski, Rytter, and Waleń; SICOMP 2024], or, in the restricted regime of τ < 0.2 log_σ n, without any preprocessing needed [Kempa & Kociumaka; STOC 2019].
A simple instantiation of our method outputs the synchronizing set as a sorted list in O(n/τ) time, or as a bitmask in O(n/log n) time. Our optimal construction produces a compact fully indexable dictionary, supporting select queries in O(1) time and rank queries in O(log ((log τ)/(log log n))) time. The latter complexity matches known unconditional cell-probe lower bounds for τ ≤ n^{1-Ω(1)}.
To achieve this, we introduce a general framework for efficiently processing sparse integer sequences via a custom variable-length encoding. We also augment the optimal variant of van Emde Boas trees [Pătraşcu & Thorup; STOC 2006] with a deterministic linear-time construction. When the set is represented as a bitmask under our sparse encoding, the same guarantees for select and rank queries hold after preprocessing in time proportional to the size of our encoding (in words).

Shyan Akmal and Ce Jin. Near-optimal quantum algorithms for string problems. Algorithmica, 85(8):2260-2317, 2023. URL: https://doi.org/10.1007/S00453-022-01092-X.
Mai Alzamel, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszynski, Tomasz Waleń, and Wiktor Zuba. Quasi-linear-time algorithm for longest common circular factor. In Nadia Pisanti and Solon P. Pissis, editors, 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, June 18-20, 2019, Pisa, Italy, volume 128 of LIPIcs, pages 25:1-25:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. URL: https://doi.org/10.4230/LIPIcs.CPM.2019.25.
Lorraine A. K. Ayad, Grigorios Loukides, and Solon P. Pissis. Text indexing for long patterns using locally consistent anchors. VLDB J., 34(5):58, 2025. URL: https://doi.org/10.1007/S00778-025-00935-7.
Maxim A. Babenko, Paweł Gawrychowski, Tomasz Kociumaka, and Tatiana Starikovskaya. Wavelet trees meet suffix trees. In Piotr Indyk, editor, Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pages 572-591. SIAM, 2015. URL: https://doi.org/10.1137/1.9781611973730.39.
Michael A. Bender and Martin Farach-Colton. The level ancestor problem simplified. Theor. Comput. Sci., 321(1):5-12, 2004. URL: https://doi.org/10.1016/J.TCS.2003.05.002.
Or Birenzwige, Shay Golan, and Ely Porat. Locally consistent parsing for text indexing in small space. In Shuchi Chawla, editor, Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 607-626. SIAM, 2020. URL: https://doi.org/10.1137/1.9781611975994.37.
Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms Mol. Biol., 14(1):13:1-13:15, 2019. URL: https://doi.org/10.1186/S13015-019-0148-5.
Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, and Jakub Radoszewski. Faster algorithms for longest common substring. In Petra Mutzel, Rasmus Pagh, and Grzegorz Herman, editors, 29th Annual European Symposium on Algorithms, ESA 2021, September 6-8, 2021, Lisbon, Portugal (Virtual Conference), volume 204 of LIPIcs, pages 30:1-30:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.ESA.2021.30.
Panagiotis Charalampopoulos, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba. Counting distinct square substrings in sublinear time. In Paweł Gawrychowski, Filip Mazowiecki, and Michal Skrzypczak, editors, 50th International Symposium on Mathematical Foundations of Computer Science, MFCS 2025, August 25-29, 2025, Warsaw, Poland, volume 345 of LIPIcs, pages 36:1-36:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2025. URL: https://doi.org/10.4230/LIPIcs.MFCS.2025.36.
Panagiotis Charalampopoulos, Solon P. Pissis, and Jakub Radoszewski. Longest palindromic substring in sublinear time. In Hideo Bannai and Jan Holub, editors, 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022, June 27-29, 2022, Prague, Czech Republic, volume 223 of LIPIcs, pages 20:1-20:9. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022. URL: https://doi.org/10.4230/LIPIcs.CPM.2022.20.
Graham Cormode and S. Muthukrishnan. Substring compression problems. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2005, Vancouver, British Columbia, Canada, January 23-25, 2005, pages 321-330. SIAM, 2005. URL: http://dl.acm.org/citation.cfm?id=1070432.1070478.
Graham Cormode and S. Muthukrishnan. The string edit distance matching problem with moves. ACM Trans. Algorithms, 3(1):2:1-2:19, 2007. URL: https://doi.org/10.1145/1219944.1219947.
Patrick Dinklage, Johannes Fischer, Alexander Herlez, Tomasz Kociumaka, and Florian Kurpicz. Practical performance of space efficient data structures for longest common extensions. In Fabrizio Grandoni, Grzegorz Herman, and Peter Sanders, editors, 28th Annual European Symposium on Algorithms, ESA 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference), volume 173 of LIPIcs, pages 39:1-39:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL: https://doi.org/10.4230/LIPIcs.ESA.2020.39.
Peter Elias. Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory, 21(2):194-203, 1975. URL: https://doi.org/10.1109/TIT.1975.1055349.
Jonas Ellert. Sublinear time Lempel-Ziv (LZ77) factorization. In Franco Maria Nardini, Nadia Pisanti, and Rossano Venturini, editors, String Processing and Information Retrieval - 30th International Symposium, SPIRE 2023, Pisa, Italy, September 26-28, 2023, Proceedings, volume 14240 of Lecture Notes in Computer Science, pages 171-187. Springer, 2023. URL: https://doi.org/10.1007/978-3-031-43980-3_14.
N. J. Fine and H. S. Wilf. Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society, 16:109-114, 1965. URL: https://doi.org/10.1090/S0002-9939-1965-0174934-9.
Ragnar Groot Koerkamp. Optimal Throughput Bioinformatics. PhD thesis, ETH Zurich, 2025. URL: https://doi.org/10.3929/ethz-c-000783091.
Torben Hagerup. Sorting and searching on the word RAM. In Michel Morvan, Christoph Meinel, and Daniel Krob, editors, STACS 98, 15th Annual Symposium on Theoretical Aspects of Computer Science, Paris, France, February 25-27, 1998, Proceedings, volume 1373 of Lecture Notes in Computer Science, pages 366-398. Springer, 1998. URL: https://doi.org/10.1007/BFB0028575.
Artur Jeż. Approximation of grammar-based compression via recompression. Theor. Comput. Sci., 592:115-134, 2015. URL: https://doi.org/10.1016/J.TCS.2015.05.027.
Artur Jeż. Recompression: A simple and powerful technique for word equations. J. ACM, 63(1):4:1-4:51, 2016. URL: https://doi.org/10.1145/2743014.
Ce Jin and Jakob Nogler. Quantum speed-ups for string synchronizing sets, longest common substring, and k-mismatch matching. ACM Trans. Algorithms, 20(4):32:1-32:36, 2024. URL: https://doi.org/10.1145/3672395.
Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 756-767. ACM, 2019. URL: https://doi.org/10.1145/3313276.3316368.
Dominik Kempa and Tomasz Kociumaka. Dynamic suffix array with polylogarithmic queries and updates. In Stefano Leonardi and Anupam Gupta, editors, STOC '22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 1657-1670. ACM, 2022. URL: https://doi.org/10.1145/3519935.3520061.
Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. Commun. ACM, 65(6):91-98, 2022. URL: https://doi.org/10.1145/3531445.
Dominik Kempa and Tomasz Kociumaka. Breaking the O(n)-barrier in the construction of compressed suffix arrays and suffix trees. In Nikhil Bansal and Viswanath Nagarajan, editors, Proceedings of the 2023 ACM-SIAM Symposium on Discrete Algorithms, SODA 2023, Florence, Italy, January 22-25, 2023, pages 5122-5202. SIAM, 2023. URL: https://doi.org/10.1137/1.9781611977554.CH187.
Dominik Kempa and Tomasz Kociumaka. Lempel-Ziv (LZ77) factorization in sublinear time. In 65th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2024, Chicago, IL, USA, October 27-30, 2024, pages 2045-2055. IEEE, 2024. URL: https://doi.org/10.1109/FOCS61266.2024.00122.
Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Internal pattern matching queries in a text and applications. In Piotr Indyk, editor, Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pages 532-551. SIAM, 2015. URL: https://doi.org/10.1137/1.9781611973730.36.
Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Internal pattern matching queries in a text and applications. SIAM J. Comput., 53(5):1524-1577, 2024. URL: https://doi.org/10.1137/23M1567618.
Roman M. Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In 40th Annual Symposium on Foundations of Computer Science, FOCS 1999, New York, NY, USA, October 17-18, 1999, pages 596-604. IEEE Computer Society, 1999. URL: https://doi.org/10.1109/SFFCS.1999.814634.
Grigorios Loukides and Solon P. Pissis. Bidirectional string anchors: A new string sampling mechanism. In Petra Mutzel, Rasmus Pagh, and Grzegorz Herman, editors, 29th Annual European Symposium on Algorithms, ESA 2021, September 6-8, 2021, Lisbon, Portugal (Virtual Conference), volume 204 of LIPIcs, pages 64:1-64:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.ESA.2021.64.
Mamoru Maekawa. A square root N algorithm for mutual exclusion in decentralized systems. ACM Trans. Comput. Syst., 3(2):145-159, 1985. URL: https://doi.org/10.1145/214438.214445.
Kurt Mehlhorn, R. Sundar, and Christian Uhrig. Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica, 17(2):183-198, 1997. URL: https://doi.org/10.1007/BF02522825.
Mihai Pătraşcu and Mikkel Thorup. Time-space trade-offs for predecessor search. In Jon M. Kleinberg, editor, Proceedings of the 38th Annual ACM Symposium on Theory of Computing, Seattle, WA, USA, May 21-23, 2006, pages 232-240. ACM, 2006. URL: https://doi.org/10.1145/1132516.1132551.
Mihai Pătraşcu and Mikkel Thorup. Time-space trade-offs for predecessor search, 2006. URL: https://doi.org/10.48550/arXiv.cs/0603043.
Jakub Radoszewski and Wiktor Zuba. Computing string covers in sublinear time. In Zsuzsanna Lipták, Edleno Silva de Moura, Karina Figueroa, and Ricardo Baeza-Yates, editors, String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23-25, 2024, Proceedings, volume 14899 of Lecture Notes in Computer Science, pages 272-288. Springer, 2024. URL: https://doi.org/10.1007/978-3-031-72200-4_21.
Michael Roberts, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363-3369, 2004. URL: https://doi.org/10.1093/bioinformatics/bth408.
Milan Ruzic. Constructing efficient dictionaries in close to sorting time. In Luca Aceto, Ivan Damgård, Leslie Ann Goldberg, Magnús M. Halldórsson, Anna Ingólfsdóttir, and Igor Walukiewicz, editors, Automata, Languages and Programming, 35th International Colloquium, ICALP 2008, Reykjavik, Iceland, July 7-11, 2008, Proceedings, Part I: Track A: Algorithms, Automata, Complexity, and Games, volume 5125 of Lecture Notes in Computer Science, pages 84-95. Springer, 2008. URL: https://doi.org/10.1007/978-3-540-70575-8_8.
Süleyman Cenk Sahinalp and Uzi Vishkin. On a parallel-algorithms method for string matching problems. In Maurizio A. Bonuccelli, Pierluigi Crescenzi, and Rossella Petreschi, editors, Algorithms and Complexity, Second Italian Conference, CIAC '94, Rome, Italy, February 23-25, 1994, Proceedings, volume 778 of Lecture Notes in Computer Science, pages 22-32. Springer, 1994. URL: https://doi.org/10.1007/3-540-57811-0_3.
Süleyman Cenk Sahinalp and Uzi Vishkin. Symmetry breaking for suffix tree construction. In Frank Thomson Leighton and Michael T. Goodrich, editors, Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, 23-25 May 1994, Montréal, Québec, Canada, pages 300-309. ACM, 1994. URL: https://doi.org/10.1145/195058.195164.
Süleyman Cenk Sahinalp and Uzi Vishkin. Efficient approximate and dynamic matching of patterns using a labeling paradigm (extended abstract). In 37th Annual Symposium on Foundations of Computer Science, FOCS 1996, Burlington, Vermont, USA, 14-16 October, 1996, pages 320-328. IEEE Computer Society, 1996. URL: https://doi.org/10.1109/SFCS.1996.548491.
Saul Schleimer, Daniel Shawcross Wilkerson, and Alex Aiken. Winnowing: Local algorithms for document fingerprinting. In Alon Y. Halevy, Zachary G. Ives, and AnHai Doan, editors, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, pages 76-85. ACM, 2003. URL: https://doi.org/10.1145/872757.872770.
Peter van Emde Boas. Preserving order in a forest in less than logarithmic time. In 16th Annual Symposium on Foundations of Computer Science, Berkeley, California, USA, October 13-15, 1975, pages 75-84. IEEE Computer Society, 1975. URL: https://doi.org/10.1109/SFCS.1975.26.

Time-Optimal Construction of String Synchronizing Sets

Authors Jonas Ellert , Tomasz Kociumaka

Files

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

Time-Optimal Construction of String Synchronizing Sets

Authors Jonas Ellert , Tomasz Kociumaka

Files

Document Identifiers

Related Versions

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

References

Thanks for your feedback!

Could not send message