DROPS

Document

DOI: 10.4230/LIPIcs.STACS.2026.36

Time-Optimal Construction of String Synchronizing Sets

Authors: Jonas Ellert and Tomasz Kociumaka

Published in: LIPIcs, Volume 364, 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)

Abstract

A powerful design principle behind many modern string algorithms is local consistency: breaking the symmetry between string positions based on their small contexts so that matching fragments are handled consistently. Among the most influential instantiations of this principle are string synchronizing sets [Kempa & Kociumaka; STOC 2019]. A τ-synchronizing set of a string of length n is a set of O(n/τ) string positions, chosen using their length-2τ contexts, such that (outside of highly periodic regions) every block of τ consecutive positions contains at least one element of the set. Synchronizing sets have found dozens of applications in diverse settings, from quantum and dynamic algorithms to fully compressed computation. In the classic word RAM model, particularly for strings over small alphabets, they enabled faster solutions to core problems in data compression, text indexing, and string similarity. In this work, we show that any string T ∈ [0 .. σ)ⁿ can be preprocessed in O(n log σ / log n) time so that, for any given integer τ ∈ [1 .. n], a τ-synchronizing set of T can be constructed in O((n log τ)/(τ log n)) time. Both bounds are optimal in the word RAM model with machine word size w = Θ(log n), matching the information-theoretic minimum for the input and output sizes, respectively. Previously, constructing a τ-synchronizing set required O(n/τ) time after an O(n)-time preprocessing [Kociumaka, Radoszewski, Rytter, and Waleń; SICOMP 2024], or, in the restricted regime of τ < 0.2 log_σ n, without any preprocessing needed [Kempa & Kociumaka; STOC 2019]. A simple instantiation of our method outputs the synchronizing set as a sorted list in O(n/τ) time, or as a bitmask in O(n/log n) time. Our optimal construction produces a compact fully indexable dictionary, supporting select queries in O(1) time and rank queries in O(log ((log τ)/(log log n))) time. The latter complexity matches known unconditional cell-probe lower bounds for τ ≤ n^{1-Ω(1)}. To achieve this, we introduce a general framework for efficiently processing sparse integer sequences via a custom variable-length encoding. We also augment the optimal variant of van Emde Boas trees [Pătraşcu & Thorup; STOC 2006] with a deterministic linear-time construction. When the set is represented as a bitmask under our sparse encoding, the same guarantees for select and rank queries hold after preprocessing in time proportional to the size of our encoding (in words).

Cite as

Jonas Ellert and Tomasz Kociumaka. Time-Optimal Construction of String Synchronizing Sets. In 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026). Leibniz International Proceedings in Informatics (LIPIcs), Volume 364, pp. 36:1-36:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)

Copy BibTex To Clipboard

@InProceedings{ellert_et_al:LIPIcs.STACS.2026.36,
  author =	{Ellert, Jonas and Kociumaka, Tomasz},
  title =	{{Time-Optimal Construction of String Synchronizing Sets}},
  booktitle =	{43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)},
  pages =	{36:1--36:22},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-412-3},
  ISSN =	{1868-8969},
  year =	{2026},
  volume =	{364},
  editor =	{Mahajan, Meena and Manea, Florin and McIver, Annabelle and Thắng, Nguy\~{ê}n Kim},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2026.36},
  URN =		{urn:nbn:de:0030-drops-255258},
  doi =		{10.4230/LIPIcs.STACS.2026.36},
  annote =	{Keywords: synchronizing sets, local consistency, packed strings}
}

Document

DOI: 10.4230/LIPIcs.STACS.2026.62

Relative Compressed Reverse Suffix Array

Authors: Muhammed Oguzhan Kulekci, Mano Prakash Parthasarathi, Rahul Shah, and Sharma V. Thankachan

Published in: LIPIcs, Volume 364, 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)

Abstract

Suffix trees and suffix arrays are two fundamental data structures in the field of string algorithms. For a string (a.k.a. text or sequence) of length n over an alphabet of size σ, these structures typically require O(nlog n) bits of space. The FM-index provides a compressed representation of the suffix array in ≈ nlog σ bits, allowing for efficient queries on both the suffix array and its inverse array in near logarithmic time. In certain applications, such as approximate pattern matching (i.e., with wildcards, mismatches, edits), there is a need to access the suffix array of a text, as well as the suffix array of text’s reverse. Motivated by this, we explore the possibility of encoding the suffix array of the reversed text in a compact form, assuming the availability of the FM-index for the original text. Our first solution is an O(n)-bit (relative) encoding of the suffix array of the reversed text, with the time for decoding an entry being only O(log^*n) times that of decoding an entry in the text’s suffix array using FM-index. We then demonstrate how to reduce the space to O(n/κ) bits for a parameter κ, while multiplicative factor in time becomes approximately O(κlog^*n+κ³). We can also support inverse suffix array and longest common extension queries on the reversed text. These results are achieved through some careful and non-trivial application of various succinct data structure techniques.

Cite as

Muhammed Oguzhan Kulekci, Mano Prakash Parthasarathi, Rahul Shah, and Sharma V. Thankachan. Relative Compressed Reverse Suffix Array. In 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026). Leibniz International Proceedings in Informatics (LIPIcs), Volume 364, pp. 62:1-62:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)

Copy BibTex To Clipboard

@InProceedings{kulekci_et_al:LIPIcs.STACS.2026.62,
  author =	{Kulekci, Muhammed Oguzhan and Parthasarathi, Mano Prakash and Shah, Rahul and Thankachan, Sharma V.},
  title =	{{Relative Compressed Reverse Suffix Array}},
  booktitle =	{43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)},
  pages =	{62:1--62:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-412-3},
  ISSN =	{1868-8969},
  year =	{2026},
  volume =	{364},
  editor =	{Mahajan, Meena and Manea, Florin and McIver, Annabelle and Thắng, Nguy\~{ê}n Kim},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2026.62},
  URN =		{urn:nbn:de:0030-drops-255512},
  doi =		{10.4230/LIPIcs.STACS.2026.62},
  annote =	{Keywords: String Matching, Text Indexing, Data Structures, Suffix Trees}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2025.9

Small Space Encoding and Recognition of k-Palindromic Prefixes

Authors: Gabriel Bathie, Jonas Ellert, and Tatiana Starikovskaya

Published in: LIPIcs, Volume 359, 36th International Symposium on Algorithms and Computation (ISAAC 2025)

Abstract

Palindromes are non-empty strings that read the same forward and backward. We study the problem of recognizing so-called k-palindromic strings, which can be represented as the concatenation of exactly k palindromes. [Rubinchik and Shur, MFCS 2020] showed that the problem is solvable in linear space and time. We present a read-only algorithm that recognizes all k-palindromic prefixes of a string T of length n in O(n ⋅ 6^{k²} ⋅ log^k n) time and O(6^{k²} ⋅ log^k n) space. As a corollary, we also obtain a read-only algorithm for computing the palindromic length of T, i.e., the smallest k such that T is k-palindromic, in O(n ⋅ 6^{k²} ⋅ log^⌈k/2⌉ n) time and O(6^{k²} ⋅ log^⌈k/2⌉ n) space.

Cite as

Gabriel Bathie, Jonas Ellert, and Tatiana Starikovskaya. Small Space Encoding and Recognition of k-Palindromic Prefixes. In 36th International Symposium on Algorithms and Computation (ISAAC 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 359, pp. 9:1-9:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{bathie_et_al:LIPIcs.ISAAC.2025.9,
  author =	{Bathie, Gabriel and Ellert, Jonas and Starikovskaya, Tatiana},
  title =	{{Small Space Encoding and Recognition of k-Palindromic Prefixes}},
  booktitle =	{36th International Symposium on Algorithms and Computation (ISAAC 2025)},
  pages =	{9:1--9:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-408-6},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{359},
  editor =	{Chen, Ho-Lin and Hon, Wing-Kai and Tsai, Meng-Tsung},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2025.9},
  URN =		{urn:nbn:de:0030-drops-249178},
  doi =		{10.4230/LIPIcs.ISAAC.2025.9},
  annote =	{Keywords: palindromic length, read-only algorithms, palindromes}
}

Document

DOI: 10.4230/LIPIcs.ESA.2025.60

Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars

Authors: Jannik Olbrich

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)

Abstract

The Burrows-Wheeler Transform (BWT) serves as the basis for many important sequence indexes. On very large datasets (e.g. genomic databases), classical BWT construction algorithms are often infeasible because they usually need to have the entire dataset in main memory. Fortunately, such large datasets are often highly repetitive. It can thus be beneficial to compute the BWT from a compressed representation. We propose an algorithm for computing the BWT via the Lyndon straight-line program, a grammar based on the standard factorization of Lyndon words. Our algorithm can also be used to compute the extended BWT (eBWT) of a multiset of sequences. We empirically evaluate our implementation and find that we can compute the BWT and eBWT of very large datasets faster and/or with less memory than competing methods.

Cite as

Jannik Olbrich. Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 60:1-60:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{olbrich:LIPIcs.ESA.2025.60,
  author =	{Olbrich, Jannik},
  title =	{{Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{60:1--60:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.60},
  URN =		{urn:nbn:de:0030-drops-245286},
  doi =		{10.4230/LIPIcs.ESA.2025.60},
  annote =	{Keywords: Burrows-Wheeler Transform, Grammar compression}
}

Document

DOI: 10.4230/LIPIcs.ESA.2025.45

Bounded Weighted Edit Distance: Dynamic Algorithms and Matching Lower Bounds

Authors: Itai Boneh, Egor Gorbachev, and Tomasz Kociumaka

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)

Abstract

The edit distance ed(X,Y) of two strings X,Y ∈ Σ^* is the minimum number of character edits (insertions, deletions, and substitutions) needed to transform X into Y. Its weighted counterpart ed^w(X,Y) minimizes the total cost of edits, where the costs of individual edits, depending on the edit type and the characters involved, are specified using a function w, normalized so that each edit costs at least one. The textbook dynamic-programming procedure, given strings X,Y ∈ Σ^{≤ n} and oracle access to w, computes ed^w(X,Y) in 𝒪(n²) time. Nevertheless, one can achieve better running times if the computed distance, denoted k, is small: 𝒪(n+k²) for unit weights [Landau and Vishkin; JCSS'88] and Õ(n+√{nk³}) for arbitrary weights [Cassis, Kociumaka, Wellnitz; FOCS'23]. In this paper, we study the dynamic version of the weighted edit distance problem, where the goal is to maintain ed^w(X,Y) for strings X,Y ∈ Σ^{≤ n} that change over time, with each update specified as an edit in X or Y. Very recently, Gorbachev and Kociumaka [STOC'25] showed that the unweighted distance ed(X,Y) can be maintained in Õ(k) time per update after Õ(n+k²)-time preprocessing; here, k denotes the current value of ed(X,Y). Their algorithm generalizes to small integer weights, but the underlying approach is incompatible with large weights. Our main result is a dynamic algorithm that maintains ed^w(X,Y) in Õ(k^{3-γ}) time per update after Õ(nk^γ)-time preprocessing. Here, γ ∈ [0,1] is a real trade-off parameter and k ≥ 1 is an integer threshold fixed at preprocessing time, with ∞ returned whenever ed^w(X,Y) > k. We complement our algorithm with conditional lower bounds showing fine-grained optimality of our trade-off for γ ∈ [0.5,1) and justifying our choice to fix k. We also generalize our solution to a much more robust setting while preserving the fine-grained optimal trade-off. Our full algorithm maintains X ∈ Σ^{≤ n} subject not only to character edits but also substring deletions and copy-pastes, each supported in Õ(k²) time. Instead of dynamically maintaining Y, it answers queries that, given any string Y specified through a sequence of 𝒪(k) arbitrary edits transforming X into Y, in Õ(k^{3-γ}) time compute ed^w(X,Y) or report that ed^w(X,Y) > k.

Cite as

Itai Boneh, Egor Gorbachev, and Tomasz Kociumaka. Bounded Weighted Edit Distance: Dynamic Algorithms and Matching Lower Bounds. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 45:1-45:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{boneh_et_al:LIPIcs.ESA.2025.45,
  author =	{Boneh, Itai and Gorbachev, Egor and Kociumaka, Tomasz},
  title =	{{Bounded Weighted Edit Distance: Dynamic Algorithms and Matching Lower Bounds}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{45:1--45:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.45},
  URN =		{urn:nbn:de:0030-drops-245139},
  doi =		{10.4230/LIPIcs.ESA.2025.45},
  annote =	{Keywords: Edit distance, dynamic algorithms, conditional lower bounds}
}

Document

DOI: 10.4230/LIPIcs.MFCS.2025.49

Morphisms and BWT-Run Sensitivity

Authors: Gabriele Fici, Giuseppe Romana, Marinella Sciortino, and Cristian Urbina

Published in: LIPIcs, Volume 345, 50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)

Abstract

We study how the application of morphisms affects the number r of equal-letter runs in the Burrows–Wheeler Transform (BWT). This parameter has emerged as a key repetitiveness measure in compressed indexing. We focus on the notion of BWT-run sensitivity after application of morphisms. For binary alphabets, we characterize the class of injective morphisms that preserve the number of BWT-runs up to a bounded additive increase by showing that it coincides with the known class of primitivity-preserving morphisms, which are those that map primitive words to primitive words. We further prove that deciding whether a given binary morphism has bounded BWT-run sensitivity is possible in polynomial time with respect to the total length of the images of the two letters. Additionally, we explore new structural and combinatorial properties of synchronizing and recognizable morphisms. These results establish new connections between BWT-based compressibility, code theory, and symbolic dynamics.

Cite as

Gabriele Fici, Giuseppe Romana, Marinella Sciortino, and Cristian Urbina. Morphisms and BWT-Run Sensitivity. In 50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 345, pp. 49:1-49:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{fici_et_al:LIPIcs.MFCS.2025.49,
  author =	{Fici, Gabriele and Romana, Giuseppe and Sciortino, Marinella and Urbina, Cristian},
  title =	{{Morphisms and BWT-Run Sensitivity}},
  booktitle =	{50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)},
  pages =	{49:1--49:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-388-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{345},
  editor =	{Gawrychowski, Pawe{\l} and Mazowiecki, Filip and Skrzypczak, Micha{\l}},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2025.49},
  URN =		{urn:nbn:de:0030-drops-241567},
  doi =		{10.4230/LIPIcs.MFCS.2025.49},
  annote =	{Keywords: Burrows-Wheeler transform, BWT-runs, morphism, pure code, repetitiveness}
}

Document

DOI: 10.4230/LIPIcs.MFCS.2025.36

Counting Distinct Square Substrings in Sublinear Time

Authors: Panagiotis Charalampopoulos, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba

Published in: LIPIcs, Volume 345, 50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)

Abstract

We show that the number of distinct squares in a packed string of length n over an alphabet of size σ can be computed in 𝒪(n/log_{σ}n) time in the word-RAM model of computation. This paper is the first to introduce a sublinear time algorithm for the packed version of squares counting. The packed representation of a string of length n over an alphabet of size σ is given as a sequence of 𝒪(n/ log_{σ} n) machine words in the word-RAM model (a machine word consists of ω ≥ log₂ n bits). Previously it was known how to count distinct squares in 𝒪(n) time [Gusfield and Stoye, JCSS 2004], even for a string over an integer alphabet, see [Crochemore et al., TCS 2014; Bannai et al., CPM 2017; Charalampopoulos et al., SPIRE 2020]. We use techniques of squares extraction from runs described by Crochemore et al. [TCS 2014]. However, the packed model requires novel approaches. In particular, we need an 𝒪(n/log_{σ}n) sized representation of all long-period runs (runs with periods that are Ω(log_{σ}n)) which guarantees sublinear time counting of potentially linearly-many implied squares. The long-period runs with a string period that is periodic itself (called layer runs) are an obstacle, since their number can be Ω(n). Fortunately, the number of all other long-period runs is 𝒪(n/log_{σ}n) and we can construct an implicit representation of all long-period runs in 𝒪(n/log_{σ}n) time by adopting the insights of Amir et al. [ESA 2019], combined with sublinear time tools provided by the PILLAR model of computations in case of packed strings. We count squares in layer runs in sublinear time by exploiting combinatorial properties of types of pyramidally-shaped groups of layer runs. As a by-product, we discover several new structural properties of runs. Another difficulty is to compute, in sublinear time, locations of Lyndon roots of runs in packed strings, which is needed for grouping of runs that can generate equal squares. To overcome this difficulty, we introduce sparse-Lyndon roots which are based on the notion of string synchronizers proposed by Kempa and Kociumaka [STOC 2019].

Cite as

Panagiotis Charalampopoulos, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba. Counting Distinct Square Substrings in Sublinear Time. In 50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 345, pp. 36:1-36:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{charalampopoulos_et_al:LIPIcs.MFCS.2025.36,
  author =	{Charalampopoulos, Panagiotis and Mohamed, Manal and Radoszewski, Jakub and Rytter, Wojciech and Wale\'{n}, Tomasz and Zuba, Wiktor},
  title =	{{Counting Distinct Square Substrings in Sublinear Time}},
  booktitle =	{50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)},
  pages =	{36:1--36:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-388-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{345},
  editor =	{Gawrychowski, Pawe{\l} and Mazowiecki, Filip and Skrzypczak, Micha{\l}},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2025.36},
  URN =		{urn:nbn:de:0030-drops-241439},
  doi =		{10.4230/LIPIcs.MFCS.2025.36},
  annote =	{Keywords: square in a string, packed model, run (maximal repetition), Lyndon word}
}

@InProceedings{charalampopoulos_et_al:LIPIcs.MFCS.2025.36,
  author =	{Charalampopoulos, Panagiotis and Mohamed, Manal and Radoszewski, Jakub and Rytter, Wojciech and Wale\'{n}, Tomasz and Zuba, Wiktor},
  title =	{{Counting Distinct Square Substrings in Sublinear Time}},
  booktitle =	{50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)},
  pages =	{36:1--36:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-388-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{345},
  editor =	{Gawrychowski, Pawe{\l} and Mazowiecki, Filip and Skrzypczak, Micha{\l}},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2025.36},
  URN =		{urn:nbn:de:0030-drops-241439},
  doi =		{10.4230/LIPIcs.MFCS.2025.36},
  annote =	{Keywords: square in a string, packed model, run (maximal repetition), Lyndon word}
}

Document

DOI: 10.4230/OASIcs.Manzini.1

BWT and Combinatorics on Words

Authors: Gabriele Fici, Sabrina Mantaci, Antonio Restivo, Giuseppe Romana, Giovanna Rosone, and Marinella Sciortino

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

The Burrows-Wheeler Transform (BWT) is a reversible transformation on words (strings) introduced in 1994 in the context of data compression, which is a permutation of the characters in the word. Its clustering effect, i.e., the remarkable property of grouping identical characters (BWT runs) when they share common contexts, has made it a powerful tool for boosting compression performances and enabling efficient pattern searching in highly repetitive string collections. In this chapter, we analyze the Burrows-Wheeler transform under the combinatorial point of view, and we survey known properties and connections with different aspects of combinatorics on words. In particular, we focus on the properties of words in relation to the number of their BWT runs. The value r, which counts the number of BWT runs, impacts both compression performance and indexing efficiency, and is considered a measure to evaluate the above-mentioned clustering effect and, consequently, the repetitiveness of a word. We give an overview of the results relating r to other combinatorial repetitiveness measures related to the factor complexity. The chapter also explores extremal cases of the clustering effect. Finally, some results on the sensitivity of the measure r are considered, where the effects of combinatorial operations are studied, such as reversal, edits, and the application of morphisms.

Cite as

Gabriele Fici, Sabrina Mantaci, Antonio Restivo, Giuseppe Romana, Giovanna Rosone, and Marinella Sciortino. BWT and Combinatorics on Words. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 1:1-1:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{fici_et_al:OASIcs.Manzini.1,
  author =	{Fici, Gabriele and Mantaci, Sabrina and Restivo, Antonio and Romana, Giuseppe and Rosone, Giovanna and Sciortino, Marinella},
  title =	{{BWT and Combinatorics on Words}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{1:1--1:23},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.1},
  URN =		{urn:nbn:de:0030-drops-239090},
  doi =		{10.4230/OASIcs.Manzini.1},
  annote =	{Keywords: Burrows-Wheeler Transform, Combinatorics on Words, Clustering Effect, BWT Runs}
}

Document

DOI: 10.4230/OASIcs.Manzini.2

A Survey of the Bijective Burrows-Wheeler Transform

Authors: Hideo Bannai, Dominik Köppl, and Zsuzsanna Lipták

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

The Bijective BWT (BBWT), conceived by Scott in 2007, later summarized in a preprint by Gil and Scott in 2009 (arXiv 2012), is a variant of the Burrows-Wheeler Transform which is bijective: every string is the BBWT of some string. Indeed, the BBWT of a string is the extended BWT [Mantaci et al., 2007] of the factors of its Lyndon factorization. The BBWT has been receiving increasing interest in recent years. In this paper, we survey existing research on the BBWT, starting with its history and motivation. We then present algorithmic topics including construction algorithms with various complexities and an index on top of the BBWT for pattern matching. We subsequently address some properties of the BBWT as a compressor, discussing robustness to operations such as reversal, edits, rotation, as well as compression power. We close with listing other bijective variants of the BWT and open problems concerning the BBWT.

Cite as

Hideo Bannai, Dominik Köppl, and Zsuzsanna Lipták. A Survey of the Bijective Burrows-Wheeler Transform. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 2:1-2:26, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{bannai_et_al:OASIcs.Manzini.2,
  author =	{Bannai, Hideo and K\"{o}ppl, Dominik and Lipt\'{a}k, Zsuzsanna},
  title =	{{A Survey of the Bijective Burrows-Wheeler Transform}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{2:1--2:26},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.2},
  URN =		{urn:nbn:de:0030-drops-239100},
  doi =		{10.4230/OASIcs.Manzini.2},
  annote =	{Keywords: Burrows-Wheeler Transform, compression, text indexing, repetitiveness measure, Lyndon words, index construction algorithms, bijective string transformation}
}

Document

DOI: 10.4230/OASIcs.Manzini.5

FM-Adaptive: A Practical Data-Aware FM-Index

Authors: Hongwei Huo, Zongtao He, Pengfei Liu, and Jeffrey Scott Vitter

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

The FM-index provides an important solution for efficient retrieval and search in textual big data. Its variants have been widely used in many fields including information retrieval, genome analysis, and web searching. In this paper, we propose improvements via a new compressed representation of the wavelet tree of the Burrows-Wheeler transform of the input text, which incorporates the gap γ-encoding. Our theoretical analysis shows that the new index, called FM-Adaptive, achieves asymptotic space optimality within a factor of 2 in the leading term, but it has a better compression and faster retrieval in practice than the competitive optimal compression boosting used in previous FM-indexes. We present a practical improved locate algorithm that provides substantially faster locating time based upon memoization, which takes advantage of the overlapping subproblems property. We design the lookup table for accelerated decoding to support fast pattern matching in a text. Extensive experiments demonstrate that FM-Adaptive provides faster query performance, often by a considerable amount, and/or comparable or better compression than other state-of-the-art FM-index methods.

Cite as

Hongwei Huo, Zongtao He, Pengfei Liu, and Jeffrey Scott Vitter. FM-Adaptive: A Practical Data-Aware FM-Index. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 5:1-5:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{huo_et_al:OASIcs.Manzini.5,
  author =	{Huo, Hongwei and He, Zongtao and Liu, Pengfei and Vitter, Jeffrey Scott},
  title =	{{FM-Adaptive: A Practical Data-Aware FM-Index}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{5:1--5:23},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.5},
  URN =		{urn:nbn:de:0030-drops-239139},
  doi =		{10.4230/OASIcs.Manzini.5},
  annote =	{Keywords: Text indexing, Burrows-Wheeler transform, Compressed wavelet trees, Entropy-compressed, Compressed data structures}
}

Document

DOI: 10.4230/OASIcs.Manzini.4

Wavelet Tree, Part II: Text Indexing

Authors: Paolo Ferragina, Raffaele Giancarlo, Roberto Grossi, Giovanna Rosone, Rossano Venturini, and Jeffrey Scott Vitter

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

The Wavelet Tree data structure introduced in Grossi, Gupta, and Vitter [Grossi et al., 2003] is a space-efficient technique for rank and select queries that generalizes from binary symbols to an arbitrary multisymbol alphabet. Over the last two decades, it has become a pivotal tool in modern full-text indexing and data compression because of its properties and capabilities in compressing and indexing data, with many applications to information retrieval, genome analysis, data mining, and web search. In this paper, we survey the fascinating history and impact of Wavelet Trees; no doubt many more developments are yet to come. Our survey borrows some content from the authors' earlier works. This paper is divided into two parts: The first part gives a brief history of Wavelet Trees, including its varieties and practical implementations, which appears in the Festschrift dedicated to Roberto Grossi; the second part (this one) deals with Wavelet Tree-based text indexing and is included in the Festschrift dedicated to Giovanni Manzini.

Cite as

Paolo Ferragina, Raffaele Giancarlo, Roberto Grossi, Giovanna Rosone, Rossano Venturini, and Jeffrey Scott Vitter. Wavelet Tree, Part II: Text Indexing. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 4:1-4:10, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{ferragina_et_al:OASIcs.Manzini.4,
  author =	{Ferragina, Paolo and Giancarlo, Raffaele and Grossi, Roberto and Rosone, Giovanna and Venturini, Rossano and Vitter, Jeffrey Scott},
  title =	{{Wavelet Tree, Part II: Text Indexing}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{4:1--4:10},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.4},
  URN =		{urn:nbn:de:0030-drops-239127},
  doi =		{10.4230/OASIcs.Manzini.4},
  annote =	{Keywords: Wavelet tree, data compression, text indexing, compressed suffix array, Burrows-Wheeler transform, rank and select}
}

Document

DOI: 10.4230/OASIcs.Manzini.11

Circular Dictionary Matching Using Extended BWT

Authors: Wing-Kai Hon, Rahul Shah, and Sharma V. Thankachan

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

The dictionary matching problem involves preprocessing a set of strings (patterns) into a data structure that efficiently identifies all occurrences of these patterns within a query string (text). In this work, we investigate a variation of this problem, termed circular dictionary matching, where the patterns are circular, meaning their cyclic shifts are also considered valid patterns. Such patterns naturally occur in areas such as bioinformatics and computational geometry. Based on the extended Burrows-Wheeler Transformation (eBWT), we design a space-efficient solution for this problem. Specifically, we show that a dictionary of d circular patterns of total length n can be indexed in nlog σ + O(n+dlog n+σ log n) bits of space and support circular dictionary matching on a query text T in O((|T|+occ)log n) time, where σ represents the size of the underlying alphabet and occ represents the output size.

Cite as

Wing-Kai Hon, Rahul Shah, and Sharma V. Thankachan. Circular Dictionary Matching Using Extended BWT. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 11:1-11:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{hon_et_al:OASIcs.Manzini.11,
  author =	{Hon, Wing-Kai and Shah, Rahul and Thankachan, Sharma V.},
  title =	{{Circular Dictionary Matching Using Extended BWT}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{11:1--11:14},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.11},
  URN =		{urn:nbn:de:0030-drops-239195},
  doi =		{10.4230/OASIcs.Manzini.11},
  annote =	{Keywords: String algorithms, Burrows-Wheeler transformation, suffix trees, succinct data structures}
}

Document

DOI: 10.4230/OASIcs.Manzini.8

A Taxonomy of LCP-Array Construction Algorithms

Authors: Johannes Fischer and Enno Ohlebusch

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

The combination of the suffix array and the LCP-array can be used to solve many string processing problems efficiently. We review some of the most important sequential LCP-array construction algorithms in random access memory.

Cite as

Johannes Fischer and Enno Ohlebusch. A Taxonomy of LCP-Array Construction Algorithms. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 8:1-8:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{fischer_et_al:OASIcs.Manzini.8,
  author =	{Fischer, Johannes and Ohlebusch, Enno},
  title =	{{A Taxonomy of LCP-Array Construction Algorithms}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{8:1--8:17},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.8},
  URN =		{urn:nbn:de:0030-drops-239166},
  doi =		{10.4230/OASIcs.Manzini.8},
  annote =	{Keywords: longest common prefix array, suffix array, Burrows-Wheeler transform}
}

Document

DOI: 10.4230/LIPIcs.SEA.2025.14

Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing

Authors: Diego Díaz-Domínguez

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)

Abstract

We present compression algorithms designed to process terabyte-sized datasets in parallel. Our approach builds on locally consistent grammars, a lightweight form of compression, combined with simple post-processing techniques to achieve further space reductions. Locally consistent grammar algorithms are suitable for scaling as they need minimal satellite information to compact the text, but they are not inherently parallel. To enable parallelisation, we introduce a novel concept that we call stable local consistency. A grammar algorithm ALG is stable if for any pattern P occurring in a collection 𝒯 = {T_1, T_2, …, T_k}, instances ALG(T_1), ALG(T_2), …, ALG(T_k) independently produce cores for P with the same topology. In a locally consistent grammar, the core of P is a subset of nodes and edges in the parse tree of 𝒯 that remains the same in all the occurrences of P. This feature enables compression, but it only holds if ALG defines a common set of nonterminal symbols for the strings. Stability removes this restriction, allowing us to run ALG(T_1), ALG(T_2), …, ALG(T_k) in parallel and subsequently merge their grammars into a single output equivalent to that of ALG(𝒯). We implemented our ideas and tested them on massive datasets. Our experiments showed that our method could process 7.9 TB of bacterial genomes in around nine hours, using 16 threads and 0.43 bits/symbol of working memory, achieving a compression ratio of 85x.

Cite as

Diego Díaz-Domínguez. Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing. In 23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 14:1-14:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{diazdominguez:LIPIcs.SEA.2025.14,
  author =	{D{\'\i}az-Dom{\'\i}nguez, Diego},
  title =	{{Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{14:1--14:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.14},
  URN =		{urn:nbn:de:0030-drops-232525},
  doi =		{10.4230/LIPIcs.SEA.2025.14},
  annote =	{Keywords: Grammar compression, locally consistent parsing, hashing}
}

Document

DOI: 10.4230/LIPIcs.SEA.2025.15

Succinct Rank Dictionaries Revisited

Authors: Saska Dönges and Simon J. Puglisi

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)

Abstract

We study data structures for representing sets of m elements drawn from the universe [0..n-1] that support access and rank queries. A classical approach to this problem, foundational to the fields of succinct and compact data structures, is to represent the set as a bitvector X of n bits, where X[i] = 1 iff i is a member of the set. Our particular focus in this paper is on structures taking log₂{n choose m} + o(n) bits, which stem from the so-called RRR bitvector scheme (Raman et al., ACM Trans. Alg., 2007). In RRR bitvectors, X is conceptually divided into n/b blocks of b bits each. A block containing c 1 bits is then encoded using log₂ b + log₂{b choose c} bits, where log b bits are used to encode c, and log₂{b choose c} bits are used to say which of the {b choose c} possible combinations the block represents. In all existing RRR implementations the code assigned to a block is its lexicographical rank amongst the {b choose c} combinations of its class. In this paper we explore alternative non-lexicographical assignments of codes to blocks. We show these approaches can lead to faster query times and offer relevant space-time trade-offs in practice compared to state-of-the-art implementations (Gog and Petri, Software, Prac. & Exp., 2014) from the Succinct Data Structures Library.

Cite as

Saska Dönges and Simon J. Puglisi. Succinct Rank Dictionaries Revisited. In 23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 15:1-15:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{donges_et_al:LIPIcs.SEA.2025.15,
  author =	{D\"{o}nges, Saska and Puglisi, Simon J.},
  title =	{{Succinct Rank Dictionaries Revisited}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{15:1--15:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.15},
  URN =		{urn:nbn:de:0030-drops-232530},
  doi =		{10.4230/LIPIcs.SEA.2025.15},
  annote =	{Keywords: data structures, data compression, succinct data structures, compressed data structures, weighted de Bruijn sequence, text indexing, string algorithms}
}

30 Search Results for "Kempa, Dominik"

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Thanks for your feedback!

Could not send message