DROPS

Volume

OASIcs, Volume 131

The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday

Manzini's Festschrift, July 25, 2025, Venice, Italy

Editors: Paolo Ferragina, Travis Gagie, and Gonzalo Navarro

Document

DOI: 10.4230/LIPIcs.STACS.2026.62

Relative Compressed Reverse Suffix Array

Authors: Muhammed Oguzhan Kulekci, Mano Prakash Parthasarathi, Rahul Shah, and Sharma V. Thankachan

Published in: LIPIcs, Volume 364, 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)

Abstract

Suffix trees and suffix arrays are two fundamental data structures in the field of string algorithms. For a string (a.k.a. text or sequence) of length n over an alphabet of size σ, these structures typically require O(n log n) bits of space. The FM-index provides a compressed representation of the suffix array in ≈ n log σ bits, allowing for efficient queries on both the suffix array and its inverse array in near-logarithmic time. In certain applications, such as approximate pattern matching (i.e., with wildcards, mismatches, edits), there is a need to access the suffix array of a text, as well as the suffix array of the text’s reverse. Motivated by this, we explore the possibility of encoding the suffix array of the reversed text in a compact form, assuming the availability of the FM-index for the original text. Our first solution is an O(n)-bit (relative) encoding of the suffix array of the reversed text, with the time for decoding an entry being only O(log^* n) times that of decoding an entry in the text’s suffix array using the FM-index. We then demonstrate how to reduce the space to O(n/κ) bits for a parameter κ, while the multiplicative factor in time becomes approximately O(κ log^* n + κ³). We can also support inverse suffix array and longest common extension queries on the reversed text. These results are achieved through some careful and non-trivial application of various succinct data structure techniques.
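
To make the objects in this abstract concrete, the following is a minimal, uncompressed sketch of the suffix array of a text, the suffix array of its reverse, and the inverse suffix array. It is a naive illustration only and not the paper's relative encoding; the function names `suffix_array` and `inverse_suffix_array` are hypothetical and chosen for this example.

```python
def suffix_array(text):
    """Naive suffix array: start positions of all suffixes in lexicographic order.

    Sorting by the suffix strings themselves is simple but slow; it is only
    meant to show what the array contains (assumption: small inputs).
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse_suffix_array(sa):
    """Inverse suffix array: isa[p] is the rank of the suffix starting at p."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


text = "banana"
sa = suffix_array(text)              # suffix array of the text
sa_rev = suffix_array(text[::-1])    # suffix array of the reversed text
isa_rev = inverse_suffix_array(sa_rev)

print(sa)       # [5, 3, 1, 0, 4, 2]
print(sa_rev)   # [4, 2, 0, 5, 3, 1]
print(isa_rev)  # [2, 5, 1, 4, 0, 3]
```

Both arrays here are stored explicitly in O(n log n) bits, which is the baseline cost the abstract contrasts with its compact encodings.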

Cite as

Muhammed Oguzhan Kulekci, Mano Prakash Parthasarathi, Rahul Shah, and Sharma V. Thankachan. Relative Compressed Reverse Suffix Array. In 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026). Leibniz International Proceedings in Informatics (LIPIcs), Volume 364, pp. 62:1-62:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)

@InProceedings{kulekci_et_al:LIPIcs.STACS.2026.62,
  author =	{Kulekci, Muhammed Oguzhan and Parthasarathi, Mano Prakash and Shah, Rahul and Thankachan, Sharma V.},
  title =	{{Relative Compressed Reverse Suffix Array}},
  booktitle =	{43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)},
  pages =	{62:1--62:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-412-3},
  ISSN =	{1868-8969},
  year =	{2026},
  volume =	{364},
  editor =	{Mahajan, Meena and Manea, Florin and McIver, Annabelle and Thắng, Nguy\~{ê}n Kim},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2026.62},
  URN =		{urn:nbn:de:0030-drops-255512},
  doi =		{10.4230/LIPIcs.STACS.2026.62},
  annote =	{Keywords: String Matching, Text Indexing, Data Structures, Suffix Trees}
}

Document

DOI: 10.4230/LIPIcs.STACS.2026.68

Dynamic Pattern Matching with Wildcards

Authors: Arshia Ataee Naeini, Amir-Parsa Mobed, Masoud Seddighin, and Saeed Seddighin

Published in: LIPIcs, Volume 364, 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)

Abstract

We study the fully dynamic pattern matching problem where the pattern may contain up to k wildcard symbols, each matching any symbol of the alphabet. Both the text and the pattern are subject to updates (insert, delete, change). We design an algorithm with 𝒪(n log² n) preprocessing and update/query time 𝒪̃(kn^{k/(k+1)} + k² log n). The bound is truly sublinear for a constant k, and sublinear when k = o(log n). We further complement our results with a conditional lower bound: assuming subquadratic preprocessing time, achieving truly sublinear update time for the case k = Ω(log n) would contradict the Strong Exponential Time Hypothesis (SETH). Finally, we develop sublinear algorithms for two special cases:
- If the pattern contains w non-wildcard symbols, we give an algorithm with preprocessing time 𝒪(nw) and update time 𝒪(w + log n), which is truly sublinear whenever w is truly sublinear.
- Using the FFT technique combined with block decomposition, we design a deterministic truly sublinear algorithm with preprocessing time 𝒪(n^{1.8}) and update time 𝒪(n^{0.8} log n) for the case that there are at most two non-wildcards.
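
As a point of reference for the problem statement above, here is a naive static matcher for patterns with wildcards. It is only a sketch of the matching semantics (each wildcard matches any symbol) and does not reflect the paper's dynamic algorithm or its time bounds; the `?` wildcard character and the name `wildcard_matches` are assumptions made for this example.

```python
WILDCARD = "?"  # assumed wildcard symbol; the abstract does not fix one


def wildcard_matches(text, pattern):
    """Return all start positions where pattern matches text.

    A wildcard position matches any symbol; every other position must match
    exactly. Runs in O(len(text) * len(pattern)) time.
    """
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(p == WILDCARD or p == text[i + j] for j, p in enumerate(pattern))
    ]


print(wildcard_matches("abcabd", "ab?"))  # [0, 3]
print(wildcard_matches("abcabd", "a?c"))  # [0]
```

After an update to the text or pattern, this baseline would simply recompute everything from scratch, which is the linear-or-worse cost that sublinear update times aim to avoid.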

Cite as

Arshia Ataee Naeini, Amir-Parsa Mobed, Masoud Seddighin, and Saeed Seddighin. Dynamic Pattern Matching with Wildcards. In 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026). Leibniz International Proceedings in Informatics (LIPIcs), Volume 364, pp. 68:1-68:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)

@InProceedings{naeini_et_al:LIPIcs.STACS.2026.68,
  author =	{Naeini, Arshia Ataee and Mobed, Amir-Parsa and Seddighin, Masoud and Seddighin, Saeed},
  title =	{{Dynamic Pattern Matching with Wildcards}},
  booktitle =	{43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)},
  pages =	{68:1--68:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-412-3},
  ISSN =	{1868-8969},
  year =	{2026},
  volume =	{364},
  editor =	{Mahajan, Meena and Manea, Florin and McIver, Annabelle and Thắng, Nguy\~{ê}n Kim},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2026.68},
  URN =		{urn:nbn:de:0030-drops-255579},
  doi =		{10.4230/LIPIcs.STACS.2026.68},
  annote =	{Keywords: pattern matching, wildcards, dynamic algorithms, string algorithms, data structures}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2025.25

Structural Parameterizations of Simultaneous Planarity

Authors: Thomas Depian, Simon D. Fink, Alexander Firbas, Robert Ganian, Matthias Pfretzschner, and Ignaz Rutter

Published in: LIPIcs, Volume 359, 36th International Symposium on Algorithms and Computation (ISAAC 2025)

Abstract

Given a set of graphs on the same vertex set, the problem Simultaneous Embedding With Fixed Edges (SEFE) asks whether there exist planar drawings of all input graphs such that every pair of drawings coincides on their shared subgraph. It is known that SEFE is NP-complete [Elisabeth Gassner et al., 2006], even in the so-called sunflower case, where all pairs of input graphs have the same shared graph G_∩ [Marcus Schaefer, 2012]. Fink, Pfretzschner, and Rutter [Simon D. Fink et al., 2023] recently initiated the study of the parameterized complexity of SEFE in the sunflower case, mainly focusing on structural parameters of G_∩. In this work, we shift the focus towards parameters of the union graph G_∪ that contains the edges of all input graphs. On the positive side, we establish fixed-parameter tractability for the problem with respect to the feedback edge set number of G_∪. We complement this result by showing that it, surprisingly, remains NP-complete even if G_∪ has constant vertex cover number. These results settle two open questions posed by Fink et al. [Simon D. Fink et al., 2023].

Cite as

Thomas Depian, Simon D. Fink, Alexander Firbas, Robert Ganian, Matthias Pfretzschner, and Ignaz Rutter. Structural Parameterizations of Simultaneous Planarity. In 36th International Symposium on Algorithms and Computation (ISAAC 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 359, pp. 25:1-25:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{depian_et_al:LIPIcs.ISAAC.2025.25,
  author =	{Depian, Thomas and Fink, Simon D. and Firbas, Alexander and Ganian, Robert and Pfretzschner, Matthias and Rutter, Ignaz},
  title =	{{Structural Parameterizations of Simultaneous Planarity}},
  booktitle =	{36th International Symposium on Algorithms and Computation (ISAAC 2025)},
  pages =	{25:1--25:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-408-6},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{359},
  editor =	{Chen, Ho-Lin and Hon, Wing-Kai and Tsai, Meng-Tsung},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2025.25},
  URN =		{urn:nbn:de:0030-drops-249332},
  doi =		{10.4230/LIPIcs.ISAAC.2025.25},
  annote =	{Keywords: SEFE, Simultaneous Planarity, Fixed-Parameter Tractability, NP-hardness}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2025.31

Compressibility Measures and Succinct Data Structures for Piecewise Linear Approximations

Authors: Paolo Ferragina and Filippo Lari

Published in: LIPIcs, Volume 359, 36th International Symposium on Algorithms and Computation (ISAAC 2025)

Abstract

We study the problem of deriving compressibility measures for Piecewise Linear Approximations (PLAs), i.e., error-bounded approximations of a set of two-dimensional increasing data points using a sequence of segments. Such approximations are widely used tools in implementing many learned data structures, which mix learning models with traditional algorithmic design blocks to exploit regularities in the underlying data distribution, providing novel and effective space-time trade-offs. We introduce the first lower bounds to the cost of storing PLAs in two settings, namely compression and indexing. We then compare these compressibility measures to known data structures, and show that they are asymptotically optimal up to a constant factor from the space lower bounds. Finally, we design the first data structures for the aforementioned settings that achieve the space lower bounds plus small additive terms, which turn out to be succinct in most practical cases. Our data structures support the efficient retrieval and evaluation of a segment in the (compressed) PLA for a given x-value, which is a core operation in any learned data structure relying on PLAs. As a result, our paper offers the first theoretical analysis of the maximum compressibility achievable by PLA-based learned data structures, and provides novel storage schemes for PLAs offering strong theoretical guarantees while also suggesting simple and efficient practical implementations.

Cite as

Paolo Ferragina and Filippo Lari. Compressibility Measures and Succinct Data Structures for Piecewise Linear Approximations. In 36th International Symposium on Algorithms and Computation (ISAAC 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 359, pp. 31:1-31:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{ferragina_et_al:LIPIcs.ISAAC.2025.31,
  author =	{Ferragina, Paolo and Lari, Filippo},
  title =	{{Compressibility Measures and Succinct Data Structures for Piecewise Linear Approximations}},
  booktitle =	{36th International Symposium on Algorithms and Computation (ISAAC 2025)},
  pages =	{31:1--31:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-408-6},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{359},
  editor =	{Chen, Ho-Lin and Hon, Wing-Kai and Tsai, Meng-Tsung},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2025.31},
  URN =		{urn:nbn:de:0030-drops-249397},
  doi =		{10.4230/LIPIcs.ISAAC.2025.31},
  annote =	{Keywords: Piecewise Linear Approximations, Succinct Data Structures, Lower Bounds}
}

Document

DOI: 10.4230/LIPIcs.ESA.2025.47

Fast and Lightweight Distributed Suffix Array Construction

Authors: Manuel Haag, Florian Kurpicz, Peter Sanders, and Matthias Schimek

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)

Abstract

The suffix array contains the lexicographical order of all suffixes of a text. It is one of the most well-studied text indices with applications in bioinformatics, compression, and pattern matching. The main bottleneck of distributed-memory suffix array construction algorithms is their memory requirements. Even careful implementations require 30×-60× the input size as working memory. We present a scalable and lightweight distributed-memory adaptation of the difference cover (DCX) suffix array construction algorithm. Our approach relies on novel bucketing and random chunk redistribution techniques which reduce our memory requirement to 20×-26× the input size for medium-sized inputs and to 14×-15× for large-sized inputs. Regarding running time, we achieve speedups of up to 5× over current state-of-the-art distributed suffix array construction algorithms.

Cite as

Manuel Haag, Florian Kurpicz, Peter Sanders, and Matthias Schimek. Fast and Lightweight Distributed Suffix Array Construction. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 47:1-47:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{haag_et_al:LIPIcs.ESA.2025.47,
  author =	{Haag, Manuel and Kurpicz, Florian and Sanders, Peter and Schimek, Matthias},
  title =	{{Fast and Lightweight Distributed Suffix Array Construction}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{47:1--47:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.47},
  URN =		{urn:nbn:de:0030-drops-245154},
  doi =		{10.4230/LIPIcs.ESA.2025.47},
  annote =	{Keywords: Distributed Computing, Suffix Array Construction}
}

Document

DOI: 10.4230/LIPIcs.ESA.2025.94

Faster Algorithm for Bounded Tree Edit Distance in the Low-Distance Regime

Authors: Tomasz Kociumaka and Ali Shahali

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)

Abstract

The tree edit distance is a natural dissimilarity measure between rooted ordered trees whose nodes are labeled over an alphabet Σ. It is defined as the minimum number of node edits - insertions, deletions, and relabelings - required to transform one tree into the other. The weighted variant assigns costs ≥ 1 to edits (based on node labels), minimizing total cost rather than edit count. The unweighted tree edit distance between two trees of total size n can be computed in 𝒪(n^{2.6857}) time; in contrast, determining the weighted tree edit distance is fine-grained equivalent to the All-Pairs Shortest Paths (APSP) problem and requires n³/2^{Ω(√(log n))} time [Nogler, Polak, Saha, Vassilevska Williams, Xu, Ye; STOC'25]. These impractical super-quadratic times for large, similar trees motivate the bounded version, parameterizing runtime by the distance k to enable faster algorithms for k ≪ n. Prior algorithms for bounded unweighted edit distance achieve 𝒪(nk² log n) [Akmal & Jin; ICALP'21] and 𝒪(n + k⁷ log k) [Das, Gilbert, Hajiaghayi, Kociumaka, Saha; STOC'23]. For the weighted variant, only 𝒪(n + k^{15}) is known [Das, Gilbert, Hajiaghayi, Kociumaka, Saha; STOC'23]. We present an 𝒪(n + k⁶ log k)-time algorithm for bounded tree edit distance in both the weighted and unweighted settings. First, we devise a simpler weighted 𝒪(nk² log n)-time algorithm. Next, we exploit periodic structures in input trees via an optimized universal kernel: modifying prior 𝒪(n)-time 𝒪(k⁵)-size kernels to generate such structured instances, enabling efficient analysis.

Cite as

Tomasz Kociumaka and Ali Shahali. Faster Algorithm for Bounded Tree Edit Distance in the Low-Distance Regime. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 94:1-94:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{kociumaka_et_al:LIPIcs.ESA.2025.94,
  author =	{Kociumaka, Tomasz and Shahali, Ali},
  title =	{{Faster Algorithm for Bounded Tree Edit Distance in the Low-Distance Regime}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{94:1--94:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.94},
  URN =		{urn:nbn:de:0030-drops-245634},
  doi =		{10.4230/LIPIcs.ESA.2025.94},
  annote =	{Keywords: tree edit distance, edit distance, kernelization, dynamic programming}
}

Document

DOI: 10.4230/LIPIcs.ESA.2025.5

External-Memory Priority Queues with Optimal Insertions

Authors: Gerth Stølting Brodal, Michael T. Goodrich, John Iacono, Jared Lo, Ulrich Meyer, Victor Pagan, Nodari Sitchinava, and Rolf Svenning

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)

Abstract

We present an external-memory priority queue structure supporting Insert and DeleteMin with amortized 𝒪(1) and 𝒪(lg N) comparisons, respectively, and amortized 𝒪(1/B) and 𝒪((1/B) log_{M/B}(N/B)) I/Os, respectively. Here, M is the size of the internal memory, B is the block size of I/Os between internal and external memory, and N is the number of elements in the priority queue just before an operation is performed. Previous external-memory priority queues required amortized 𝒪(lg N) comparisons and 𝒪((1/B) log_{M/B}(N/B)) I/Os for both Insert and DeleteMin. The construction requires the minimal assumption M ≥ 2B.

Cite as

Gerth Stølting Brodal, Michael T. Goodrich, John Iacono, Jared Lo, Ulrich Meyer, Victor Pagan, Nodari Sitchinava, and Rolf Svenning. External-Memory Priority Queues with Optimal Insertions. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 5:1-5:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{brodal_et_al:LIPIcs.ESA.2025.5,
  author =	{Brodal, Gerth St{\o}lting and Goodrich, Michael T. and Iacono, John and Lo, Jared and Meyer, Ulrich and Pagan, Victor and Sitchinava, Nodari and Svenning, Rolf},
  title =	{{External-Memory Priority Queues with Optimal Insertions}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{5:1--5:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.5},
  URN =		{urn:nbn:de:0030-drops-244734},
  doi =		{10.4230/LIPIcs.ESA.2025.5},
  annote =	{Keywords: priority queues, external memory, cache aware, amortized complexity}
}

Document

DOI: 10.4230/LIPIcs.ESA.2025.18

Efficiency of Learned Indexes on Genome Spectra

Authors: Md. Hasin Abrar, Paul Medvedev, and Giorgio Vinciguerra

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)

Abstract

Data structures on a multiset of genomic k-mers are at the heart of many bioinformatic tools. As genomic datasets grow in scale, the efficiency of these data structures increasingly depends on how well they leverage the inherent patterns in the data. One recent and effective approach is the use of learned indexes that approximate the rank function of a multiset using a piecewise linear function with very few segments. However, theoretical worst-case analysis struggles to predict the practical performance of these indexes. We address this limitation by developing a novel measure of piecewise-linear approximability of the data, called CaPLa (Canonical Piecewise Linear approximability). CaPLa builds on the empirical observation that a power-law model often serves as a reasonable proxy for piecewise linear-approximability, while explicitly accounting for deviations from a true power-law fit. We prove basic properties of CaPLa and present an efficient algorithm to compute it. We then demonstrate that CaPLa can accurately predict space bounds for data structures on real data. Empirically, we analyze over 500 genomes through the lens of CaPLa, revealing that it varies widely across the tree of life and even within individual genomes. Finally, we study the robustness of CaPLa as a measure and the factors that make genomic k-mer multisets different from random ones.

Cite as

Md. Hasin Abrar, Paul Medvedev, and Giorgio Vinciguerra. Efficiency of Learned Indexes on Genome Spectra. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 18:1-18:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{abrar_et_al:LIPIcs.ESA.2025.18,
  author =	{Abrar, Md. Hasin and Medvedev, Paul and Vinciguerra, Giorgio},
  title =	{{Efficiency of Learned Indexes on Genome Spectra}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{18:1--18:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.18},
  URN =		{urn:nbn:de:0030-drops-244865},
  doi =		{10.4230/LIPIcs.ESA.2025.18},
  annote =	{Keywords: Genome spectra, piecewise linear approximation, learned index, k-mers}
}

Document

DOI: 10.4230/LIPIcs.ESA.2025.60

Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars

Authors: Jannik Olbrich

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)

Abstract

The Burrows-Wheeler Transform (BWT) serves as the basis for many important sequence indexes. On very large datasets (e.g. genomic databases), classical BWT construction algorithms are often infeasible because they usually need to have the entire dataset in main memory. Fortunately, such large datasets are often highly repetitive. It can thus be beneficial to compute the BWT from a compressed representation. We propose an algorithm for computing the BWT via the Lyndon straight-line program, a grammar based on the standard factorization of Lyndon words. Our algorithm can also be used to compute the extended BWT (eBWT) of a multiset of sequences. We empirically evaluate our implementation and find that we can compute the BWT and eBWT of very large datasets faster and/or with less memory than competing methods.

Cite as

Jannik Olbrich. Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 60:1-60:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{olbrich:LIPIcs.ESA.2025.60,
  author =	{Olbrich, Jannik},
  title =	{{Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{60:1--60:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.60},
  URN =		{urn:nbn:de:0030-drops-245286},
  doi =		{10.4230/LIPIcs.ESA.2025.60},
  annote =	{Keywords: Burrows-Wheeler Transform, Grammar compression}
}

Document

DOI: 10.4230/LIPIcs.ESA.2025.64

A Dynamic Piecewise-Linear Geometric Index with Worst-Case Guarantees

Authors: Emil Toftegaard Gæde, Ivor van der Hoog, Eva Rotenberg, and Tord Stordalen

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)

Abstract

Indexing data is a fundamental problem in computer science. The input is a set S of n distinct integers from a universe 𝒰. Indexing queries take a value q ∈ 𝒰 and return the membership, predecessor or rank of q in S. A range query takes two values q, r ∈ 𝒰 and returns the set S ∩ [q,r]. Recently, various papers have studied a special case where the input data behaves in an approximately piecewise-linear way. Given the sorted (rank,value) pairs, and given some constant ε, one wants to maintain a small number of axis-disjoint line-segments such that, for each rank, the value is within ± ε of the corresponding line-segment. Ferragina and Vinciguerra (VLDB 2020) observe that this geometric problem is useful for solving indexing problems, particularly when the number of line-segments is small compared to the size of the dataset. We study the dynamic version of this geometric problem. In the dynamic setting, inserting or deleting just one data point may cause up to three line-segments to be merged, or one line-segment to be split at most three-way. To determine and compute this, we use techniques from dynamic maintenance of convex hulls, and provide new algorithms with worst-case guarantees, including an O(log n) algorithm to compute a separating line between two non-intersecting convex hulls - an operation previously missing from the literature. We then use our fully-dynamic geometry-based subroutine in an indexing data structure, combining it with a natural hashing technique. The resulting indexing data structure has theoretically efficient worst-case guarantees in expectation. We compare its practical performance to the solution of Ferragina and Vinciguerra, which was shown to perform better in certain structured settings [Sun, Zhou, Li VLDB 2023]. Our empirical analysis shows that our solution supports more efficient range queries in the special case where the update sequence contains many deletions.

Cite as

Emil Toftegaard Gæde, Ivor van der Hoog, Eva Rotenberg, and Tord Stordalen. A Dynamic Piecewise-Linear Geometric Index with Worst-Case Guarantees. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 64:1-64:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{gaede_et_al:LIPIcs.ESA.2025.64,
  author =	{G{\ae}de, Emil Toftegaard and van der Hoog, Ivor and Rotenberg, Eva and Stordalen, Tord},
  title =	{{A Dynamic Piecewise-Linear Geometric Index with Worst-Case Guarantees}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{64:1--64:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.64},
  URN =		{urn:nbn:de:0030-drops-245323},
  doi =		{10.4230/LIPIcs.ESA.2025.64},
  annote =	{Keywords: Algorithms Engineering, Data Structures, Indexing, Convex Hulls}
}

Document

DOI: 10.4230/LIPIcs.MFCS.2025.48

Generalized De Bruijn Words, Invertible Necklaces, and the Burrows-Wheeler Transform

Authors: Gabriele Fici and Estéban Gabory

Published in: LIPIcs, Volume 345, 50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)

Abstract

We define generalized de Bruijn words as those words having a Burrows-Wheeler transform that is a concatenation of permutations of the alphabet. We show that generalized de Bruijn words are in 1-to-1 correspondence with Hamiltonian cycles in the generalized de Bruijn graphs, introduced in the early '80s in the context of network design. When the size of the alphabet is a prime p, we define invertible necklaces as those whose BWT-matrix is non-singular. We show that invertible necklaces of length n correspond to normal bases of the finite field 𝔽_{pⁿ}, and that they form an Abelian group isomorphic to the Reutenauer group RG_pⁿ. Using known results in abstract algebra, we can make a bridge between generalized de Bruijn words and invertible necklaces. In particular, we highlight a correspondence between binary de Bruijn words of order d+1, binary necklaces of length 2^{d} having an odd number of 1’s, invertible BWT matrices of size 2^{d}× 2^{d}, and normal bases of the finite field 𝔽_{2^{2^{d}}}.
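
The defining property in the first sentence of this abstract can be checked directly on small examples. The sketch below computes the Burrows-Wheeler transform from sorted rotations and tests whether it splits into consecutive blocks that are each a permutation of the alphabet. It follows the abstract's wording only; the paper may impose further conditions, and the names `bwt` and `is_generalized_de_bruijn` are hypothetical.

```python
def bwt(word):
    """Burrows-Wheeler transform: last column of the sorted rotation matrix."""
    n = len(word)
    rotations = sorted(word[i:] + word[:i] for i in range(n))
    return "".join(rotation[-1] for rotation in rotations)


def is_generalized_de_bruijn(word, alphabet):
    """Check whether bwt(word) is a concatenation of permutations of alphabet.

    Assumption: alphabet is a string of distinct symbols, e.g. "01".
    """
    k = len(alphabet)
    if len(word) % k != 0:
        return False
    transformed = bwt(word)
    target = sorted(alphabet)
    return all(
        sorted(transformed[i:i + k]) == target
        for i in range(0, len(transformed), k)
    )


print(bwt("0011"))                             # 1010
print(is_generalized_de_bruijn("0011", "01"))  # True
print(is_generalized_de_bruijn("0101", "01"))  # False
```

Here "0011" is the binary de Bruijn word of order 2, and its transform "1010" consists of two blocks, each a permutation of {0, 1}.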

Cite as

Gabriele Fici and Estéban Gabory. Generalized De Bruijn Words, Invertible Necklaces, and the Burrows-Wheeler Transform. In 50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 345, pp. 48:1-48:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{fici_et_al:LIPIcs.MFCS.2025.48,
  author =	{Fici, Gabriele and Gabory, Est\'{e}ban},
  title =	{{Generalized De Bruijn Words, Invertible Necklaces, and the Burrows-Wheeler Transform}},
  booktitle =	{50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)},
  pages =	{48:1--48:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-388-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{345},
  editor =	{Gawrychowski, Pawe{\l} and Mazowiecki, Filip and Skrzypczak, Micha{\l}},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2025.48},
  URN =		{urn:nbn:de:0030-drops-241555},
  doi =		{10.4230/LIPIcs.MFCS.2025.48},
  annote =	{Keywords: Burrows-Wheeler Transform, Generalized de Bruijn Word, Generalized de Bruijn Graph, Circulant Matrix, Invertible Necklace, Sandpile Group, Reutenauer Group}
}

Document

DOI: 10.4230/LIPIcs.WABI.2025.17

An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT

Authors: Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

String matching problems in bioinformatics are typically concerned with finding exact substring matches between a query and a reference text. Previous formulations often focus on maximal exact matches (MEMs). However, multiple occurrences of substrings of the query in the text that are long enough but not maximal may not be captured by MEMs. Such long matches can be informative, especially when the text is a collection of similar sequences such as genomes. In this paper, we describe a new type of match between a pattern and a text that is not necessarily maximal in the query, but still contains useful matching information: locally maximal exact matches (LEMs). There are usually a large number of LEMs, so we only consider those above some length threshold ℒ. These are referred to as long LEMs. The purpose of long LEMs is to capture substring matches between a query and a text that are not necessarily maximal in the pattern but still long enough to be important. Therefore, efficient algorithms for finding long LEMs are desired for these datasets. However, these datasets are too large to query on traditional string indexes. Fortunately, these datasets are very repetitive. Recently, compressed string indexes that take advantage of the redundancy in the data but retain efficient querying capability have been proposed as a solution. We therefore give an efficient algorithm for computing all the long LEMs of a query and a text in a BWT runs compressed string index. We describe an O(m+occ) expected time algorithm that relies on an O(r) words space string index for outputting all long LEMs of a pattern with respect to a text given the matching statistics of the pattern with respect to the text. Here m is the length of the query, occ is the number of long LEMs outputted, and r is the number of runs in the BWT of the text. The O(r) space string index we describe relies on an adaptation of the move data structure by Nishimoto and Tabei.
We are able to support LCP[i] queries in constant time given SA[i]. In other words, we answer PLCP[i] queries in constant time. These PLCP queries enable the efficient long LEM query. Long LEMs may provide useful similarity information between a pattern and a text that MEMs may ignore. This information is particularly useful in pangenome and biobank scale haplotype panel contexts.
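
The relation between LCP and PLCP mentioned in the abstract can be illustrated with a small, naive sketch. This is not the paper's O(r)-space index or its constant-time query. It only shows, on an uncompressed toy text, that PLCP is the LCP array permuted into text order, so that PLCP[SA[i]] = LCP[i]. The function names and the example text "banana$" are assumptions chosen for illustration.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions lexicographically.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # lcp[i] = length of the longest common prefix of the suffixes
    # starting at sa[i - 1] and sa[i]; lcp[0] is 0 by convention.
    def common_prefix(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n

    return [0] + [common_prefix(sa[i - 1], sa[i]) for i in range(1, len(sa))]


def plcp_array(sa, lcp):
    # PLCP is LCP indexed by text position: plcp[sa[i]] = lcp[i].
    plcp = [0] * len(sa)
    for i, pos in enumerate(sa):
        plcp[pos] = lcp[i]
    return plcp


text = "banana$"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
plcp = plcp_array(sa, lcp)
```

Given SA[i], looking up plcp[SA[i]] returns LCP[i] directly, which is the sense in which a PLCP query answers an LCP query once the suffix array value is known.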

Cite as

Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang. An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 17:1-17:25, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{sanaullah_et_al:LIPIcs.WABI.2025.17,
  author =	{Sanaullah, Ahsan and Zhi, Degui and Zhang, Shaojie},
  title =	{{An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{17:1--17:25},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.17},
  URN =		{urn:nbn:de:0030-drops-239433},
  doi =		{10.4230/LIPIcs.WABI.2025.17},
  annote =	{Keywords: BWT, LEM, Long LEM, MEM, Run Length Compressed BWT, Move Data Structure, Pangenome}
}

Document

Invited Talk

DOI: 10.4230/LIPIcs.WABI.2025.2

We Are What We Index; a Primer for the Wheeler Graph Era (Invited Talk)

Authors: Ben Langmead

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

Since the arrival of second-generation sequencing, we have needed to build indexes over reference sequences - e.g. genomes and transcriptomes - in order to solve read alignment and classification problems efficiently [Langmead et al., 2009; Li and Durbin, 2009; Li et al., 2009]. The rule has been: what we can index determines what we can do. When indexing strings, we can use methods like suffix arrays [Manber and Myers, 1993], the Burrows-Wheeler Transform (BWT) [Burrows and Wheeler, 1994] / FM Index [Ferragina and Manzini, 2000], or k-mer indexes [Marchet et al., 2021]. What if we want to index objects more complex than strings? A pangenome, for example, is a large collection of similar strings, e.g. the hundreds of assemblies that make up the Human Pangenome Reference [Liao et al., 2023] or all the bacteria in the Refseq database [Goldfarb et al., 2025]. We may wish to combine these strings into a multiple sequence alignment (MSA) or a graph first. Can we index those efficiently? In many useful cases the answer is "yes," but in others the answer is "no." The story of how we learned exactly when the answer is "yes" versus "no" unfolded through a sequence of insights. Here we review this story, eventually arriving at the definition of Wheeler graphs as discovered and formalized by Gagie, Manzini and Sirén [Gagie et al., 2017]. We will focus on indexes based on the BWT, since these (a) are lossless full-text indexes, (b) are widely used in practice [Langmead et al., 2009; Li and Durbin, 2009], and (c) form the theoretical throughline for all the indexing strategies on the path to Wheeler graphs. We will trace the BWT-based indexing story from the early days of the FM Index, through its step-by-step gobbling up of trees (XBW-transform [Ferragina et al., 2005]) and de Bruijn graphs (BOSS representation [Bowe et al., 2012]), and to the eventual formalization of Wheeler graphs [Gagie et al., 2017].
Along the way, we will define and update our notions of what it means to track a consecutive range of elements in the structure, and what it means for an index to be efficient. We will also connect these notions to automata [Sipser, 1996], noting how the indexability of Wheeler graphs (also called Wheeler automata) is connected to the mechanics of how to efficiently represent and simulate a finite automaton [Alanko et al., 2021]. With this context, we can imagine improved indexes for the future of genomics and pangenomics. De Bruijn graphs are extremely practical and are the most widely used among the non-string data structures that are also Wheeler graphs. But we might prefer other options. For example, de Bruijn graphs have the undesirable property that they usually encode not only the true longer-than-k substrings of the original text, but also "false" substrings that span repeats. Related to this, paths through the de Bruijn graph can "glue" substrings together that are horizontally distant in the MSA. Could other Wheeler graphs be practical alternatives to de Bruijn graphs? For instance, the original GCSA study by Sirén, Välimäki and Mäkinen proposed a way to convert a multiple alignment into an automaton that either is a Wheeler graph or can be made into one [Sirén et al., 2014]. This warrants further exploration, possibly with the help of improved tools for solving the NP-complete problem of recognizing whether a graph is a Wheeler graph [Chao et al., 2023]. The notion of BWT tunnels [Baier, 2018] gives another route: we can begin with a concatenated pangenome string and compress it by identifying and collapsing BWT tunnels. This yields a Wheeler graph that is compressed like the de Bruijn graph, but without departing from the exact contents or coordinate systems of the original genomes.
The future might need us to explore all these Wheeler-graph indexes, along with the also highly practical and always-improving world of indexes built over collections of strings [Gagie et al., 2018].

Cite as

Ben Langmead. We Are What We Index; a Primer for the Wheeler Graph Era (Invited Talk). In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 2:1-2:2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{langmead:LIPIcs.WABI.2025.2,
  author =	{Langmead, Ben},
  title =	{{We Are What We Index; a Primer for the Wheeler Graph Era}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{2:1--2:2},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.2},
  URN =		{urn:nbn:de:0030-drops-239288},
  doi =		{10.4230/LIPIcs.WABI.2025.2},
  annote =	{Keywords: Indexing, Burrows-Wheeler Transform}
}

Document

Research

DOI: 10.4230/OASIcs.Grossi.14

Specific Patterns Against Reference Sequences

Authors: Marie-Pierre Béal and Maxime Crochemore

Published in: OASIcs, Volume 132, From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday (2025)

Abstract

We design alignment-free techniques for comparing a set of sequences or just a word, called a target, against another set of words, called a reference. This is done with the detection of factor patterns that distinguish the target from the reference. A target-specific factor of a target T against a reference R is then a factor w of a word in T that is not a factor of a word in R but all of whose proper factors are factors of words in R. The strategy is based on the notion of minimal absent/forbidden words. We first address the computation of the set of target-specific factors of a target T against a reference R, where T and R are finite sets of sequences. The result is the construction of an automaton accepting the set of all considered target-specific factors. The construction algorithm runs in linear time according to the size of T ∪ R. The second result is the design of an algorithm to compute all the occurrences in a single sequence T of its target-specific factors against a reference R. The algorithm runs in real-time on the target sequence, independently of the number of occurrences of target-specific factors.
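
The definition of a target-specific factor can be made concrete with a brute-force sketch. It is only an illustration of the definition, not the linear-time automaton construction or the real-time algorithm described in the abstract. The function names are hypothetical, the reference is assumed to be non-empty (so that the empty word counts as a factor of it), and the check on proper factors uses the observation that it suffices to test the two longest ones, w without its first letter and w without its last letter, since the set of factors of R is closed under taking factors.

```python
def factors(words):
    # All non-empty factors (substrings) of the words in the given set.
    return {
        w[i:j]
        for w in words
        for i in range(len(w))
        for j in range(i + 1, len(w) + 1)
    }


def target_specific_factors(target, reference):
    # Assumption: reference is non-empty, so the empty word is a factor of it.
    ref = factors(reference) | {""}
    return {
        w
        for w in factors(target)
        # w must be absent from the reference, while its two longest
        # proper factors (hence all proper factors) occur in the reference.
        if w not in ref and w[1:] in ref and w[:-1] in ref
    }


example = target_specific_factors({"abba"}, {"abab"})
```

Here the only target-specific factor of "abba" against "abab" is "bb": it does not occur in "abab", while its proper factor "b" does. Longer absent factors such as "abb" are excluded because they contain "bb", which is itself absent from the reference.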

Cite as

Marie-Pierre Béal and Maxime Crochemore. Specific Patterns Against Reference Sequences. In From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 132, pp. 14:1-14:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{beal_et_al:OASIcs.Grossi.14,
  author =	{B\'{e}al, Marie-Pierre and Crochemore, Maxime},
  title =	{{Specific Patterns Against Reference Sequences}},
  booktitle =	{From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday},
  pages =	{14:1--14:12},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-391-1},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{132},
  editor =	{Conte, Alessio and Marino, Andrea and Rosone, Giovanna and Vitter, Jeffrey Scott},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Grossi.14},
  URN =		{urn:nbn:de:0030-drops-238130},
  doi =		{10.4230/OASIcs.Grossi.14},
  annote =	{Keywords: Specific pattern, Minimal absent word, Minimal forbidden word, Directed Acyclic Word Graph (DAWG), Suffix automaton}
}

58 Search Results for "Ferragina, Paolo"
