DROPS

Document

DOI: 10.4230/LIPIcs.STACS.2026.62

Relative Compressed Reverse Suffix Array

Authors: Muhammed Oguzhan Kulekci, Mano Prakash Parthasarathi, Rahul Shah, and Sharma V. Thankachan

Published in: LIPIcs, Volume 364, 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)

Abstract

Suffix trees and suffix arrays are two fundamental data structures in the field of string algorithms. For a string (a.k.a. text or sequence) of length n over an alphabet of size σ, these structures typically require O(n log n) bits of space. The FM-index provides a compressed representation of the suffix array in ≈ n log σ bits, allowing for efficient queries on both the suffix array and its inverse array in near-logarithmic time. In certain applications, such as approximate pattern matching (i.e., with wildcards, mismatches, edits), there is a need to access the suffix array of a text, as well as the suffix array of the text’s reverse. Motivated by this, we explore the possibility of encoding the suffix array of the reversed text in a compact form, assuming the availability of the FM-index for the original text. Our first solution is an O(n)-bit (relative) encoding of the suffix array of the reversed text, with the time for decoding an entry being only O(log^* n) times that of decoding an entry in the text’s suffix array using the FM-index. We then demonstrate how to reduce the space to O(n/κ) bits for a parameter κ, while the multiplicative factor in time becomes approximately O(κ log^* n + κ³). We can also support inverse suffix array and longest common extension queries on the reversed text. These results are achieved through careful and non-trivial application of various succinct data structure techniques.
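
To make the objects named in this abstract concrete, the following naive sketch builds the suffix array of a text, its inverse, and the suffix array of the reversed text by direct sorting. It only illustrates the definitions and is not the paper's compressed encoding; the function names and the example string are assumptions chosen for illustration.

```python
def suffix_array(text: str) -> list[int]:
    """Starting positions of the suffixes of `text`, in lexicographic order."""
    # Naive construction by sorting suffixes; adequate for small examples only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse_suffix_array(sa: list[int]) -> list[int]:
    """isa[pos] is the lexicographic rank of the suffix starting at pos."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


text = "banana"
sa = suffix_array(text)            # suffix array of the text
isa = inverse_suffix_array(sa)     # its inverse
sa_rev = suffix_array(text[::-1])  # suffix array of the reversed text
print(sa, isa, sa_rev)
```

The two arrays `sa` and `sa_rev` are stored explicitly here, which costs O(n log n) bits each; the abstract's point is that the second one can be encoded relative to an FM-index of the original text in far less space.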

Cite as

Muhammed Oguzhan Kulekci, Mano Prakash Parthasarathi, Rahul Shah, and Sharma V. Thankachan. Relative Compressed Reverse Suffix Array. In 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026). Leibniz International Proceedings in Informatics (LIPIcs), Volume 364, pp. 62:1-62:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)

@InProceedings{kulekci_et_al:LIPIcs.STACS.2026.62,
  author =	{Kulekci, Muhammed Oguzhan and Parthasarathi, Mano Prakash and Shah, Rahul and Thankachan, Sharma V.},
  title =	{{Relative Compressed Reverse Suffix Array}},
  booktitle =	{43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)},
  pages =	{62:1--62:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-412-3},
  ISSN =	{1868-8969},
  year =	{2026},
  volume =	{364},
  editor =	{Mahajan, Meena and Manea, Florin and McIver, Annabelle and Thắng, Nguyễn Kim},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2026.62},
  URN =		{urn:nbn:de:0030-drops-255512},
  doi =		{10.4230/LIPIcs.STACS.2026.62},
  annote =	{Keywords: String Matching, Text Indexing, Data Structures, Suffix Trees}
}

Document

DOI: 10.4230/LIPIcs.STACS.2026.68

Dynamic Pattern Matching with Wildcards

Authors: Arshia Ataee Naeini, Amir-Parsa Mobed, Masoud Seddighin, and Saeed Seddighin

Published in: LIPIcs, Volume 364, 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)

Abstract

We study the fully dynamic pattern matching problem where the pattern may contain up to k wildcard symbols, each matching any symbol of the alphabet. Both the text and the pattern are subject to updates (insert, delete, change). We design an algorithm with 𝒪(n log² n) preprocessing and update/query time 𝒪̃(kn^{k/(k+1)} + k² log n). The bound is truly sublinear for a constant k, and sublinear when k = o(log n). We further complement our results with a conditional lower bound: assuming subquadratic preprocessing time, achieving truly sublinear update time for the case k = Ω(log n) would contradict the Strong Exponential Time Hypothesis (SETH). Finally, we develop sublinear algorithms for two special cases:
- If the pattern contains w non-wildcard symbols, we give an algorithm with preprocessing time 𝒪(nw) and update time 𝒪(w + log n), which is truly sublinear whenever w is truly sublinear.
- Using the FFT technique combined with block decomposition, we design a deterministic truly sublinear algorithm with preprocessing time 𝒪(n^{1.8}) and update time 𝒪(n^{0.8} log n) for the case that there are at most two non-wildcards.
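
For readers unfamiliar with the matching rule, here is a minimal static baseline for pattern matching with wildcards. It is a naive O(nm) scan that recomputes everything from scratch, not the dynamic algorithm of the paper; the function name and the choice of `?` as the wildcard symbol are assumptions.

```python
def wildcard_occurrences(text: str, pattern: str, wildcard: str = "?") -> list[int]:
    """Return all start positions where `pattern` matches `text`.

    A wildcard position in the pattern matches any single text symbol.
    """
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        # Every pattern symbol must be a wildcard or equal the aligned text symbol.
        if all(p == wildcard or p == t for p, t in zip(pattern, window)):
            hits.append(start)
    return hits


print(wildcard_occurrences("abracadabra", "a?a"))
```

In the dynamic setting described above, rerunning such a scan after each update would cost linear time or more per update, which is the cost the paper's bounds improve on.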

Cite as

Arshia Ataee Naeini, Amir-Parsa Mobed, Masoud Seddighin, and Saeed Seddighin. Dynamic Pattern Matching with Wildcards. In 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026). Leibniz International Proceedings in Informatics (LIPIcs), Volume 364, pp. 68:1-68:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)

@InProceedings{naeini_et_al:LIPIcs.STACS.2026.68,
  author =	{Naeini, Arshia Ataee and Mobed, Amir-Parsa and Seddighin, Masoud and Seddighin, Saeed},
  title =	{{Dynamic Pattern Matching with Wildcards}},
  booktitle =	{43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)},
  pages =	{68:1--68:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-412-3},
  ISSN =	{1868-8969},
  year =	{2026},
  volume =	{364},
  editor =	{Mahajan, Meena and Manea, Florin and McIver, Annabelle and Thắng, Nguyễn Kim},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2026.68},
  URN =		{urn:nbn:de:0030-drops-255579},
  doi =		{10.4230/LIPIcs.STACS.2026.68},
  annote =	{Keywords: pattern matching, wildcards, dynamic algorithms, string algorithms, data structures}
}

Document

DOI: 10.4230/LIPIcs.ESA.2025.94

Faster Algorithm for Bounded Tree Edit Distance in the Low-Distance Regime

Authors: Tomasz Kociumaka and Ali Shahali

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)

Abstract

The tree edit distance is a natural dissimilarity measure between rooted ordered trees whose nodes are labeled over an alphabet Σ. It is defined as the minimum number of node edits (insertions, deletions, and relabelings) required to transform one tree into the other. The weighted variant assigns costs ≥ 1 to edits (based on node labels), minimizing total cost rather than edit count. The unweighted tree edit distance between two trees of total size n can be computed in 𝒪(n^{2.6857}) time; in contrast, determining the weighted tree edit distance is fine-grained equivalent to the All-Pairs Shortest Paths (APSP) problem and requires n³/2^{Ω(√(log n))} time [Nogler, Polak, Saha, Vassilevska Williams, Xu, Ye; STOC'25]. These impractical super-quadratic times for large, similar trees motivate the bounded version, parameterizing runtime by the distance k to enable faster algorithms for k ≪ n. Prior algorithms for bounded unweighted edit distance achieve 𝒪(nk² log n) [Akmal & Jin; ICALP'21] and 𝒪(n + k⁷ log k) [Das, Gilbert, Hajiaghayi, Kociumaka, Saha; STOC'23]. For the weighted variant, only 𝒪(n + k^{15}) is known [Das, Gilbert, Hajiaghayi, Kociumaka, Saha; STOC'23]. We present an 𝒪(n + k⁶ log k)-time algorithm for bounded tree edit distance in both the weighted and unweighted settings. First, we devise a simpler weighted 𝒪(nk² log n)-time algorithm. Next, we exploit periodic structures in input trees via an optimized universal kernel: modifying prior 𝒪(n)-time 𝒪(k⁵)-size kernels to generate such structured instances, enabling efficient analysis.

Cite as

Tomasz Kociumaka and Ali Shahali. Faster Algorithm for Bounded Tree Edit Distance in the Low-Distance Regime. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 94:1-94:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{kociumaka_et_al:LIPIcs.ESA.2025.94,
  author =	{Kociumaka, Tomasz and Shahali, Ali},
  title =	{{Faster Algorithm for Bounded Tree Edit Distance in the Low-Distance Regime}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{94:1--94:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.94},
  URN =		{urn:nbn:de:0030-drops-245634},
  doi =		{10.4230/LIPIcs.ESA.2025.94},
  annote =	{Keywords: tree edit distance, edit distance, kernelization, dynamic programming}
}

Document

DOI: 10.4230/LIPIcs.ESA.2025.60

Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars

Authors: Jannik Olbrich

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)

Abstract

The Burrows-Wheeler Transform (BWT) serves as the basis for many important sequence indexes. On very large datasets (e.g. genomic databases), classical BWT construction algorithms are often infeasible because they usually need to have the entire dataset in main memory. Fortunately, such large datasets are often highly repetitive. It can thus be beneficial to compute the BWT from a compressed representation. We propose an algorithm for computing the BWT via the Lyndon straight-line program, a grammar based on the standard factorization of Lyndon words. Our algorithm can also be used to compute the extended BWT (eBWT) of a multiset of sequences. We empirically evaluate our implementation and find that we can compute the BWT and eBWT of very large datasets faster and/or with less memory than competing methods.
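
As a reminder of what is being computed, the following sketch builds the BWT by sorting all rotations of the text. This is the textbook definition and keeps the whole input in memory, which is exactly the limitation the abstract describes; it is not the Lyndon-grammar algorithm of the paper. The function name and the `$` sentinel are assumptions.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform of `text` via sorted rotations (naive)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Sort all cyclic rotations; the BWT is the last column of the sorted matrix.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))
```

The rotation matrix has quadratic size, so this form is only useful for checking small cases against a faster implementation.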

Cite as

Jannik Olbrich. Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 60:1-60:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{olbrich:LIPIcs.ESA.2025.60,
  author =	{Olbrich, Jannik},
  title =	{{Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{60:1--60:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.60},
  URN =		{urn:nbn:de:0030-drops-245286},
  doi =		{10.4230/LIPIcs.ESA.2025.60},
  annote =	{Keywords: Burrows-Wheeler Transform, Grammar compression}
}

Document

DOI: 10.4230/LIPIcs.MFCS.2025.36

Counting Distinct Square Substrings in Sublinear Time

Authors: Panagiotis Charalampopoulos, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba

Published in: LIPIcs, Volume 345, 50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)

Abstract

We show that the number of distinct squares in a packed string of length n over an alphabet of size σ can be computed in 𝒪(n/log_{σ}n) time in the word-RAM model of computation. This paper is the first to introduce a sublinear time algorithm for the packed version of squares counting. The packed representation of a string of length n over an alphabet of size σ is given as a sequence of 𝒪(n/log_{σ}n) machine words in the word-RAM model (a machine word consists of ω ≥ log₂ n bits). Previously it was known how to count distinct squares in 𝒪(n) time [Gusfield and Stoye, JCSS 2004], even for a string over an integer alphabet, see [Crochemore et al., TCS 2014; Bannai et al., CPM 2017; Charalampopoulos et al., SPIRE 2020]. We use techniques of squares extraction from runs described by Crochemore et al. [TCS 2014]. However, the packed model requires novel approaches. In particular, we need an 𝒪(n/log_{σ}n) sized representation of all long-period runs (runs with periods that are Ω(log_{σ}n)) which guarantees sublinear time counting of potentially linearly-many implied squares. The long-period runs with a string period that is periodic itself (called layer runs) are an obstacle, since their number can be Ω(n). Fortunately, the number of all other long-period runs is 𝒪(n/log_{σ}n) and we can construct an implicit representation of all long-period runs in 𝒪(n/log_{σ}n) time by adopting the insights of Amir et al. [ESA 2019], combined with sublinear time tools provided by the PILLAR model of computations in case of packed strings. We count squares in layer runs in sublinear time by exploiting combinatorial properties of types of pyramidally-shaped groups of layer runs. As a by-product, we discover several new structural properties of runs. Another difficulty is to compute, in sublinear time, locations of Lyndon roots of runs in packed strings, which is needed for grouping of runs that can generate equal squares. To overcome this difficulty, we introduce sparse-Lyndon roots which are based on the notion of string synchronizers proposed by Kempa and Kociumaka [STOC 2019].
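
To pin down the quantity being counted, here is a brute-force sketch that enumerates every substring of the form uu and counts the distinct ones. It runs in polynomial time on an unpacked string and shares nothing with the sublinear packed-string algorithm of the paper; the function name is an assumption.

```python
def count_distinct_squares(s: str) -> int:
    """Number of distinct substrings of `s` of the form uu with u non-empty."""
    squares = set()
    n = len(s)
    for start in range(n):
        # `half` is the length of u; the square occupies s[start:start + 2 * half].
        for half in range(1, (n - start) // 2 + 1):
            if s[start:start + half] == s[start + half:start + 2 * half]:
                squares.add(s[start:start + 2 * half])
    return len(squares)


print(count_distinct_squares("banana"))
```

For "banana" the distinct squares are "anan" and "nana". A square that occurs several times is counted once, which is why a set is used.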

Cite as

Panagiotis Charalampopoulos, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba. Counting Distinct Square Substrings in Sublinear Time. In 50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 345, pp. 36:1-36:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{charalampopoulos_et_al:LIPIcs.MFCS.2025.36,
  author =	{Charalampopoulos, Panagiotis and Mohamed, Manal and Radoszewski, Jakub and Rytter, Wojciech and Wale\'{n}, Tomasz and Zuba, Wiktor},
  title =	{{Counting Distinct Square Substrings in Sublinear Time}},
  booktitle =	{50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)},
  pages =	{36:1--36:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-388-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{345},
  editor =	{Gawrychowski, Pawe{\l} and Mazowiecki, Filip and Skrzypczak, Micha{\l}},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2025.36},
  URN =		{urn:nbn:de:0030-drops-241439},
  doi =		{10.4230/LIPIcs.MFCS.2025.36},
  annote =	{Keywords: square in a string, packed model, run (maximal repetition), Lyndon word}
}

Document

DOI: 10.4230/LIPIcs.MFCS.2025.81

Efficient Matching of Some Fundamental Regular Expressions with Backreferences

Authors: Taisei Nogami and Tachio Terauchi

Published in: LIPIcs, Volume 345, 50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)

Abstract

Regular expression matching is of practical importance due to its widespread use in real-world applications. In practical use, regular expressions are often used with real-world extensions. Accordingly, the matching problem of regular expressions with real-world extensions has been actively studied in recent years, yielding steady progress. However, backreference, a popular extension supported by most modern programming languages such as Java, Python, JavaScript and others in their standard libraries for string processing, is an exception to this positive trend. In fact, it is known that the matching problem of regular expressions with backreferences (rewbs) is theoretically hard and the existence of an asymptotically fast matching algorithm for arbitrary rewbs seems unlikely. Even among currently known partial solutions, the balance between efficiency and generality remains unsatisfactory. To bridge this gap, we present an efficient matching algorithm for rewbs of the form e_0 (e)_1 e_1 \1 e_2 where e_0, e, e_1, e_2 are pure regular expressions, which are fundamental and frequently used in practical applications. It runs in quadratic time with respect to the input string length, substantially improving the best-known cubic time complexity for these rewbs. Our algorithm combines ideas from both stringology and automata theory in a novel way. We leverage two techniques from automata theory, injection and summarization, to simultaneously examine matches whose backreferenced substrings are either a fixed right-maximal repeat or its extendable prefixes, which are concepts from stringology. By further utilizing a subtle property of extendable prefixes, our algorithm correctly decides the matching problem while achieving the quadratic-time complexity.
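
The shape e_0 (e)_1 e_1 \1 e_2 can be illustrated with a small instance written for Python's standard `re` module. The concrete sub-expressions below (`a*`, `b+`, `c*`, `d*`) are hypothetical choices made only for this example. Python's engine matches by backtracking, so this sketch shows the semantics of the backreference and not the quadratic-time algorithm of the paper.

```python
import re

# Hypothetical instance of e_0 (e)_1 e_1 \1 e_2 with
# e_0 = a*, e = b+, e_1 = c*, e_2 = d*.
REWB = re.compile(r"a*(b+)c*\1d*")


def matches(s: str) -> bool:
    """True if the whole string matches the example rewb."""
    return REWB.fullmatch(s) is not None


for candidate in ["abbcbbd", "abbcbd", "bb", "b"]:
    print(candidate, matches(candidate))
```

The string "abbcbd" is rejected because the substring captured by the group must reappear verbatim at the backreference, and no choice of captured run of b's makes the rest of the string fit.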

Cite as

Taisei Nogami and Tachio Terauchi. Efficient Matching of Some Fundamental Regular Expressions with Backreferences. In 50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 345, pp. 81:1-81:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{nogami_et_al:LIPIcs.MFCS.2025.81,
  author =	{Nogami, Taisei and Terauchi, Tachio},
  title =	{{Efficient Matching of Some Fundamental Regular Expressions with Backreferences}},
  booktitle =	{50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)},
  pages =	{81:1--81:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-388-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{345},
  editor =	{Gawrychowski, Pawe{\l} and Mazowiecki, Filip and Skrzypczak, Micha{\l}},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2025.81},
  URN =		{urn:nbn:de:0030-drops-241886},
  doi =		{10.4230/LIPIcs.MFCS.2025.81},
  annote =	{Keywords: Regular expressions, Backreferences, Regex matching, NFA simulation, Suffix arrays, Right-maximal repeats}
}

Document

Research

DOI: 10.4230/OASIcs.Grossi.9

On String and Graph Sanitization

Authors: Giulia Bernardini, Huiping Chen, Grigorios Loukides, and Solon P. Pissis

Published in: OASIcs, Volume 132, From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday (2025)

Abstract

Data sanitization is a process that conceals sensitive patterns from a given dataset. A secondary goal is to not severely harm the utility of the underlying data along this process. We survey some recent advancements on two related data sanitization topics: string and graph sanitization. In particular, we highlight the important contributions of our friend Prof. Roberto Grossi along this journey.

Cite as

Giulia Bernardini, Huiping Chen, Grigorios Loukides, and Solon P. Pissis. On String and Graph Sanitization. In From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 132, pp. 9:1-9:10, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{bernardini_et_al:OASIcs.Grossi.9,
  author =	{Bernardini, Giulia and Chen, Huiping and Loukides, Grigorios and Pissis, Solon P.},
  title =	{{On String and Graph Sanitization}},
  booktitle =	{From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday},
  pages =	{9:1--9:10},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-391-1},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{132},
  editor =	{Conte, Alessio and Marino, Andrea and Rosone, Giovanna and Vitter, Jeffrey Scott},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Grossi.9},
  URN =		{urn:nbn:de:0030-drops-238086},
  doi =		{10.4230/OASIcs.Grossi.9},
  annote =	{Keywords: data privacy, data sanitization, string algorithms, graph algorithms}
}

Document

Research

DOI: 10.4230/OASIcs.Grossi.6

Compact Data Structures for Collections of Sets

Authors: Jarno N. Alanko, Philip Bille, Inge Li Gørtz, Gonzalo Navarro, and Simon J. Puglisi

Published in: OASIcs, Volume 132, From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday (2025)

Abstract

We define a new entropy measure L(𝒮), called the containment entropy, for a set 𝒮 of sets, which considers the fact that some sets can be contained in others. We show how to represent 𝒮 within space close to L(𝒮) so that any element of any set can be retrieved in logarithmic time. We extend the result to predecessor and successor queries and show how some common set operations can be implemented efficiently.

Cite as

Jarno N. Alanko, Philip Bille, Inge Li Gørtz, Gonzalo Navarro, and Simon J. Puglisi. Compact Data Structures for Collections of Sets. In From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 132, pp. 6:1-6:7, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{alanko_et_al:OASIcs.Grossi.6,
  author =	{Alanko, Jarno N. and Bille, Philip and G{\o}rtz, Inge Li and Navarro, Gonzalo and Puglisi, Simon J.},
  title =	{{Compact Data Structures for Collections of Sets}},
  booktitle =	{From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday},
  pages =	{6:1--6:7},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-391-1},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{132},
  editor =	{Conte, Alessio and Marino, Andrea and Rosone, Giovanna and Vitter, Jeffrey Scott},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Grossi.6},
  URN =		{urn:nbn:de:0030-drops-238051},
  doi =		{10.4230/OASIcs.Grossi.6},
  annote =	{Keywords: Compressed data structures, entropy of sets, data compression}
}

Document

Track A: Algorithms, Complexity and Games

DOI: 10.4230/LIPIcs.ICALP.2025.88

Repetition Aware Text Indexing for Matching Patterns with Wildcards

Authors: Daniel Gibney, Jackson Huffstutler, Mano Prakash Parthasarathi, and Sharma V. Thankachan

Published in: LIPIcs, Volume 334, 52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025)

Abstract

We study the problem of indexing a text T[1..n] to support pattern matching with wildcards. The input of a query is a pattern P[1..m] containing h ∈ [0, k] wildcard (a.k.a. don't care) characters and the output is the set of occurrences of P in T (i.e., starting positions of substrings of T that match P), where k = o(log n) is fixed at index construction. A classic solution by Cole et al. [STOC 2004] provides an index with space complexity O(n ⋅ (c log n)^k / k!) and query time O(m + 2^h log log n + occ), where c > 1 is a constant, and occ denotes the number of occurrences of P in T. We introduce a new data structure that significantly reduces space usage for highly repetitive texts while maintaining efficient query processing. Its space (in words) and query time are O(δ log (n/δ) ⋅ c^k (1 + (log^k (δ log n))/k!)) and O((m + 2^h + occ) log n), respectively. The parameter δ, known as substring complexity, is a recently introduced measure of repetitiveness that serves as a unifying and lower-bounding metric for several popular measures, including the number of phrases in the LZ77 factorization (denoted by z) and the number of runs in the Burrows-Wheeler Transform (denoted by r). Moreover, O(δ log (n/δ)) represents the optimal space required to encode the data in terms of n and δ, helping us see how close our space is to the minimum required. In another trade-off, we match the query time of Cole et al.’s index using O(n + δ log (n/δ) ⋅ (c log δ)^{k+ε}/k!) space, where ε > 0 is an arbitrarily small constant. We also demonstrate how these techniques can be applied to a more general indexing problem, where the query pattern includes k-gaps (a gap can be interpreted as a contiguous sequence of wildcard characters).

Cite as

Daniel Gibney, Jackson Huffstutler, Mano Prakash Parthasarathi, and Sharma V. Thankachan. Repetition Aware Text Indexing for Matching Patterns with Wildcards. In 52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 334, pp. 88:1-88:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{gibney_et_al:LIPIcs.ICALP.2025.88,
  author =	{Gibney, Daniel and Huffstutler, Jackson and Parthasarathi, Mano Prakash and Thankachan, Sharma V.},
  title =	{{Repetition Aware Text Indexing for Matching Patterns with Wildcards}},
  booktitle =	{52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025)},
  pages =	{88:1--88:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-372-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{334},
  editor =	{Censor-Hillel, Keren and Grandoni, Fabrizio and Ouaknine, Jo\"{e}l and Puppis, Gabriele},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2025.88},
  URN =		{urn:nbn:de:0030-drops-234656},
  doi =		{10.4230/LIPIcs.ICALP.2025.88},
  annote =	{Keywords: Pattern Matching, Text Indexing, Wildcard Matching}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.7

FL-RMQ: A Learned Approach to Range Minimum Queries

Authors: Paolo Ferragina and Filippo Lari

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

We address the problem of designing and implementing a data structure for the Range Minimum Query problem. We show a surprising connection between this classical problem and the geometry of a properly defined set of points in the Cartesian plane. Building on this insight, we hinge upon a well-known result in Computational Geometry to introduce the first RMQ solution that exploits (i.e., learns) the distribution of such 2D-points via proper error-bounded linear approximations. Because of these features, we name the resulting data structure: Fully-Learned RMQ, shortly FL-RMQ. We prove theoretical bounds for its space usage and query time, covering both worst-case scenarios and average-case performance for uniformly distributed inputs. These bounds compare favorably with the ones achievable by the best-known indexing solutions (i.e., the ones that allow access to the indexed array), especially when the input data follow some geometric regularities that we characterize in the paper, thus providing principled evidence of FL-RMQ being a novel data-aware solution to the RMQ problem. We corroborate our theoretical findings with a wide set of experiments showing that FL-RMQ offers more robust space-time trade-offs than the other known practical indexing solutions on both artificial and real-world datasets. We believe that our novel approach to the RMQ problem is noteworthy not only for its interesting space-time trade-offs, but also because it is flexible enough to be applied easily to the encoding variant of RMQ (i.e., the one that does not allow access to the indexed array), and moreover, because it paves the way to research opportunities on possibly other problems.

Cite as

Paolo Ferragina and Filippo Lari. FL-RMQ: A Learned Approach to Range Minimum Queries. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 7:1-7:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{ferragina_et_al:LIPIcs.CPM.2025.7,
  author =	{Ferragina, Paolo and Lari, Filippo},
  title =	{{FL-RMQ: A Learned Approach to Range Minimum Queries}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{7:1--7:23},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.7},
  URN =		{urn:nbn:de:0030-drops-231014},
  doi =		{10.4230/LIPIcs.CPM.2025.7},
  annote =	{Keywords: Range-Minimum query, Learned data structures, Compact data structures, Experimental results}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.9

Pattern Matching on Run-Length Grammar-Compressed Strings in Linear Time

Authors: Yuto Iguchi, Ryo Yoshinaka, and Ayumi Shinohara

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

Run-length straight-line programs (RLSLPs) are a technique for grammar-based compression, allowing any string to be represented with optimal space for δ, the substring complexity of the string. We address the compressed pattern matching problem for RLSLPs: Given a compressed text in RLSLP format and an uncompressed pattern, determine if the pattern appears in the text. This paper proposes an algorithm that solves this problem in linear time with respect to the size of the grammar and the length of the pattern.
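
The following sketch shows one possible in-memory form of a run-length straight-line program and a naive way to answer the matching question by decompressing first. The tuple-based rule encoding, the rule names, and the helper functions are all assumptions made for illustration. Decompression can take time exponential in the grammar size, which is the cost the paper's linear-time algorithm avoids.

```python
# Assumed rule encoding: ("char", c), ("concat", X, Y), or ("run", X, t) for X^t.
Grammar = dict[str, tuple]


def expand(grammar: Grammar, symbol: str) -> str:
    """Fully decompress the string derived from `symbol`."""
    rule = grammar[symbol]
    kind = rule[0]
    if kind == "char":
        return rule[1]
    if kind == "concat":
        return expand(grammar, rule[1]) + expand(grammar, rule[2])
    if kind == "run":
        return expand(grammar, rule[1]) * rule[2]
    raise ValueError(f"unknown rule kind: {kind}")


def naive_contains(grammar: Grammar, start: str, pattern: str) -> bool:
    """Baseline: decompress the text, then search it for the pattern."""
    return pattern in expand(grammar, start)


example: Grammar = {
    "A": ("char", "a"),
    "B": ("char", "b"),
    "C": ("concat", "A", "B"),  # ab
    "D": ("run", "C", 3),       # (ab)^3
    "S": ("concat", "D", "A"),  # ababab followed by a
}
print(expand(example, "S"), naive_contains(example, "S", "baba"))
```

The run rule is what distinguishes an RLSLP from a plain straight-line program: a repetition X^t costs one rule regardless of t.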

Cite as

Yuto Iguchi, Ryo Yoshinaka, and Ayumi Shinohara. Pattern Matching on Run-Length Grammar-Compressed Strings in Linear Time. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 9:1-9:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{iguchi_et_al:LIPIcs.CPM.2025.9,
  author =	{Iguchi, Yuto and Yoshinaka, Ryo and Shinohara, Ayumi},
  title =	{{Pattern Matching on Run-Length Grammar-Compressed Strings in Linear Time}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{9:1--9:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.9},
  URN =		{urn:nbn:de:0030-drops-231034},
  doi =		{10.4230/LIPIcs.CPM.2025.9},
  annote =	{Keywords: pattern matching, run-length straight-line programs, compression, suffix tree}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.13

Doubly-Periodic String Comparison

Authors: Nikita Gaevoy, Boris Zolotov, and Alexander Tiskin

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

The longest common subsequence (LCS) problem is a fundamental algorithmic problem. Given a pair of strings, the problem asks for the length of the longest string that is a subsequence of both input strings. Among the many relatives of this problem, there is its natural version where one or both of the input strings have a periodic structure. The case where only one of the input strings is periodic has been considered before; in this work, we develop an efficient algorithm for the more difficult case where both input strings are periodic. The algorithm is based on the existing algebraic framework for the LCS problem, developed by the third author; in particular, we extend this framework to deal with affine (i.e., doubly-infinite periodic) permutations instead of finite ones. Given input strings that are a k-repeat of a period of length m and an 𝓁-repeat of a period of length n, the resulting algorithm runs in time O(mn+n log n log k), which is a substantial improvement over existing approaches. The algorithm has been implemented by the first author; by running his code, one can process pairs of periodic input strings with lengths far beyond the reach of all known alternative algorithms.
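
To make the problem statement concrete, the following is a minimal baseline sketch, not the algorithm of the paper: it materialises both periodic strings and runs the classic quadratic dynamic program, which takes time proportional to (k·m)·(𝓁·n). The function names `lcs_length` and `periodic_lcs_naive` are hypothetical and chosen only for this illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, classic O(|a|*|b|) DP."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ch_a in a:
        curr = [0]  # column 0: empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def periodic_lcs_naive(u: str, k: int, v: str, l: int) -> int:
    """LCS of the k-repeat of u and the l-repeat of v (naive baseline).

    Assumption: both strings are expanded explicitly, so the cost grows
    with k * l. The paper's bound O(mn + n log n log k) avoids exactly this.
    """
    return lcs_length(u * k, v * l)


if __name__ == "__main__":
    print(periodic_lcs_naive("ab", 3, "ba", 2))
```

Such a baseline is mainly useful as a correctness check on small inputs; it becomes impractical once k and 𝓁 are large.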

Cite as

Nikita Gaevoy, Boris Zolotov, and Alexander Tiskin. Doubly-Periodic String Comparison. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 13:1-13:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{gaevoy_et_al:LIPIcs.CPM.2025.13,
  author =	{Gaevoy, Nikita and Zolotov, Boris and Tiskin, Alexander},
  title =	{{Doubly-Periodic String Comparison}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{13:1--13:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.13},
  URN =		{urn:nbn:de:0030-drops-231079},
  doi =		{10.4230/LIPIcs.CPM.2025.13},
  annote =	{Keywords: String Comparison, periodic Strings, Longest common Subsequence, affine Hecke Monoid, affine sticky Braids}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.24

Sorted Consecutive Occurrence Queries in Substrings

Authors: Waseem Akram and Takuya Mieno

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

The string indexing problem is a fundamental computational problem with numerous applications, including information retrieval and bioinformatics. It aims to efficiently solve the pattern matching problem: given a text T of length n for preprocessing and a pattern P of length m as a query, the goal is to report all occurrences of P as substrings of T. Navarro and Thankachan [CPM 2015, Theor. Comput. Sci. 2016] introduced a variant of this problem called the gap-bounded consecutive occurrence query, which reports pairs of consecutive occurrences of P in T such that their gaps (i.e., the distances between them) lie within a query-specified range [g₁, g₂]. Recently, Bille et al. [FSTTCS 2020, Theor. Comput. Sci. 2022] proposed the top-k close consecutive occurrence query, which reports the k closest consecutive occurrences of P in T, sorted in non-decreasing order of distance. Both problems are optimally solved in query time with O(n log n)-space data structures. In this paper, we generalize these problems to the range query model, which focuses only on occurrences of P in a specified substring T[a..b] of T. Our contributions are as follows:
- We propose an O(n log² n)-space data structure that answers the range top-k consecutive occurrence query in O(|P| + log log n + k) time.
- We propose an O(n log^{2+ε} n)-space data structure that answers the range gap-bounded consecutive occurrence query in O(|P| + log log n + output) time, where ε is a positive constant and output denotes the number of outputs.
Additionally, as by-products, we present algorithms for geometric problems involving weighted horizontal segments in a 2D plane, which are of independent interest.
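
The two range queries described above can be illustrated with a brute-force sketch that scans the substring directly. This only shows what the queries return; it is not the indexed data structure of the paper. The function names are hypothetical, positions are 0-based with an inclusive range [a, b], and the distance between two consecutive occurrences is assumed here to be the difference of their starting positions.

```python
def occurrences(text: str, pattern: str, a: int, b: int) -> list[int]:
    """Start positions of non-empty pattern lying entirely inside text[a..b]."""
    m = len(pattern)
    return [i for i in range(a, b - m + 2) if text[i:i + m] == pattern]


def range_top_k(text: str, pattern: str, a: int, b: int, k: int):
    """The k closest consecutive occurrence pairs, in non-decreasing distance."""
    occ = occurrences(text, pattern, a, b)
    pairs = sorted((q - p, p, q) for p, q in zip(occ, occ[1:]))
    return [(p, q) for _, p, q in pairs[:k]]


def range_gap_bounded(text: str, pattern: str, a: int, b: int, g1: int, g2: int):
    """Consecutive occurrence pairs whose distance lies in [g1, g2]."""
    occ = occurrences(text, pattern, a, b)
    return [(p, q) for p, q in zip(occ, occ[1:]) if g1 <= q - p <= g2]


if __name__ == "__main__":
    t = "abaababaab"
    print(range_top_k(t, "ab", 0, 9, 2))
    print(range_gap_bounded(t, "ab", 0, 9, 3, 3))
```

Each brute-force query costs time proportional to the length of the range times |P|, which is the cost that a preprocessed index is meant to avoid.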

Cite as

Waseem Akram and Takuya Mieno. Sorted Consecutive Occurrence Queries in Substrings. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 24:1-24:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{akram_et_al:LIPIcs.CPM.2025.24,
  author =	{Akram, Waseem and Mieno, Takuya},
  title =	{{Sorted Consecutive Occurrence Queries in Substrings}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{24:1--24:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.24},
  URN =		{urn:nbn:de:0030-drops-231187},
  doi =		{10.4230/LIPIcs.CPM.2025.24},
  annote =	{Keywords: string algorithm, consecutive occurrences, suffix tree}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.20

Text Indexing for Simple Regular Expressions

Authors: Hideo Bannai, Philip Bille, Inge Li Gørtz, Gad M. Landau, Gonzalo Navarro, Nicola Prezza, Teresa Anna Steiner, and Simon Rumle Tarnow

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

We study the problem of indexing a text T[1..n] ∈ Σⁿ so that, later, given a query regular expression pattern R of size m = |R|, we can report all the occ substrings T[i..j] of T matching R. The problem is known to be hard for arbitrary patterns R, so in this paper, we consider the following two types of patterns. (1) Character-class Kleene-star patterns of the form P₁ D^* P₂, where P₁ and P₂ are strings and D = {c₁, …, c_k} ⊂ Σ is a character-class (shorthand for the regular expression (c₁ | c₂ | ⋯ | c_k)) and (2) String Kleene-star patterns of the form P₁ P^* P₂ where P, P₁ and P₂ are strings. In case (1), we describe an index of O(nlog^{1+ε}n) space (for any constant ε > 0) solving queries in time O(m + log n/log log n + occ) on constant-sized alphabets. We also describe a general solution for any alphabet size. This result is conditioned on the existence of an anchor: a character of P₁P₂ that does not belong to D. We justify this assumption by proving that no efficient indexing solution can exist if an anchor is not present unless the Set Disjointness Conjecture fails. In case (2), we describe an index of size O(n) answering queries in time O(m + (occ+1)log^{ε}n) on any alphabet size.

Cite as

Hideo Bannai, Philip Bille, Inge Li Gørtz, Gad M. Landau, Gonzalo Navarro, Nicola Prezza, Teresa Anna Steiner, and Simon Rumle Tarnow. Text Indexing for Simple Regular Expressions. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 20:1-20:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{bannai_et_al:LIPIcs.CPM.2025.20,
  author =	{Bannai, Hideo and Bille, Philip and G{\o}rtz, Inge Li and Landau, Gad M. and Navarro, Gonzalo and Prezza, Nicola and Steiner, Teresa Anna and Tarnow, Simon Rumle},
  title =	{{Text Indexing for Simple Regular Expressions}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{20:1--20:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.20},
  URN =		{urn:nbn:de:0030-drops-231143},
  doi =		{10.4230/LIPIcs.CPM.2025.20},
  annote =	{Keywords: Text indexing, regular expressions, data structures}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.21

Compressed Dictionary Matching on Run-Length Encoded Strings

Authors: Philip Bille, Inge Li Gørtz, Simon J. Puglisi, and Simon R. Tarnow

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

Given a set of pattern strings 𝒫 = {P₁, P₂, …, P_k} and a text string S, the classic dictionary matching problem is to report all occurrences of each pattern in S. We study the dictionary problem in the compressed setting, where the pattern strings and the text string are compressed using run-length encoding, and the goal is to solve the problem without decompression and achieve efficient time and space in the size of the compressed strings. Let m and n be the total length of the patterns 𝒫 and the length of the text string S, respectively, and let m̄ and n̄ be the total number of runs in the run-length encoding of the patterns in 𝒫 and S, respectively. Our main result is an algorithm that achieves O((m̄ + n̄) log log m + occ) expected time, and O(m̄) space, where occ is the total number of occurrences of patterns in S. This is the first non-trivial solution to the problem. Since any solution must read the input, our time bound is optimal within a log log m factor. We introduce several new techniques to achieve our bounds, including a new compressed representation of the classic Aho-Corasick automaton and a new efficient string index that supports fast queries in run-length encoded strings.
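
As a small illustration of the input representation only, the sketch below computes a run-length encoding and the run counts that the bounds are stated in (the total number of runs over the patterns, and the number of runs in the text). The helper names are hypothetical and the code does not perform dictionary matching.

```python
from itertools import groupby


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Encode s as a list of (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def run_length_decode(runs: list[tuple[str, int]]) -> str:
    """Inverse of run_length_encode."""
    return "".join(ch * count for ch, count in runs)


def total_runs(strings: list[str]) -> int:
    """Total number of runs over a collection of strings."""
    return sum(len(run_length_encode(s)) for s in strings)


if __name__ == "__main__":
    patterns = ["aab", "bbb"]
    text = "aaabccdd"
    print(run_length_encode(text))
    print(total_runs(patterns), total_runs([text]))
```

For highly repetitive inputs the run counts can be far smaller than the uncompressed lengths m and n, which is why bounds in terms of runs are of interest.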

Cite as

Philip Bille, Inge Li Gørtz, Simon J. Puglisi, and Simon R. Tarnow. Compressed Dictionary Matching on Run-Length Encoded Strings. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 21:1-21:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{bille_et_al:LIPIcs.CPM.2025.21,
  author =	{Bille, Philip and G{\o}rtz, Inge Li and Puglisi, Simon J. and Tarnow, Simon R.},
  title =	{{Compressed Dictionary Matching on Run-Length Encoded Strings}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{21:1--21:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.21},
  URN =		{urn:nbn:de:0030-drops-231158},
  doi =		{10.4230/LIPIcs.CPM.2025.21},
  annote =	{Keywords: Dictionary matching, run-length encoding, compressed pattern matching}
}

44 Search Results for "Bille, Philip"
