DROPS

Document

Wheeler Graphs and Wheeler Languages

Authors: Nicola Cotumaccio, Giovanna D'Agostino, Daniel Gibney, Alberto Policriti, Nicola Prezza, and Sharma V. Thankachan

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

Suffix sorting stands at the core of the most efficient solutions for indexed pattern matching: the suffix tree, the suffix array, compressed indexes based on the Burrows-Wheeler transform, and so on. In [Gagie, Manzini, Sirén, TCS 2017] this concept was extended to labeled graphs, obtaining the rich class of Wheeler graphs. This work opened a very fruitful line of research, ultimately generating results able to bridge the fields of compressed data structures, graph theory, and regular language theory. In a Wheeler graph, nodes are sorted according to the alphabetic order of their incoming labels, propagating this order through pairs of equally-labeled edges. This apparently-simple definition makes it possible to solve on Wheeler graphs problems (including, but not limited to: compression, subpath queries, NFA equivalence, determinization, minimization) that on general labeled graphs are extremely hard to solve, and induces a rich structure in the class of regular languages (Wheeler languages) recognized by automata whose state transition is a Wheeler graph. The goal of this survey is to provide a summary of (and intuitions behind) the results on Wheeler graphs that appeared in the literature since their introduction, in addition to a discussion of interesting problems that are still open in the field.

Cite as

Nicola Cotumaccio, Giovanna D'Agostino, Daniel Gibney, Alberto Policriti, Nicola Prezza, and Sharma V. Thankachan. Wheeler Graphs and Wheeler Languages. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 12:1-12:28, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{cotumaccio_et_al:OASIcs.Manzini.12,
  author =	{Cotumaccio, Nicola and D'Agostino, Giovanna and Gibney, Daniel and Policriti, Alberto and Prezza, Nicola and Thankachan, Sharma V.},
  title =	{{Wheeler Graphs and Wheeler Languages}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{12:1--12:28},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.12},
  URN =		{urn:nbn:de:0030-drops-239205},
  doi =		{10.4230/OASIcs.Manzini.12},
  annote =	{Keywords: Wheeler languages, Wheeler graphs, pattern matching, indexing, compressed data structures}
}

Document

Complete Volume

DOI: 10.4230/LIPIcs.SEA.2025

LIPIcs, Volume 338, SEA 2025, Complete Volume

Authors: Petra Mutzel and Nicola Prezza

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)

Abstract

LIPIcs, Volume 338, SEA 2025, Complete Volume

Cite as

23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 1-502, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@Proceedings{mutzel_et_al:LIPIcs.SEA.2025,
  title =	{{LIPIcs, Volume 338, SEA 2025, Complete Volume}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{1--502},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025},
  URN =		{urn:nbn:de:0030-drops-239259},
  doi =		{10.4230/LIPIcs.SEA.2025},
  annote =	{Keywords: LIPIcs, Volume 338, SEA 2025, Complete Volume}
}

Document

Front Matter

DOI: 10.4230/LIPIcs.SEA.2025.0

Front Matter, Table of Contents, Preface, Conference Organization

Authors: Petra Mutzel and Nicola Prezza

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)

Abstract

Front Matter, Table of Contents, Preface, Conference Organization

Cite as

23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 0:i-0:xviii, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{mutzel_et_al:LIPIcs.SEA.2025.0,
  author =	{Mutzel, Petra and Prezza, Nicola},
  title =	{{Front Matter, Table of Contents, Preface, Conference Organization}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{0:i--0:xviii},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.0},
  URN =		{urn:nbn:de:0030-drops-239249},
  doi =		{10.4230/LIPIcs.SEA.2025.0},
  annote =	{Keywords: Front Matter, Table of Contents, Preface, Conference Organization}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.15

Encoding Co-Lex Orders of Finite-State Automata in Linear Space

Authors: Ruben Becker, Nicola Cotumaccio, Sung-Hwan Kim, Nicola Prezza, and Carlo Tosoni

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

The Burrows-Wheeler transform (BWT) is a string transformation that enhances string indexing and compressibility. Cotumaccio and Prezza [SODA '21] extended this transformation to nondeterministic finite automata (NFAs) through co-lexicographic partial orders, i.e., by sorting the states of an NFA according to the co-lexicographic order of the strings reaching them. As the BWT of an NFA shares many properties with its original string variant, the transformation can be used to implement indices for locating specific patterns on the NFA itself. The efficiency of the resulting index is influenced by the width of the partial order on the states: the smaller the width, the faster the index. The most efficient index for arbitrary NFAs currently known in the literature is based on the coarsest forward-stable co-lex (CFS) order of Becker et al. [SPIRE '24]. In this paper, we prove that this CFS order can be encoded within linear space in the number of states in the automaton. The importance of this result stems from the fact that encoding such an order in linear space represents a big first step in the direction of building the index based on this order in near-linear time - the biggest open research question in this context. The currently most efficient known algorithm for this task run in quadratic time in the number of transitions in the NFA and are thus infeasible to run on very large graphs (e.g., pangenome graphs). At this point, a near-linear time algorithm is solely known for the simpler case of deterministic automata [Becker et al., ESA '23] and, in fact, this algorithmic result was enabled by a linear space encoding for deterministic automata [Kim et al., CPM '23].

Cite as

Ruben Becker, Nicola Cotumaccio, Sung-Hwan Kim, Nicola Prezza, and Carlo Tosoni. Encoding Co-Lex Orders of Finite-State Automata in Linear Space. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 15:1-15:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{becker_et_al:LIPIcs.CPM.2025.15,
  author =	{Becker, Ruben and Cotumaccio, Nicola and Kim, Sung-Hwan and Prezza, Nicola and Tosoni, Carlo},
  title =	{{Encoding Co-Lex Orders of Finite-State Automata in Linear Space}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{15:1--15:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.15},
  URN =		{urn:nbn:de:0030-drops-231094},
  doi =		{10.4230/LIPIcs.CPM.2025.15},
  annote =	{Keywords: Burrows-Wheeler Transform, Co-Lexicographic Orders, Nondeterministic Finite Automata, Graph Walks}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.19

The Trie Measure, Revisited

Authors: Jarno N. Alanko, Ruben Becker, Davide Cenzato, Travis Gagie, Sung-Hwan Kim, Bojana Kodric, and Nicola Prezza

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

In this paper, we study the following problem: given n subsets S₁, … , S_n of an integer universe U = {0,… , u-1}, having total cardinality N = ∑_{i = 1}ⁿ |S_i|, find a prefix-free encoding enc : U → {0,1}^+ minimizing the so-called trie measure, i.e., the total number of edges in the n binary tries T₁, … , T_n, where T_i is the trie packing the encoded integers {enc(x):x ∈ S_i}. We first observe that this problem is equivalent to that of merging u sets with the cheapest sequence of binary unions, a problem which in [Ghosh et al., ICDCS 2015] is shown to be NP-hard. Motivated by the hardness of the general problem, we focus on particular families of prefix-free encodings. We start by studying the fixed-length shifted encoding of [Gupta et al., Theoretical Computer Science 2007]. Given a parameter 0 ≤ a < u, this encoding sends each x ∈ U to (x + a) mod u, interpreted as a bit-string of log u bits. We develop the first efficient algorithms that find the value of a minimizing the trie measure when this encoding is used. Our two algorithms run in O(u + Nlog u) and O(Nlog² u) time, respectively. We proceed by studying ordered encodings (a.k.a. monotone or alphabetic), and describe an algorithm finding the optimal such encoding in O(N+u³) time. Within the same running time, we show how to compute the best shifted ordered encoding, provably no worse than both the optimal shifted and optimal ordered encodings. We provide implementations of our algorithms and discuss how these encodings perform in practice.

Cite as

Jarno N. Alanko, Ruben Becker, Davide Cenzato, Travis Gagie, Sung-Hwan Kim, Bojana Kodric, and Nicola Prezza. The Trie Measure, Revisited. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 19:1-19:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{alanko_et_al:LIPIcs.CPM.2025.19,
  author =	{Alanko, Jarno N. and Becker, Ruben and Cenzato, Davide and Gagie, Travis and Kim, Sung-Hwan and Kodric, Bojana and Prezza, Nicola},
  title =	{{The Trie Measure, Revisited}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{19:1--19:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.19},
  URN =		{urn:nbn:de:0030-drops-231135},
  doi =		{10.4230/LIPIcs.CPM.2025.19},
  annote =	{Keywords: Succinct data structures, degenerate strings, integer encoding}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.20

Text Indexing for Simple Regular Expressions

Authors: Hideo Bannai, Philip Bille, Inge Li Gørtz, Gad M. Landau, Gonzalo Navarro, Nicola Prezza, Teresa Anna Steiner, and Simon Rumle Tarnow

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

We study the problem of indexing a text T[1..n] ∈ Σⁿ so that, later, given a query regular expression pattern R of size m = |R|, we can report all the occ substrings T[i..j] of T matching R. The problem is known to be hard for arbitrary patterns R, so in this paper, we consider the following two types of patterns. (1) Character-class Kleene-star patterns of the form P₁ D^* P₂, where P₁ and P₂ are strings and D = {c₁, …, c_k} ⊂ Σ is a character-class (shorthand for the regular expression (c₁ | c₂ | ⋯ | c_k)) and (2) String Kleene-star patterns of the form P₁ P^* P₂ where P, P₁ and P₂ are strings. In case (1), we describe an index of O(nlog^{1+ε}n) space (for any constant ε > 0) solving queries in time O(m + log n/log log n + occ) on constant-sized alphabets. We also describe a general solution for any alphabet size. This result is conditioned on the existence of an anchor: a character of P₁P₂ that does not belong to D. We justify this assumption by proving that no efficient indexing solution can exist if an anchor is not present unless the Set Disjointness Conjecture fails. In case (2), we describe an index of size O(n) answering queries in time O(m + (occ+1)log^{ε}n) on any alphabet size.

Cite as

Hideo Bannai, Philip Bille, Inge Li Gørtz, Gad M. Landau, Gonzalo Navarro, Nicola Prezza, Teresa Anna Steiner, and Simon Rumle Tarnow. Text Indexing for Simple Regular Expressions. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 20:1-20:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{bannai_et_al:LIPIcs.CPM.2025.20,
  author =	{Bannai, Hideo and Bille, Philip and G{\o}rtz, Inge Li and Landau, Gad M. and Navarro, Gonzalo and Prezza, Nicola and Steiner, Teresa Anna and Tarnow, Simon Rumle},
  title =	{{Text Indexing for Simple Regular Expressions}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{20:1--20:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.20},
  URN =		{urn:nbn:de:0030-drops-231143},
  doi =		{10.4230/LIPIcs.CPM.2025.20},
  annote =	{Keywords: Text indexing, regular expressions, data structures}
}

Document

DOI: 10.4230/DagRep.14.11.108

Regular Expressions: Matching and Indexing (Dagstuhl Seminar 24472)

Authors: Inge Li Gørtz, Sebastian Maneth, Gonzalo Navarro, and Nicola Prezza

Published in: Dagstuhl Reports, Volume 14, Issue 11 (2025)

Abstract

This report documents the program and the outcomes of Dagstuhl Seminar 24472 "Regular Expressions: Matching and Indexing".

Cite as

Inge Li Gørtz, Sebastian Maneth, Gonzalo Navarro, and Nicola Prezza. Regular Expressions: Matching and Indexing (Dagstuhl Seminar 24472). In Dagstuhl Reports, Volume 14, Issue 11, pp. 108-119, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@Article{gortz_et_al:DagRep.14.11.108,
  author =	{G{\o}rtz, Inge Li and Maneth, Sebastian and Navarro, Gonzalo and Prezza, Nicola},
  title =	{{Regular Expressions: Matching and Indexing (Dagstuhl Seminar 24472)}},
  pages =	{108--119},
  journal =	{Dagstuhl Reports},
  ISSN =	{2192-5283},
  year =	{2025},
  volume =	{14},
  number =	{11},
  editor =	{G{\o}rtz, Inge Li and Maneth, Sebastian and Navarro, Gonzalo and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagRep.14.11.108},
  URN =		{urn:nbn:de:0030-drops-228177},
  doi =		{10.4230/DagRep.14.11.108},
  annote =	{Keywords: finite automata, regular expressions, complex patterns, text indexing, graph matching and indexing}
}

Document

DOI: 10.4230/LIPIcs.SEA.2024.9

Top-k Frequent Patterns in Streams and Parameterized-Space LZ Compression

Authors: Patrick Dinklage, Johannes Fischer, and Nicola Prezza

Published in: LIPIcs, Volume 301, 22nd International Symposium on Experimental Algorithms (SEA 2024)

Abstract

We present novel online approximations of the Lempel-Ziv 77 (LZ77) and Lempel-Ziv 78 (LZ78) compression schemes [Lempel & Ziv, 1977/1978] with parameterizable space usage based on estimating which k patterns occur the most frequently in the streamed input for parameter k. This new approach overcomes the issue of finding only local repetitions, which is a natural limitation of algorithms that compress using a sliding window or by partitioning the input into blocks. For this, we introduce the top-k trie, a summary for maintaining online the top-k frequent consecutive patterns in a stream of characters based on a combination of the Lempel-Ziv 78 compression scheme and the Misra-Gries algorithm for frequent item estimation in streams. Using straightforward encoding, our implementations yield compression ratios (output over input size) competitive with established general-purpose LZ-based compression utilities such as gzip or xz.

Cite as

Patrick Dinklage, Johannes Fischer, and Nicola Prezza. Top-k Frequent Patterns in Streams and Parameterized-Space LZ Compression. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 9:1-9:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{dinklage_et_al:LIPIcs.SEA.2024.9,
  author =	{Dinklage, Patrick and Fischer, Johannes and Prezza, Nicola},
  title =	{{Top-k Frequent Patterns in Streams and Parameterized-Space LZ Compression}},
  booktitle =	{22nd International Symposium on Experimental Algorithms (SEA 2024)},
  pages =	{9:1--9:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-325-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{301},
  editor =	{Liberti, Leo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2024.9},
  URN =		{urn:nbn:de:0030-drops-203748},
  doi =		{10.4230/LIPIcs.SEA.2024.9},
  annote =	{Keywords: compression, streaming, heavy hitters, algorithm engineering}
}

Document

DOI: 10.4230/LIPIcs.CPM.2024.1

Computing the LCP Array of a Labeled Graph

Authors: Jarno N. Alanko, Davide Cenzato, Nicola Cotumaccio, Sung-Hwan Kim, Giovanni Manzini, and Nicola Prezza

Published in: LIPIcs, Volume 296, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)

Abstract

The LCP array is an important tool in stringology, allowing to speed up pattern matching algorithms and enabling compact representations of the suffix tree. Recently, Conte et al. [DCC 2023] and Cotumaccio et al. [SPIRE 2023] extended the definition of this array to Wheeler DFAs and, ultimately, to arbitrary labeled graphs, proving that it can be used to efficiently solve matching statistics queries on the graph’s paths. In this paper, we provide the first efficient algorithm building the LCP array of a directed labeled graph with n nodes and m edges labeled over an alphabet of size σ. The first step is to transform the input graph G into a deterministic Wheeler pseudoforest G_{is} with O(n) edges encoding the lexicographically- smallest and largest strings entering in each node of the original graph. Using state-of-the-art algorithms, this step runs in O(min{mlog n, m+n²}) time on arbitrary labeled graphs, and in O(m) time on Wheeler DFAs. The LCP array of G stores the longest common prefixes between those strings, i.e. it can easily be derived from the LCP array of G_{is}. After arguing that the natural generalization of a compact-space LCP-construction algorithm by Beller et al. [J. Discrete Algorithms 2013] runs in time Ω(nσ) on pseudoforests, we present a new algorithm based on dynamic range stabbing building the LCP array of G_{is} in O(nlog σ) time and O(nlogσ) bits of working space. Combined with our reduction, we obtain the first efficient algorithm to build the LCP array of an arbitrary labeled graph. An implementation of our algorithm is publicly available at https://github.com/regindex/Labeled-Graph-LCP.

Cite as

Jarno N. Alanko, Davide Cenzato, Nicola Cotumaccio, Sung-Hwan Kim, Giovanni Manzini, and Nicola Prezza. Computing the LCP Array of a Labeled Graph. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 1:1-1:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{alanko_et_al:LIPIcs.CPM.2024.1,
  author =	{Alanko, Jarno N. and Cenzato, Davide and Cotumaccio, Nicola and Kim, Sung-Hwan and Manzini, Giovanni and Prezza, Nicola},
  title =	{{Computing the LCP Array of a Labeled Graph}},
  booktitle =	{35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)},
  pages =	{1:1--1:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-326-3},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{296},
  editor =	{Inenaga, Shunsuke and Puglisi, Simon J.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.1},
  URN =		{urn:nbn:de:0030-drops-201113},
  doi =		{10.4230/LIPIcs.CPM.2024.1},
  annote =	{Keywords: LCP array, Wheeler automata, prefix sorting, pattern matching, sorting}
}

Document

DOI: 10.4230/LIPIcs.CPM.2024.5

Random Wheeler Automata

Authors: Ruben Becker, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Riccardo Maso, and Nicola Prezza

Published in: LIPIcs, Volume 296, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)

Abstract

Wheeler automata were introduced in 2017 as a tool to generalize existing indexing and compression techniques based on the Burrows-Wheeler transform. Intuitively, an automaton is said to be Wheeler if there exists a total order on its states reflecting the natural co-lexicographic order of the strings labeling the automaton’s paths; this property makes it possible to represent the automaton’s topology in a constant number of bits per transition, as well as efficiently solving pattern matching queries on its accepted regular language. After their introduction, Wheeler automata have been the subject of a prolific line of research, both from the algorithmic and language-theoretic points of view. A recurring issue faced in these studies is the lack of large datasets of Wheeler automata on which the developed algorithms and theories could be tested. One possible way to overcome this issue is to generate random Wheeler automata. Motivated by this observation of practical nature, in this paper we initiate the theoretical study of random Wheeler automata, focusing our attention on the deterministic case (Wheeler DFAs - WDFAs). We start by naturally extending the Erdős-Rényi random graph model to WDFAs, and proceed by providing an algorithm generating uniform WDFAs according to this model. Our algorithm generates a uniform WDFA with n states, m transitions, and alphabet’s cardinality σ in O(m) expected time (O(mlog m) time w.h.p.) and constant working space for all alphabets of size σ ≤ m/ln m. The output WDFA is streamed directly to the output. As a by-product, we also give formulas for the number of distinct WDFAs and obtain that nσ + (n - σ) log σ bits are necessary and sufficient to encode a WDFA with n states and alphabet of size σ, up to an additive Θ(n) term. We present an implementation of our algorithm and show that it is extremely fast in practice, with a throughput of over 8 million transitions per second.

Cite as

Ruben Becker, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Riccardo Maso, and Nicola Prezza. Random Wheeler Automata. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 5:1-5:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{becker_et_al:LIPIcs.CPM.2024.5,
  author =	{Becker, Ruben and Cenzato, Davide and Kim, Sung-Hwan and Kodric, Bojana and Maso, Riccardo and Prezza, Nicola},
  title =	{{Random Wheeler Automata}},
  booktitle =	{35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)},
  pages =	{5:1--5:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-326-3},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{296},
  editor =	{Inenaga, Shunsuke and Puglisi, Simon J.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.5},
  URN =		{urn:nbn:de:0030-drops-201157},
  doi =		{10.4230/LIPIcs.CPM.2024.5},
  annote =	{Keywords: Wheeler automata, Burrows-Wheeler transform, random graphs}
}

Document

DOI: 10.4230/LIPIcs.CPM.2024.23

The Rational Construction of a Wheeler DFA

Authors: Giovanni Manzini, Alberto Policriti, Nicola Prezza, and Brian Riccardi

Published in: LIPIcs, Volume 296, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)

Abstract

Deterministic Finite Wheeler Automata are a natural generalisation to regular languages of the theory of compressed data structures originated by the introduction of the Burrows-Wheeler transform. Indeed, if we can find a Wheeler automaton recognizing a given language L, such automaton can be used to design time and space efficient algorithms for representing and searching L. In this paper we introduce an alternative representation of Deterministic Wheeler Automata by showing that a natural map between strings and rational numbers in ℚ [0,1) can be extended to represent the automaton’s states as intervals in ℚ [0,1). Using this representation it emerges a natural relationship between automata properties and some properties of real numbers. In addition, such representation enables us to formulate problems related to automata in a numerical setting. Although at the moment the numerical approach does not lead to time efficient algorithms, we believe this new perspective deserves further consideration. As a further demonstration of the convenience of this new representation, we use it to provide a simple proof of an unexpected result on regular languages. More precisely, we compare the size of the smallest Wheeler automaton recognizing a given language L with respect to the size of the smallest automaton, possibly non-Wheeler, recognizing the same language. We show settings in which there can be an exponential gap between the two sizes, and we discuss the implications of this result on the problem of representing regular languages.

Cite as

Giovanni Manzini, Alberto Policriti, Nicola Prezza, and Brian Riccardi. The Rational Construction of a Wheeler DFA. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 23:1-23:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{manzini_et_al:LIPIcs.CPM.2024.23,
  author =	{Manzini, Giovanni and Policriti, Alberto and Prezza, Nicola and Riccardi, Brian},
  title =	{{The Rational Construction of a Wheeler DFA}},
  booktitle =	{35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)},
  pages =	{23:1--23:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-326-3},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{296},
  editor =	{Inenaga, Shunsuke and Puglisi, Simon J.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.23},
  URN =		{urn:nbn:de:0030-drops-201336},
  doi =		{10.4230/LIPIcs.CPM.2024.23},
  annote =	{Keywords: String Matching, Deterministic Finite Automata, Wheeler languages, Graph Indexing, Co-lexicographical Sorting}
}

Document

DOI: 10.4230/LIPIcs.ESA.2023.15

Sorting Finite Automata via Partition Refinement

Authors: Ruben Becker, Manuel Cáceres, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Francisco Olivares, and Nicola Prezza

Published in: LIPIcs, Volume 274, 31st Annual European Symposium on Algorithms (ESA 2023)

Abstract

Wheeler nondeterministic finite automata (WNFAs) were introduced in (Gagie et al., TCS 2017) as a powerful generalization of prefix sorting from strings to labeled graphs. WNFAs admit optimal solutions to classic hard problems on labeled graphs and languages such as compression and regular expression matching. The problem of deciding whether a given NFA is Wheeler is known to be NP-complete (Gibney and Thankachan, ESA 2019). Recently, however, Alanko et al. (Information and Computation 2021) showed how to side-step this complexity by switching to preorders: letting Q be the set of states and δ the set of transitions, they provided a O(|δ|⋅|Q|²)-time algorithm computing a totally-ordered partition (i.e. equivalence relation) of the WNFA’s states such that (1) equivalent states recognize the same regular language, and (2) the order of (the classes of) non-equivalent states is consistent with any Wheeler order, when one exists. As a result, the output is a preorder of the states as useful for pattern matching as standard Wheeler orders. Further extensions of this line of work (Cotumaccio et al., SODA 2021 and DCC 2022) generalized these concepts to arbitrary NFAs by introducing co-lex partial preorders: in general, any NFA admits a partial preorder of its states reflecting the co-lexicographic order of their accepted strings; the smaller the width of such preorder is, the faster regular expression matching queries can be performed. To date, the fastest algorithm for computing the smallest-width partial preorder on NFAs runs in O(|δ|² + |Q|^{5/2}) time (Cotumaccio, DCC 2022), while on DFAs the same task can be accomplished in O(min(|Q|²log|Q|, |δ|⋅|Q|)) time (Kim et al., CPM 2023). In this paper, we provide much more efficient solutions to the co-lex order computation problem. Our results are achieved by extending a classic algorithm for the relational coarsest partition refinement problem of Paige and Tarjan to work with ordered partitions. More specifically, we provide a O(|δ|log|Q|)-time algorithm computing a co-lex total preorder when the input is a Wheeler NFA, and an algorithm with the same time complexity computing the smallest-width co-lex partial order of any DFA. In addition, we present implementations of our algorithms and show that they are very efficient also in practice.

Cite as

Ruben Becker, Manuel Cáceres, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Francisco Olivares, and Nicola Prezza. Sorting Finite Automata via Partition Refinement. In 31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 15:1-15:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{becker_et_al:LIPIcs.ESA.2023.15,
  author =	{Becker, Ruben and C\'{a}ceres, Manuel and Cenzato, Davide and Kim, Sung-Hwan and Kodric, Bojana and Olivares, Francisco and Prezza, Nicola},
  title =	{{Sorting Finite Automata via Partition Refinement}},
  booktitle =	{31st Annual European Symposium on Algorithms (ESA 2023)},
  pages =	{15:1--15:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-295-2},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{274},
  editor =	{G{\o}rtz, Inge Li and Farach-Colton, Martin and Puglisi, Simon J. and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2023.15},
  URN =		{urn:nbn:de:0030-drops-186684},
  doi =		{10.4230/LIPIcs.ESA.2023.15},
  annote =	{Keywords: Wheeler automata, prefix sorting, pattern matching, graph compression, sorting, partition refinement}
}

@InProceedings{becker_et_al:LIPIcs.ESA.2023.15,
  author =	{Becker, Ruben and C\'{a}ceres, Manuel and Cenzato, Davide and Kim, Sung-Hwan and Kodric, Bojana and Olivares, Francisco and Prezza, Nicola},
  title =	{{Sorting Finite Automata via Partition Refinement}},
  booktitle =	{31st Annual European Symposium on Algorithms (ESA 2023)},
  pages =	{15:1--15:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-295-2},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{274},
  editor =	{G{\o}rtz, Inge Li and Farach-Colton, Martin and Puglisi, Simon J. and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2023.15},
  URN =		{urn:nbn:de:0030-drops-186684},
  doi =		{10.4230/LIPIcs.ESA.2023.15},
  annote =	{Keywords: Wheeler automata, prefix sorting, pattern matching, graph compression, sorting, partition refinement}
}

Document

DOI: 10.4230/LIPIcs.CPM.2023.16

Faster Prefix-Sorting Algorithms for Deterministic Finite Automata

Authors: Sung-Hwan Kim, Francisco Olivares, and Nicola Prezza

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

Abstract

Sorting is a fundamental algorithmic pre-processing technique which often allows to represent data more compactly and, at the same time, speeds up search queries on it. In this paper, we focus on the well-studied problem of sorting and indexing string sets. Since the introduction of suffix trees in 1973, dozens of suffix sorting algorithms have been described in the literature. In 2017, these techniques were extended to sets of strings described by means of finite automata: the theory of Wheeler graphs [Gagie et al., TCS'17] introduced automata whose states can be totally-sorted according to the co-lexicographic (co-lex in the following) order of the prefixes of words accepted by the automaton. More recently, in [Cotumaccio, Prezza, SODA'21] it was shown how to extend these ideas to arbitrary automata by means of partial co-lex orders. This work showed that a co-lex order of minimum width (thus optimizing search query times) on deterministic finite automata (DFAs) can be computed in O(m² + n^{5/2}) time, m being the number of transitions and n the number of states of the input DFA. In this paper, we exhibit new combinatorial properties of the minimum-width co-lex order of DFAs and exploit them to design faster prefix sorting algorithms. In particular, we describe two algorithms sorting arbitrary DFAs in O(mn) and O(n² log n) time, respectively, and an algorithm sorting acyclic DFAs in O(m log n) time. Within these running times, all algorithms compute also a smallest chain partition of the partial order (required to index the DFA). We present an experiment result to show that an optimized implementation of the O(n² log n)-time algorithm exhibits a nearly-linear behaviour on large deterministic pan-genomic graphs and is thus also of practical interest.

Cite as

Sung-Hwan Kim, Francisco Olivares, and Nicola Prezza. Faster Prefix-Sorting Algorithms for Deterministic Finite Automata. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 16:1-16:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{kim_et_al:LIPIcs.CPM.2023.16,
  author =	{Kim, Sung-Hwan and Olivares, Francisco and Prezza, Nicola},
  title =	{{Faster Prefix-Sorting Algorithms for Deterministic Finite Automata}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{16:1--16:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.16},
  URN =		{urn:nbn:de:0030-drops-179707},
  doi =		{10.4230/LIPIcs.CPM.2023.16},
  annote =	{Keywords: String Matching, Deterministic Finite Automata, Graph Indexing, Co-lexicographical Sorting}
}

Document

DOI: 10.4230/LIPIcs.CPM.2021.16

Compressed Weighted de Bruijn Graphs

Authors: Giuseppe F. Italiano, Nicola Prezza, Blerina Sinaimeri, and Rossano Venturini

Published in: LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

Abstract

We propose a new compressed representation for weighted de Bruijn graphs, which is based on the idea of delta-encoding the variations of k-mer abundances on a spanning branching of the graph. Our new data structure is likely to be of practical value: to give an idea, when combined with the compressed BOSS de Bruijn graph representation, it encodes the weighted de Bruijn graph of a 16x-covered DNA read-set (60M distinct k-mers, k = 28) within 4.15 bits per distinct k-mer and can answer abundance queries in about 60 microseconds on a standard machine. In contrast, state of the art tools declare a space usage of at least 30 bits per distinct k-mer for the same task, which is confirmed by our experiments. As a by-product of our new data structure, we exhibit efficient compressed data structures for answering partial sums on edge-weighted trees, which might be of independent interest.

Cite as

Giuseppe F. Italiano, Nicola Prezza, Blerina Sinaimeri, and Rossano Venturini. Compressed Weighted de Bruijn Graphs. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 16:1-16:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{italiano_et_al:LIPIcs.CPM.2021.16,
  author =	{Italiano, Giuseppe F. and Prezza, Nicola and Sinaimeri, Blerina and Venturini, Rossano},
  title =	{{Compressed Weighted de Bruijn Graphs}},
  booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
  pages =	{16:1--16:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-186-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{191},
  editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.16},
  URN =		{urn:nbn:de:0030-drops-139675},
  doi =		{10.4230/LIPIcs.CPM.2021.16},
  annote =	{Keywords: weighted de Bruijn graphs, k-mer annotation, compressed data structures, partial sums}
}

Document

Invited Talk

DOI: 10.4230/LIPIcs.SEA.2020.3

Indexing Compressed Text: A Tale of Time and Space (Invited Talk)

Authors: Nicola Prezza

Published in: LIPIcs, Volume 160, 18th International Symposium on Experimental Algorithms (SEA 2020)

Abstract

Text indexing is a classical algorithmic problem that has been studied for over four decades. The earliest optimal-time solution to the problem, the suffix tree [Weiner, 1973], dates back to 1973 and requires up to two orders of magnitude more space than the text to be stored. In the year 2000, two breakthrough works [Grossi and Vitter, 2000; Ferragina and Manzini, 2000] showed that this space overhead is not necessary: both the index and the text can be stored in a space proportional to the text’s entropy. These contributions had an enormous impact in bioinformatics: nowadays, the two most widely-used DNA aligners employ compressed indexes [Li and Durbin, 2009; Langmead et al., 2009]. In recent years, it became apparent that entropy had reached its limits: modern datasets (for example, collections of thousands of human genomes) are extremely large but very repetitive and, by its very definition, entropy cannot compress repetitive texts [S. Kreft and G. Navarro, 2013]. To overcome this problem, a new generation of indexes based on dictionary compressors (for example, LZ77 and run-length BWT) emerged [S. Kreft and G. Navarro, 2013; Gagie et al., 2020; F. Claude and G. Navarro, 2012], together with generalizations of the indexing problem to labeled graphs [Ferragina et al., 2009; Sirén et al., 2014; Travis Gagie et al., 2017]. This talk is a short and friendly survey of the landmarks of this fascinating path that took us from suffix trees to the most modern compressed indexes on labeled graphs.

Cite as

Nicola Prezza. Indexing Compressed Text: A Tale of Time and Space (Invited Talk). In 18th International Symposium on Experimental Algorithms (SEA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 160, pp. 3:1-3:2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{prezza:LIPIcs.SEA.2020.3,
  author =	{Prezza, Nicola},
  title =	{{Indexing Compressed Text: A Tale of Time and Space}},
  booktitle =	{18th International Symposium on Experimental Algorithms (SEA 2020)},
  pages =	{3:1--3:2},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-148-1},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{160},
  editor =	{Faro, Simone and Cantone, Domenico},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2020.3},
  URN =		{urn:nbn:de:0030-drops-120772},
  doi =		{10.4230/LIPIcs.SEA.2020.3},
  annote =	{Keywords: Compressed Text Indexing}
}

Document

DOI: 10.4230/LIPIcs.CPM.2019.4

Optimal Rank and Select Queries on Dictionary-Compressed Text

Authors: Nicola Prezza

Published in: LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

Abstract

We study the problem of supporting queries on a string S of length n within a space bounded by the size gamma of a string attractor for S. In the paper introducing string attractors it was shown that random access on S can be supported in optimal O(log(n/gamma)/log log n) time within O(gamma polylog n) space. In this paper, we extend this result to rank and select queries and provide lower bounds matching our upper bounds on alphabets of polylogarithmic size. Our solutions are given in the form of a space-time trade-off that is more general than the one previously known for grammars and that improves existing bounds on LZ77-compressed text by a log log n time-factor in select queries. We also provide matching lower and upper bounds for partial sum and predecessor queries within attractor-bounded space, and extend our lower bounds to encompass navigation of dictionary-compressed tree representations.

Cite as

Nicola Prezza. Optimal Rank and Select Queries on Dictionary-Compressed Text. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 4:1-4:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{prezza:LIPIcs.CPM.2019.4,
  author =	{Prezza, Nicola},
  title =	{{Optimal Rank and Select Queries on Dictionary-Compressed Text}},
  booktitle =	{30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
  pages =	{4:1--4:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-103-0},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{128},
  editor =	{Pisanti, Nadia and P. Pissis, Solon},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.4},
  URN =		{urn:nbn:de:0030-drops-104756},
  doi =		{10.4230/LIPIcs.CPM.2019.4},
  annote =	{Keywords: Rank, Select, Dictionary compression, String Attractors}
}

Document

DOI: 10.4230/LIPIcs.CPM.2019.7

Space-Efficient Computation of the LCP Array from the Burrows-Wheeler Transform

Authors: Nicola Prezza and Giovanna Rosone

Published in: LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

Abstract

We show that the Longest Common Prefix Array of a text collection of total size n on alphabet [1,sigma] can be computed from the Burrows-Wheeler transformed collection in O(n log sigma) time using o(n log sigma) bits of working space on top of the input and output. Our result improves (on small alphabets) and generalizes (to string collections) the previous solution from Beller et al., which required O(n) bits of extra working space. We also show how to merge the BWTs of two collections of total size n within the same time and space bounds. The procedure at the core of our algorithms can be used to enumerate suffix tree intervals in succinct space from the BWT, which is of independent interest. An engineered implementation of our first algorithm on DNA alphabet induces the LCP of a large (16 GiB) collection of short (100 bases) reads at a rate of 2.92 megabases per second using in total 1.5 Bytes per base in RAM. Our second algorithm merges the BWTs of two short-reads collections of 8 GiB each at a rate of 1.7 megabases per second and uses 0.625 Bytes per base in RAM. An extension of this algorithm that computes also the LCP array of the merged collection processes the data at a rate of 1.48 megabases per second and uses 1.625 Bytes per base in RAM.

Cite as

Nicola Prezza and Giovanna Rosone. Space-Efficient Computation of the LCP Array from the Burrows-Wheeler Transform. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 7:1-7:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{prezza_et_al:LIPIcs.CPM.2019.7,
  author =	{Prezza, Nicola and Rosone, Giovanna},
  title =	{{Space-Efficient Computation of the LCP Array from the Burrows-Wheeler Transform}},
  booktitle =	{30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
  pages =	{7:1--7:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-103-0},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{128},
  editor =	{Pisanti, Nadia and P. Pissis, Solon},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.7},
  URN =		{urn:nbn:de:0030-drops-104782},
  doi =		{10.4230/LIPIcs.CPM.2019.7},
  annote =	{Keywords: Burrows-Wheeler Transform, LCP array, DNA reads}
}

Document

DOI: 10.4230/LIPIcs.ESA.2018.52

String Attractors: Verification and Optimization

Authors: Dominik Kempa, Alberto Policriti, Nicola Prezza, and Eva Rotenberg

Published in: LIPIcs, Volume 112, 26th Annual European Symposium on Algorithms (ESA 2018)

Abstract

String attractors [STOC 2018] are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. A set Gamma subseteq [1..n] is a k-attractor for a string S in Sigma^n if and only if every distinct substring of S of length at most k has an occurrence crossing at least one of the positions in Gamma. Finding the smallest k-attractor is NP-hard for k >= 3, but polylogarithmic approximations can be found using reductions from dictionary compressors. It is easy to reduce the k-attractor problem to a set-cover instance where the string's positions are interpreted as sets of substrings. The main result of this paper is a much more powerful reduction based on the truncated suffix tree. Our new characterization of the problem leads to more efficient algorithms for string attractors: we show how to check the validity and minimality of a k-attractor in near-optimal time and how to quickly compute exact solutions. For example, we prove that a minimum 3-attractor can be found in O(n) time when |Sigma| in O(sqrt[3+epsilon]{log n}) for some constant epsilon > 0, despite the problem being NP-hard for large Sigma.

Cite as

Dominik Kempa, Alberto Policriti, Nicola Prezza, and Eva Rotenberg. String Attractors: Verification and Optimization. In 26th Annual European Symposium on Algorithms (ESA 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 112, pp. 52:1-52:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{kempa_et_al:LIPIcs.ESA.2018.52,
  author =	{Kempa, Dominik and Policriti, Alberto and Prezza, Nicola and Rotenberg, Eva},
  title =	{{String Attractors: Verification and Optimization}},
  booktitle =	{26th Annual European Symposium on Algorithms (ESA 2018)},
  pages =	{52:1--52:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-081-1},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{112},
  editor =	{Azar, Yossi and Bast, Hannah and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2018.52},
  URN =		{urn:nbn:de:0030-drops-95153},
  doi =		{10.4230/LIPIcs.ESA.2018.52},
  annote =	{Keywords: Dictionary compression, String attractors, Set cover}
}

Document

DOI: 10.4230/LIPIcs.WABI.2018.3

Detecting Mutations by eBWT

Authors: Nicola Prezza, Nadia Pisanti, Marinella Sciortino, and Giovanna Rosone

Published in: LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)

Abstract

In this paper we develop a theory describing how the extended Burrows-Wheeler Transform (eBWT) of a collection of DNA fragments tends to cluster together the copies of nucleotides sequenced from a genome G. Our theory accurately predicts how many copies of any nucleotide are expected inside each such cluster, and how an elegant and precise LCP array based procedure can locate these clusters in the eBWT. Our findings are very general and can be applied to a wide range of different problems. In this paper, we consider the case of alignment-free and reference-free SNPs discovery in multiple collections of reads. We note that, in accordance with our theoretical results, SNPs are clustered in the eBWT of the reads collection, and we develop a tool finding SNPs with a simple scan of the eBWT and LCP arrays. Preliminary results show that our method requires much less coverage than state-of-the-art tools while drastically improving precision and sensitivity.

Cite as

Nicola Prezza, Nadia Pisanti, Marinella Sciortino, and Giovanna Rosone. Detecting Mutations by eBWT. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 3:1-3:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{prezza_et_al:LIPIcs.WABI.2018.3,
  author =	{Prezza, Nicola and Pisanti, Nadia and Sciortino, Marinella and Rosone, Giovanna},
  title =	{{Detecting Mutations by eBWT}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{3:1--3:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.3},
  URN =		{urn:nbn:de:0030-drops-93051},
  doi =		{10.4230/LIPIcs.WABI.2018.3},
  annote =	{Keywords: BWT, LCP Array, SNPs, Reference-free, Assembly-free}
}

Document

DOI: 10.4230/LIPIcs.SEA.2017.11

A Framework of Dynamic Data Structures for String Processing

Authors: Nicola Prezza

Published in: LIPIcs, Volume 75, 16th International Symposium on Experimental Algorithms (SEA 2017)

Abstract

In this paper we present DYNAMIC, an open-source C++ library implementing dynamic compressed data structures for string manipulation. Our framework includes useful tools such as searchable partial sums, succinct/gap-encoded bitvectors, and entropy/run-length compressed strings and FM indexes. We prove close-to-optimal theoretical bounds for the resources used by our structures, and show that our theoretical predictions are empirically tightly verified in practice. To conclude, we turn our attention to applications. We compare the performance of five recently-published compression algorithms implemented using DYNAMIC with those of state-of-the-art tools performing the same task. Our experiments show that algorithms making use of dynamic compressed data structures can be up to three orders of magnitude more space-efficient (albeit slower) than classical ones performing the same tasks.

Cite as

Nicola Prezza. A Framework of Dynamic Data Structures for String Processing. In 16th International Symposium on Experimental Algorithms (SEA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 75, pp. 11:1-11:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{prezza:LIPIcs.SEA.2017.11,
  author =	{Prezza, Nicola},
  title =	{{A Framework of Dynamic Data Structures for String Processing}},
  booktitle =	{16th International Symposium on Experimental Algorithms (SEA 2017)},
  pages =	{11:1--11:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-036-1},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{75},
  editor =	{Iliopoulos, Costas S. and Pissis, Solon P. and Puglisi, Simon J. and Raman, Rajeev},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2017.11},
  URN =		{urn:nbn:de:0030-drops-76028},
  doi =		{10.4230/LIPIcs.SEA.2017.11},
  annote =	{Keywords: C++, dynamic, compression, data structure, bitvector, string}
}

Document

DOI: 10.4230/LIPIcs.CPM.2017.17

From LZ77 to the Run-Length Encoded Burrows-Wheeler Transform, and Back

Authors: Alberto Policriti and Nicola Prezza

Published in: LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Abstract

The Lempel-Ziv factorization (LZ77) and the Run-Length encoded Burrows-Wheeler Transform (RLBWT) are two important tools in text compression and indexing, being their sizes z and r closely related to the amount of text self-repetitiveness. In this paper we consider the problem of converting the two representations into each other within a working space proportional to the input and the output. Let n be the text length. We show that RLBWT can be converted to LZ77 in O(n log r) time and O(r) words of working space. Conversely, we provide an algorithm to convert LZ77 to RLBWT in O(n(log r + log z)) time and O(r+z) words of working space. Note that r and z can be constant if the text is highly repetitive, and our algorithms can operate with (up to) exponentially less space than naive solutions based on full decompression.

Cite as

Alberto Policriti and Nicola Prezza. From LZ77 to the Run-Length Encoded Burrows-Wheeler Transform, and Back. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 17:1-17:10, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{policriti_et_al:LIPIcs.CPM.2017.17,
  author =	{Policriti, Alberto and Prezza, Nicola},
  title =	{{From LZ77 to the Run-Length Encoded Burrows-Wheeler Transform, and Back}},
  booktitle =	{28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)},
  pages =	{17:1--17:10},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-039-2},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{78},
  editor =	{K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.17},
  URN =		{urn:nbn:de:0030-drops-73215},
  doi =		{10.4230/LIPIcs.CPM.2017.17},
  annote =	{Keywords: Lempel-Ziv, Burrows-Wheeler transform, compressed computation, repetitive text collections}
}

Search Results

Documents authored by Prezza, Nicola

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Thanks for your feedback!

Could not send message