DROPS

Document

DOI: 10.4230/LIPIcs.STACS.2024.16

Gapped String Indexing in Subquadratic Space and Sublinear Query Time

Authors: Philip Bille, Inge Li Gørtz, Moshe Lewenstein, Solon P. Pissis, Eva Rotenberg, and Teresa Anna Steiner

Published in: LIPIcs, Volume 289, 41st International Symposium on Theoretical Aspects of Computer Science (STACS 2024)

Abstract

In Gapped String Indexing, the goal is to compactly represent a string S of length n such that for any query consisting of two strings P₁ and P₂, called patterns, and an integer interval [α, β], called gap range, we can quickly find occurrences of P₁ and P₂ in S with distance in [α, β]. Gapped String Indexing is a central problem in computational biology and text mining and has thus received significant research interest, including parameterized and heuristic approaches. Despite this interest, the best-known time-space trade-offs for Gapped String Indexing are the straightforward 𝒪(n) space and 𝒪(n+ occ) query time or Ω(n²) space and Õ(|P₁| + |P₂| + occ) query time. We break through this barrier obtaining the first interesting trade-offs with polynomially subquadratic space and polynomially sublinear query time. In particular, we show that, for every 0 ≤ δ ≤ 1, there is a data structure for Gapped String Indexing with either Õ(n^{2-δ/3}) or Õ(n^{3-2δ}) space and Õ(|P₁| + |P₂| + n^{δ}⋅ (occ+1)) query time, where occ is the number of reported occurrences. As a new fundamental tool towards obtaining our main result, we introduce the Shifted Set Intersection problem: preprocess a collection of sets S₁, …, S_k of integers such that for any query consisting of three integers i,j,s, we can quickly output YES if and only if there exist a ∈ S_i and b ∈ S_j with a+s = b. We start by showing that the Shifted Set Intersection problem is equivalent to the indexing variant of 3SUM (3SUM Indexing) [Golovnev et al., STOC 2020]. We then give a data structure for Shifted Set Intersection with gaps, which entails a solution to the Gapped String Indexing problem. Furthermore, we enhance our data structure for deciding Shifted Set Intersection, so that we can support the reporting variant of the problem, i.e., outputting all certificates in the affirmative case. Via the obtained equivalence to 3SUM Indexing, we thus give new improved data structures for the reporting variant of 3SUM Indexing, and we show how this improves upon the state-of-the-art solution for Jumbled Indexing [Chan and Lewenstein, STOC 2015] for any alphabet of constant size σ > 5.

Cite as

Philip Bille, Inge Li Gørtz, Moshe Lewenstein, Solon P. Pissis, Eva Rotenberg, and Teresa Anna Steiner. Gapped String Indexing in Subquadratic Space and Sublinear Query Time. In 41st International Symposium on Theoretical Aspects of Computer Science (STACS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 289, pp. 16:1-16:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.STACS.2024.16,
  author =	{Bille, Philip and G{\o}rtz, Inge Li and Lewenstein, Moshe and Pissis, Solon P. and Rotenberg, Eva and Steiner, Teresa Anna},
  title =	{{Gapped String Indexing in Subquadratic Space and Sublinear Query Time}},
  booktitle =	{41st International Symposium on Theoretical Aspects of Computer Science (STACS 2024)},
  pages =	{16:1--16:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-311-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{289},
  editor =	{Beyersdorff, Olaf and Kant\'{e}, Mamadou Moustapha and Kupferman, Orna and Lokshtanov, Daniel},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2024.16},
  URN =		{urn:nbn:de:0030-drops-197262},
  doi =		{10.4230/LIPIcs.STACS.2024.16},
  annote =	{Keywords: data structures, string indexing, indexing with gaps, two patterns}
}

@InProceedings{bille_et_al:LIPIcs.STACS.2024.16,
  author =	{Bille, Philip and G{\o}rtz, Inge Li and Lewenstein, Moshe and Pissis, Solon P. and Rotenberg, Eva and Steiner, Teresa Anna},
  title =	{{Gapped String Indexing in Subquadratic Space and Sublinear Query Time}},
  booktitle =	{41st International Symposium on Theoretical Aspects of Computer Science (STACS 2024)},
  pages =	{16:1--16:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-311-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{289},
  editor =	{Beyersdorff, Olaf and Kant\'{e}, Mamadou Moustapha and Kupferman, Orna and Lokshtanov, Daniel},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2024.16},
  URN =		{urn:nbn:de:0030-drops-197262},
  doi =		{10.4230/LIPIcs.STACS.2024.16},
  annote =	{Keywords: data structures, string indexing, indexing with gaps, two patterns}
}

Document

DOI: 10.4230/LIPIcs.SEA.2023.18

Hierarchical Relative Lempel-Ziv Compression

Authors: Philip Bille, Inge Li Gørtz, Simon J. Puglisi, and Simon R. Tarnow

Published in: LIPIcs, Volume 265, 21st International Symposium on Experimental Algorithms (SEA 2023)

Abstract

Relative Lempel-Ziv (RLZ) parsing is a dictionary compression method in which a string S is compressed relative to a second string R (called the reference) by parsing S into a sequence of substrings that occur in R. RLZ is particularly effective at compressing sets of strings that have a high degree of similarity to the reference string, such as a set of genomes of individuals from the same species. With the now cheap cost of DNA sequencing, such datasets have become extremely abundant and are rapidly growing. In this paper, instead of using a single reference string for the entire collection, we investigate the use of different reference strings for subsets of the collection, with the aim of improving compression. In particular, we propose a new compression scheme hierarchical relative Lempel-Ziv (HRLZ) which form a rooted tree (or hierarchy) on the strings and then compress each string using RLZ with parent as reference, storing only the root of the tree in plain text. To decompress, we traverse the tree in BFS order starting at the root, decompressing children with respect to their parent. We show that this approach leads to a twofold improvement in compression on bacterial genome datasets, with negligible effect on decompression time compared to the standard single reference approach. We show that an effective hierarchy for a given set of strings can be constructed by computing the optimal arborescence of a completed weighted digraph of the strings, with weights as the number of phrases in the RLZ parsing of the source and destination vertices. We further show that instead of computing the complete graph, a sparse graph derived using locality-sensitive hashing can significantly reduce the cost of computing a good hierarchy, without adversely effecting compression performance.

Cite as

Philip Bille, Inge Li Gørtz, Simon J. Puglisi, and Simon R. Tarnow. Hierarchical Relative Lempel-Ziv Compression. In 21st International Symposium on Experimental Algorithms (SEA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 265, pp. 18:1-18:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.SEA.2023.18,
  author =	{Bille, Philip and G{\o}rtz, Inge Li and Puglisi, Simon J. and Tarnow, Simon R.},
  title =	{{Hierarchical Relative Lempel-Ziv Compression}},
  booktitle =	{21st International Symposium on Experimental Algorithms (SEA 2023)},
  pages =	{18:1--18:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-279-2},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{265},
  editor =	{Georgiadis, Loukas},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2023.18},
  URN =		{urn:nbn:de:0030-drops-183680},
  doi =		{10.4230/LIPIcs.SEA.2023.18},
  annote =	{Keywords: Relative compression, Lempel-Ziv compression, RLZ, LZ77, string collections, compressed representation, data structures, efficient algorithms}
}

Document

DOI: 10.4230/LIPIcs.CPM.2023.4

Sliding Window String Indexing in Streams

Authors: Philip Bille, Johannes Fischer, Inge Li Gørtz, Max Rishøj Pedersen, and Tord Joakim Stordalen

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

Abstract

Given a string S over an alphabet Σ, the string indexing problem is to preprocess S to subsequently support efficient pattern matching queries, that is, given a pattern string P report all the occurrences of P in S. In this paper we study the streaming sliding window string indexing problem. Here the string S arrives as a stream, one character at a time, and the goal is to maintain an index of the last w characters, called the window, for a specified parameter w. At any point in time a pattern matching query for a pattern P may arrive, also streamed one character at a time, and all occurrences of P within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e. the window) of a stream while supporting efficient pattern matching. Our main result is a simple O(w) space data structure that uses O(log w) time with high probability to process each character from both the input string S and any pattern string P. Reporting each occurrence of P uses additional constant time per reported occurrence. Compared to previous work in similar scenarios this result is the first to achieve an efficient worst-case time per character from the input stream with high probability. We also consider a delayed variant of the problem, where a query may be answered at any point within the next δ characters that arrive from either stream. We present an O(w + δ) space data structure for this problem that improves the above time bounds to O(log (w/δ)). In particular, for a delay of δ = ε w we obtain an O(w) space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees.

Cite as

Philip Bille, Johannes Fischer, Inge Li Gørtz, Max Rishøj Pedersen, and Tord Joakim Stordalen. Sliding Window String Indexing in Streams. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 4:1-4:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.CPM.2023.4,
  author =	{Bille, Philip and Fischer, Johannes and G{\o}rtz, Inge Li and Pedersen, Max Rish{\o}j and Stordalen, Tord Joakim},
  title =	{{Sliding Window String Indexing in Streams}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{4:1--4:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.4},
  URN =		{urn:nbn:de:0030-drops-179587},
  doi =		{10.4230/LIPIcs.CPM.2023.4},
  annote =	{Keywords: String indexing, pattern matching, sliding window, streaming}
}

Document

DOI: 10.4230/LIPIcs.CPM.2022.4

The Fine-Grained Complexity of Episode Matching

Authors: Philip Bille, Inge Li Gørtz, Shay Mozes, Teresa Anna Steiner, and Oren Weimann

Published in: LIPIcs, Volume 223, 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)

Abstract

Given two strings S and P, the Episode Matching problem is to find the shortest substring of S that contains P as a subsequence. The best known upper bound for this problem is Õ(nm) by Das et al. (1997), where n,m are the lengths of S and P, respectively. Although the problem is well studied and has many applications in data mining, this bound has never been improved. In this paper we show why this is the case by proving that no O((nm)^{1-ε}) algorithm (even for binary strings) exists, unless the Strong Exponential Time Hypothesis (SETH) is false. We then consider the indexing version of the problem, where S is preprocessed into a data structure for answering episode matching queries P. We show that for any τ, there is a data structure using O(n+(n/(τ)) ^k) space that answers episode matching queries for any P of length k in O(k⋅ τ ⋅ log log n) time. We complement this upper bound with an almost matching lower bound, showing that any data structure that answers episode matching queries for patterns of length k in time O(n^δ), must use Ω(n^{k-kδ-o(1)}) space, unless the Strong k-Set Disjointness Conjecture is false. Finally, for the special case of k = 2, we present a faster construction of the data structure using fast min-plus multiplication of bounded integer matrices.

Cite as

Philip Bille, Inge Li Gørtz, Shay Mozes, Teresa Anna Steiner, and Oren Weimann. The Fine-Grained Complexity of Episode Matching. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 4:1-4:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.CPM.2022.4,
  author =	{Bille, Philip and G{\o}rtz, Inge Li and Mozes, Shay and Steiner, Teresa Anna and Weimann, Oren},
  title =	{{The Fine-Grained Complexity of Episode Matching}},
  booktitle =	{33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
  pages =	{4:1--4:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-234-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{223},
  editor =	{Bannai, Hideo and Holub, Jan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2022.4},
  URN =		{urn:nbn:de:0030-drops-161312},
  doi =		{10.4230/LIPIcs.CPM.2022.4},
  annote =	{Keywords: Pattern matching, fine-grained complexity, longest common subsequence}
}

Document

DOI: 10.4230/LIPIcs.SWAT.2022.18

Predecessor on the Ultra-Wide Word RAM

Authors: Philip Bille, Inge Li Gørtz, and Tord Stordalen

Published in: LIPIcs, Volume 227, 18th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2022)

Abstract

We consider the predecessor problem on the ultra-wide word RAM model of computation, which extends the word RAM model with ultrawords consisting of w² bits [TAMC, 2015]. The model supports arithmetic and boolean operations on ultrawords, in addition to scattered memory operations that access or modify w (potentially non-contiguous) memory addresses simultaneously. The ultra-wide word RAM model captures (and idealizes) modern vector processor architectures. Our main result is a simple, linear space data structure that supports predecessor in constant time and updates in amortized, expected constant time. This improves the space of the previous constant time solution that uses space in the order of the size of the universe. Our result is based on a new implementation of the classic x-fast trie data structure of Willard [Inform. Process. Lett. 17(2), 1983] combined with a new dictionary data structure that supports fast parallel lookups.

Cite as

Philip Bille, Inge Li Gørtz, and Tord Stordalen. Predecessor on the Ultra-Wide Word RAM. In 18th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 227, pp. 18:1-18:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.SWAT.2022.18,
  author =	{Bille, Philip and G{\o}rtz, Inge Li and Stordalen, Tord},
  title =	{{Predecessor on the Ultra-Wide Word RAM}},
  booktitle =	{18th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2022)},
  pages =	{18:1--18:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-236-5},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{227},
  editor =	{Czumaj, Artur and Xin, Qin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.SWAT.2022.18},
  URN =		{urn:nbn:de:0030-drops-161786},
  doi =		{10.4230/LIPIcs.SWAT.2022.18},
  annote =	{Keywords: Ultra-wide word RAM model, predecessor, word-level parallelism}
}

Document

Invited Talk

DOI: 10.4230/LIPIcs.CPM.2021.1

Repetitions in Strings: A "Constant" Problem (Invited Talk)

Authors: Hideo Bannai

Published in: LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

Abstract

Repeating structures in strings is one of the most fundamental characteristics of strings, and has been an important topic in the field of combinatorics on words and combinatorial pattern matching since their beginnings. In this talk, I will focus on squares and maximal repetitions and review the "runs" theorem [Hideo Bannai et al., 2017] as well as related results (e.g. [Aviezri S. Fraenkel and Jamie Simpson, 1998; Yuta Fujishige et al., 2017; Ryo Sugahara et al., 2019; Philip Bille et al., 2020; Hideo Bannai et al., 2020; Jonas Ellert and Johannes Fischer, 2021]) which address the two main questions: how many of them can be contained in a string of given length, and algorithms for computing them.

Cite as

Hideo Bannai. Repetitions in Strings: A "Constant" Problem (Invited Talk). In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, p. 1:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{bannai:LIPIcs.CPM.2021.1,
  author =	{Bannai, Hideo},
  title =	{{Repetitions in Strings: A "Constant" Problem}},
  booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
  pages =	{1:1--1:1},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-186-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{191},
  editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.1},
  URN =		{urn:nbn:de:0030-drops-139523},
  doi =		{10.4230/LIPIcs.CPM.2021.1},
  annote =	{Keywords: Maximal repetitions, Squares, Lyndon words}
}

Document

DOI: 10.4230/LIPIcs.CPM.2021.10

Gapped Indexing for Consecutive Occurrences

Authors: Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, and Teresa Anna Steiner

Published in: LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

Abstract

The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns P₁ and P₂ and a gap range [α, β] we can quickly find the consecutive occurrences of P₁ and P₂ with distance in [α, β], i.e., pairs of subsequent occurrences with distance within the range. We present data structures that use Õ(n) space and query time Õ(|P₁|+|P₂|+n^{2/3}) for existence and counting and Õ(|P₁|+|P₂|+n^{2/3}occ^{1/3}) for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using Õ(n) space must use Ω̃(|P₁| + |P₂| + √n) query time. To obtain our results we develop new techniques and ideas of independent interest including a new suffix tree decomposition and hardness of a variant of the set intersection problem.

Cite as

Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, and Teresa Anna Steiner. Gapped Indexing for Consecutive Occurrences. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 10:1-10:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.CPM.2021.10,
  author =	{Bille, Philip and G{\o}rtz, Inge Li and Pedersen, Max Rish{\o}j and Steiner, Teresa Anna},
  title =	{{Gapped Indexing for Consecutive Occurrences}},
  booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
  pages =	{10:1--10:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-186-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{191},
  editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.10},
  URN =		{urn:nbn:de:0030-drops-139615},
  doi =		{10.4230/LIPIcs.CPM.2021.10},
  annote =	{Keywords: String indexing, two patterns, consecutive occurrences, conditional lower bound}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2020.48

Random Access in Persistent Strings

Authors: Philip Bille and Inge Li Gørtz

Published in: LIPIcs, Volume 181, 31st International Symposium on Algorithms and Computation (ISAAC 2020)

Abstract

We consider compact representations of collections of similar strings that support random access queries. The collection of strings is given by a rooted tree where edges are labeled by an edit operation (inserting, deleting, or replacing a character) and a node represents the string obtained by applying the sequence of edit operations on the path from the root to the node. The goal is to compactly represent the entire collection while supporting fast random access to any part of a string in the collection. This problem captures natural scenarios such as representing the past history of an edited document or representing highly-repetitive collections. Given a tree with n nodes, we show how to represent the corresponding collection in O(n) space and optimal O(log n/ log log n) query time. This improves the previous time-space trade-offs for the problem. To obtain our results, we introduce new techniques and ideas, including a reduction to a new geometric line segment selection together with an efficient solution.

Cite as

Philip Bille and Inge Li Gørtz. Random Access in Persistent Strings. In 31st International Symposium on Algorithms and Computation (ISAAC 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 181, pp. 48:1-48:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.ISAAC.2020.48,
  author =	{Bille, Philip and G{\o}rtz, Inge Li},
  title =	{{Random Access in Persistent Strings}},
  booktitle =	{31st International Symposium on Algorithms and Computation (ISAAC 2020)},
  pages =	{48:1--48:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-173-3},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{181},
  editor =	{Cao, Yixin and Cheng, Siu-Wing and Li, Minming},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2020.48},
  URN =		{urn:nbn:de:0030-drops-133922},
  doi =		{10.4230/LIPIcs.ISAAC.2020.48},
  annote =	{Keywords: Data compression, data structures, persistent strings}
}

Document

DOI: 10.4230/LIPIcs.FSTTCS.2020.14

String Indexing for Top-k Close Consecutive Occurrences

Authors: Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, Eva Rotenberg, and Teresa Anna Steiner

Published in: LIPIcs, Volume 182, 40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020)

Abstract

The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string P, report all occurrences of P within S. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-k close consecutive occurrences problem (Sitcco). Here, a consecutive occurrence is a pair (i,j), i < j, such that P occurs at positions i and j in S and there is no occurrence of P between i and j, and their distance is defined as j-i. Given a pattern P and a parameter k, the goal is to report the top-k consecutive occurrences of P in S of minimal distance. The challenge is to compactly represent S while supporting queries in time close to the length of P and k. We give two time-space trade-offs for the problem. Let n be the length of S, m the length of P, and ε ∈ (0,1]. Our first result achieves O(nlog n) space and optimal query time of O(m+k), and our second result achieves linear space and query time O(m+k^{1+ε}). Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.

Cite as

Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, Eva Rotenberg, and Teresa Anna Steiner. String Indexing for Top-k Close Consecutive Occurrences. In 40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 182, pp. 14:1-14:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.FSTTCS.2020.14,
  author =	{Bille, Philip and G{\o}rtz, Inge Li and Pedersen, Max Rish{\o}j and Rotenberg, Eva and Steiner, Teresa Anna},
  title =	{{String Indexing for Top-k Close Consecutive Occurrences}},
  booktitle =	{40th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2020)},
  pages =	{14:1--14:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-174-0},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{182},
  editor =	{Saxena, Nitin and Simon, Sunil},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.FSTTCS.2020.14},
  URN =		{urn:nbn:de:0030-drops-132558},
  doi =		{10.4230/LIPIcs.FSTTCS.2020.14},
  annote =	{Keywords: String indexing, pattern matching, consecutive occurrences}
}

Document

Track A: Algorithms, Complexity and Games

DOI: 10.4230/LIPIcs.ICALP.2020.14

Space Efficient Construction of Lyndon Arrays in Linear Time

Authors: Philip Bille, Jonas Ellert, Johannes Fischer, Inge Li Gørtz, Florian Kurpicz, J. Ian Munro, and Eva Rotenberg

Published in: LIPIcs, Volume 168, 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020)

Abstract

Given a string S of length n, its Lyndon array identifies for each suffix S[i..n] the next lexicographically smaller suffix S[j..n], i.e. the minimal index j > i with S[i..n] ≻ S[j..n]. Apart from its plain (n log₂ n)-bit array representation, the Lyndon array can also be encoded as a succinct parentheses sequence that requires only 2n bits of space. While linear time construction algorithms for both representations exist, it has previously been unknown if the same time bound can be achieved with less than Ω(n lg n) bits of additional working space. We show that, in fact, o(n) additional bits are sufficient to compute the succinct 2n-bit version of the Lyndon array in linear time. For the plain (n log₂ n)-bit version, we only need 𝒪(1) additional words to achieve linear time. Our space efficient construction algorithm makes the Lyndon array more accessible as a fundamental data structure in applications like full-text indexing.

Cite as

Philip Bille, Jonas Ellert, Johannes Fischer, Inge Li Gørtz, Florian Kurpicz, J. Ian Munro, and Eva Rotenberg. Space Efficient Construction of Lyndon Arrays in Linear Time. In 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 168, pp. 14:1-14:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.ICALP.2020.14,
  author =	{Bille, Philip and Ellert, Jonas and Fischer, Johannes and G{\o}rtz, Inge Li and Kurpicz, Florian and Munro, J. Ian and Rotenberg, Eva},
  title =	{{Space Efficient Construction of Lyndon Arrays in Linear Time}},
  booktitle =	{47th International Colloquium on Automata, Languages, and Programming (ICALP 2020)},
  pages =	{14:1--14:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-138-2},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{168},
  editor =	{Czumaj, Artur and Dawar, Anuj and Merelli, Emanuela},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2020.14},
  URN =		{urn:nbn:de:0030-drops-124211},
  doi =		{10.4230/LIPIcs.ICALP.2020.14},
  annote =	{Keywords: String algorithms, string suffixes, succinct data structures, Lyndon word, Lyndon array, nearest smaller values, nearest smaller suffixes}
}

@InProceedings{bille_et_al:LIPIcs.ICALP.2020.14,
  author =	{Bille, Philip and Ellert, Jonas and Fischer, Johannes and G{\o}rtz, Inge Li and Kurpicz, Florian and Munro, J. Ian and Rotenberg, Eva},
  title =	{{Space Efficient Construction of Lyndon Arrays in Linear Time}},
  booktitle =	{47th International Colloquium on Automata, Languages, and Programming (ICALP 2020)},
  pages =	{14:1--14:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-138-2},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{168},
  editor =	{Czumaj, Artur and Dawar, Anuj and Merelli, Emanuela},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2020.14},
  URN =		{urn:nbn:de:0030-drops-124211},
  doi =		{10.4230/LIPIcs.ICALP.2020.14},
  annote =	{Keywords: String algorithms, string suffixes, succinct data structures, Lyndon word, Lyndon array, nearest smaller values, nearest smaller suffixes}
}

Document

DOI: 10.4230/LIPIcs.STACS.2020.10

String Indexing with Compressed Patterns

Authors: Philip Bille, Inge Li Gørtz, and Teresa Anna Steiner

Published in: LIPIcs, Volume 154, 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020)

Abstract

Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv 1977 (LZ77) compression scheme. Along the way we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern.

Cite as

Philip Bille, Inge Li Gørtz, and Teresa Anna Steiner. String Indexing with Compressed Patterns. In 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 154, pp. 10:1-10:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.STACS.2020.10,
  author =	{Bille, Philip and G{\o}rtz, Inge Li and Steiner, Teresa Anna},
  title =	{{String Indexing with Compressed Patterns}},
  booktitle =	{37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020)},
  pages =	{10:1--10:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-140-5},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{154},
  editor =	{Paul, Christophe and Bl\"{a}ser, Markus},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2020.10},
  URN =		{urn:nbn:de:0030-drops-118716},
  doi =		{10.4230/LIPIcs.STACS.2020.10},
  annote =	{Keywords: string indexing, compression, pattern matching}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2019.4

Top Tree Compression of Tries

Authors: Philip Bille, Paweł Gawrychowski, Inge Li Gørtz, Gad M. Landau, and Oren Weimann

Published in: LIPIcs, Volume 149, 30th International Symposium on Algorithms and Computation (ISAAC 2019)

Abstract

We present a compressed representation of tries based on top tree compression [ICALP 2013] that works on a standard, comparison-based, pointer machine model of computation and supports efficient prefix search queries. Namely, we show how to preprocess a set of strings of total length n over an alphabet of size sigma into a compressed data structure of worst-case optimal size O(n/log_sigma n) that given a pattern string P of length m determines if P is a prefix of one of the strings in time O(min(m log sigma,m + log n)). We show that this query time is in fact optimal regardless of the size of the data structure. Existing solutions either use Omega(n) space or rely on word RAM techniques, such as tabulation, hashing, address arithmetic, or word-level parallelism, and hence do not work on a pointer machine. Our result is the first solution on a pointer machine that achieves worst-case o(n) space. Along the way, we develop several interesting data structures that work on a pointer machine and are of independent interest. These include an optimal data structures for random access to a grammar-compressed string and an optimal data structure for a variant of the level ancestor problem.

Cite as

Philip Bille, Paweł Gawrychowski, Inge Li Gørtz, Gad M. Landau, and Oren Weimann. Top Tree Compression of Tries. In 30th International Symposium on Algorithms and Computation (ISAAC 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 149, pp. 4:1-4:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.ISAAC.2019.4,
  author =	{Bille, Philip and Gawrychowski, Pawe{\l} and G{\o}rtz, Inge Li and Landau, Gad M. and Weimann, Oren},
  title =	{{Top Tree Compression of Tries}},
  booktitle =	{30th International Symposium on Algorithms and Computation (ISAAC 2019)},
  pages =	{4:1--4:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-130-6},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{149},
  editor =	{Lu, Pinyan and Zhang, Guochuan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2019.4},
  URN =		{urn:nbn:de:0030-drops-115000},
  doi =		{10.4230/LIPIcs.ISAAC.2019.4},
  annote =	{Keywords: pattern matching, tree compression, top trees, pointer machine}
}

Document

DOI: 10.4230/LIPIcs.MFCS.2019.71

From Regular Expression Matching to Parsing

Authors: Philip Bille and Inge Li Gørtz

Published in: LIPIcs, Volume 138, 44th International Symposium on Mathematical Foundations of Computer Science (MFCS 2019)

Abstract

Given a regular expression R and a string Q, the regular expression parsing problem is to determine if Q matches R and if so, determine how it matches, e.g., by a mapping of the characters of Q to the characters in R. Regular expression parsing makes finding matches of a regular expression even more useful by allowing us to directly extract subpatterns of the match, e.g., for extracting IP-addresses from internet traffic analysis or extracting subparts of genomes from genetic data bases. We present a new general techniques for efficiently converting a large class of algorithms that determine if a string Q matches regular expression R into algorithms that can construct a corresponding mapping. As a consequence, we obtain the first efficient linear space solutions for regular expression parsing.

Cite as

Philip Bille and Inge Li Gørtz. From Regular Expression Matching to Parsing. In 44th International Symposium on Mathematical Foundations of Computer Science (MFCS 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 138, pp. 71:1-71:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.MFCS.2019.71,
  author =	{Bille, Philip and G{\o}rtz, Inge Li},
  title =	{{From Regular Expression Matching to Parsing}},
  booktitle =	{44th International Symposium on Mathematical Foundations of Computer Science (MFCS 2019)},
  pages =	{71:1--71:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-117-7},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{138},
  editor =	{Rossmanith, Peter and Heggernes, Pinar and Katoen, Joost-Pieter},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2019.71},
  URN =		{urn:nbn:de:0030-drops-110150},
  doi =		{10.4230/LIPIcs.MFCS.2019.71},
  annote =	{Keywords: regular expressions, finite automata, regular expression parsing, algorithms}
}

Document

DOI: 10.4230/LIPIcs.ESA.2017.16

Fast Dynamic Arrays

Authors: Philip Bille, Anders Roy Christiansen, Mikko Berggren Ettienne, and Inge Li Gørtz

Published in: LIPIcs, Volume 87, 25th Annual European Symposium on Algorithms (ESA 2017)

Abstract

We present a highly optimized implementation of tiered vectors, a data structure for maintaining a sequence of n elements supporting access in time O(1) and insertion and deletion in time O(n^e) for e > 0 while using o(n) extra space. We consider several different implementation optimizations in C++ and compare their performance to that of vector and set from the standard library on sequences with up to 10^8 elements. Our fastest implementation uses much less space than set while providing speedups of 40x for access operations compared to set and speedups of 10.000x compared to vector for insertion and deletion operations while being competitive with both data structures for all other operations.

Cite as

Philip Bille, Anders Roy Christiansen, Mikko Berggren Ettienne, and Inge Li Gørtz. Fast Dynamic Arrays. In 25th Annual European Symposium on Algorithms (ESA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 87, pp. 16:1-16:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.ESA.2017.16,
  author =	{Bille, Philip and Christiansen, Anders Roy and Ettienne, Mikko Berggren and G{\o}rtz, Inge Li},
  title =	{{Fast Dynamic Arrays}},
  booktitle =	{25th Annual European Symposium on Algorithms (ESA 2017)},
  pages =	{16:1--16:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-049-1},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{87},
  editor =	{Pruhs, Kirk and Sohler, Christian},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2017.16},
  URN =		{urn:nbn:de:0030-drops-78309},
  doi =		{10.4230/LIPIcs.ESA.2017.16},
  annote =	{Keywords: Dynamic Arrays, Tiered Vectors}
}

Document

DOI: 10.4230/LIPIcs.CPM.2017.6

Deterministic Indexing for Packed Strings

Authors: Philip Bille, Inge Li Gørtz, and Frederik Rye Skjoldjensen

Published in: LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Abstract

Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In the deterministic variant the goal is to solve the string indexing problem without any randomization (at preprocessing time or query time). In the packed variant the strings are stored with several character in a single word, giving us the opportunity to read multiple characters simultaneously. Our main result is a new string index in the deterministic and packed setting. Given a packed string S of length n over an alphabet s, we show how to preprocess S in O(n) (deterministic) time and space O(n) such that given a packed pattern string of length m we can support queries in (deterministic) time O(m/a + log m + log log s), where a = w /log s is the number of characters packed in a word of size w = log n. Our query time is always at least as good as the previous best known bounds and whenever several characters are packed in a word, i.e., log s << w, the query times are faster.

Cite as

Philip Bille, Inge Li Gørtz, and Frederik Rye Skjoldjensen. Deterministic Indexing for Packed Strings. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 6:1-6:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.CPM.2017.6,
  author =	{Bille, Philip and G{\o}rtz, Inge Li and Skjoldjensen, Frederik Rye},
  title =	{{Deterministic Indexing for Packed Strings}},
  booktitle =	{28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)},
  pages =	{6:1--6:11},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-039-2},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{78},
  editor =	{K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.6},
  URN =		{urn:nbn:de:0030-drops-73351},
  doi =		{10.4230/LIPIcs.CPM.2017.6},
  annote =	{Keywords: suffix tree, suffix array, deterministic algorithm, word packing}
}

22 Search Results for "Bille, Philip"

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Thanks for your feedback!

Could not send message