Search Results

Documents authored by Tarnow, Simon R.


Document
Compressed Dictionary Matching on Run-Length Encoded Strings

Authors: Philip Bille, Inge Li Gørtz, Simon J. Puglisi, and Simon R. Tarnow

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)


Abstract
Given a set of pattern strings 𝒫 = {P₁, P₂,… P_k} and a text string S, the classic dictionary matching problem is to report all occurrences of each pattern in S. We study the dictionary problem in the compressed setting, where the pattern strings and the text string are compressed using run-length encoding, and the goal is to solve the problem without decompression and achieve efficient time and space in the size of the compressed strings. Let m and n be the total length of the patterns 𝒫 and the length of the text string S, respectively, and let ̅m and ̅n be the total number of runs in the run-length encoding of the patterns in 𝒫 and S, respectively. Our main result is an algorithm that achieves O(( ̅m + ̅n)log log m + occ) expected time, and O( ̅m) space, where occ is the total number of occurrences of patterns in S. This is the first non-trivial solution to the problem. Since any solution must read the input, our time bound is optimal within an log log m factor. We introduce several new techniques to achieve our bounds, including a new compressed representation of the classic Aho-Corasick automaton and a new efficient string index that supports fast queries in run-length encoded strings.

Cite as

Philip Bille, Inge Li Gørtz, Simon J. Puglisi, and Simon R. Tarnow. Compressed Dictionary Matching on Run-Length Encoded Strings. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 21:1-21:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.CPM.2025.21,
  author =	{Bille, Philip and G{\o}rtz, Inge Li and Puglisi, Simon J. and Tarnow, Simon R.},
  title =	{{Compressed Dictionary Matching on Run-Length Encoded Strings}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{21:1--21:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.21},
  URN =		{urn:nbn:de:0030-drops-231158},
  doi =		{10.4230/LIPIcs.CPM.2025.21},
  annote =	{Keywords: Dictionary matching, run-length encoding, compressed pattern matching}
}
Document
Succinct Data Structures for Segments

Authors: Philip Bille, Inge Li Gørtz, and Simon R. Tarnow

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)


Abstract
We consider succinct data structures for representing a set of n horizontal line segments in the plane given in rank space to support segment access, segment selection, and segment rank queries. A segment access query finds the segment (x₁, x₂, y) given its y-coordinate (y-coordinates of the segments are distinct), a segment selection query finds the jth smallest segment (the segment with the jth smallest y-coordinate) among the segments crossing the vertical line for a given x-coordinate, and a segment rank query finds the number of segments crossing the vertical line through x-coordinate i with y-coordinate at most y, for a given x and y. This problem is a central component in compressed data structures for persistent strings supporting random access. Our main result is a data structure using 2n lg n + O(n lg n / lg lg n) bits of space and O(lg n / lg lg n) query time for all operations. We show that this space bound is optimal up to lower-order terms. We will also show that the query time for segment rank is optimal. The query time for segment selection is also optimal by a previous bound. To obtain our results, we present a novel segment wavelet tree data structure of independent interest. This structure is inspired by and extends the classic wavelet tree for sequences. This leads to a simple, succinct solution with O(log n) query times. We then extend this solution to obtain optimal query time. Our space lower bound follows from a simple counting argument, and our lower bound for segment rank is obtained by a reduction from 2-dimensional counting.

Cite as

Philip Bille, Inge Li Gørtz, and Simon R. Tarnow. Succinct Data Structures for Segments. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 27:1-27:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.CPM.2025.27,
  author =	{Bille, Philip and G{\o}rtz, Inge Li and Tarnow, Simon R.},
  title =	{{Succinct Data Structures for Segments}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{27:1--27:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.27},
  URN =		{urn:nbn:de:0030-drops-231218},
  doi =		{10.4230/LIPIcs.CPM.2025.27},
  annote =	{Keywords: Succinct, Data structures, Selection}
}
Document
Faster Sliding Window String Indexing in Streams

Authors: Philip Bille, Paweł Gawrychowski, Inge Li Gørtz, and Simon R. Tarnow

Published in: LIPIcs, Volume 296, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)


Abstract
The classical string indexing problem asks to preprocess the input string S for efficient pattern matching queries. Bille, Fischer, Gørtz, Pedersen, and Stordalen [CPM 2023] generalized this to the {streaming sliding window string indexing} problem, where the input string S arrives as a stream, and we are asked to maintain an index of the last w characters, called the window. Further, at any point in time, a pattern P might appear, again given as a stream, and all occurrences of P in the current window must be output. We require that the time to process each character of the text or the pattern is worst-case. It appears that standard string indexing structures, such as suffix trees, do not provide an efficient solution in such a setting, as to obtain a good worst-case bound, they necessarily need to work right-to-left, and we cannot reverse the pattern while keeping a worst-case guarantee on the time to process each of its characters. Nevertheless, it is possible to obtain a bound of 𝒪(log w) (with high probability) by maintaining a hierarchical structure of multiple suffix trees. We significantly improve this upper bound by designing a black-box reduction to maintain a suffix tree under prepending characters to the current text. By plugging in the known results, this allows us to obtain a bound of 𝒪(log log w +log log σ) (with high probability), where σ is the size of the alphabet. Further, we introduce an even more general problem, called the {streaming dynamic window string indexing}, where the goal is to maintain the current text under adding and deleting characters at either end and design a similar black-box reduction.

Cite as

Philip Bille, Paweł Gawrychowski, Inge Li Gørtz, and Simon R. Tarnow. Faster Sliding Window String Indexing in Streams. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 8:1-8:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.CPM.2024.8,
  author =	{Bille, Philip and Gawrychowski, Pawe{\l} and G{\o}rtz, Inge Li and Tarnow, Simon R.},
  title =	{{Faster Sliding Window String Indexing in Streams}},
  booktitle =	{35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)},
  pages =	{8:1--8:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-326-3},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{296},
  editor =	{Inenaga, Shunsuke and Puglisi, Simon J.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.8},
  URN =		{urn:nbn:de:0030-drops-201183},
  doi =		{10.4230/LIPIcs.CPM.2024.8},
  annote =	{Keywords: data structures, pattern matching, string indexing}
}
Document
Hierarchical Relative Lempel-Ziv Compression

Authors: Philip Bille, Inge Li Gørtz, Simon J. Puglisi, and Simon R. Tarnow

Published in: LIPIcs, Volume 265, 21st International Symposium on Experimental Algorithms (SEA 2023)


Abstract
Relative Lempel-Ziv (RLZ) parsing is a dictionary compression method in which a string S is compressed relative to a second string R (called the reference) by parsing S into a sequence of substrings that occur in R. RLZ is particularly effective at compressing sets of strings that have a high degree of similarity to the reference string, such as a set of genomes of individuals from the same species. With the now cheap cost of DNA sequencing, such datasets have become extremely abundant and are rapidly growing. In this paper, instead of using a single reference string for the entire collection, we investigate the use of different reference strings for subsets of the collection, with the aim of improving compression. In particular, we propose a new compression scheme hierarchical relative Lempel-Ziv (HRLZ) which form a rooted tree (or hierarchy) on the strings and then compress each string using RLZ with parent as reference, storing only the root of the tree in plain text. To decompress, we traverse the tree in BFS order starting at the root, decompressing children with respect to their parent. We show that this approach leads to a twofold improvement in compression on bacterial genome datasets, with negligible effect on decompression time compared to the standard single reference approach. We show that an effective hierarchy for a given set of strings can be constructed by computing the optimal arborescence of a completed weighted digraph of the strings, with weights as the number of phrases in the RLZ parsing of the source and destination vertices. We further show that instead of computing the complete graph, a sparse graph derived using locality-sensitive hashing can significantly reduce the cost of computing a good hierarchy, without adversely effecting compression performance.

Cite as

Philip Bille, Inge Li Gørtz, Simon J. Puglisi, and Simon R. Tarnow. Hierarchical Relative Lempel-Ziv Compression. In 21st International Symposium on Experimental Algorithms (SEA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 265, pp. 18:1-18:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


Copy BibTex To Clipboard

@InProceedings{bille_et_al:LIPIcs.SEA.2023.18,
  author =	{Bille, Philip and G{\o}rtz, Inge Li and Puglisi, Simon J. and Tarnow, Simon R.},
  title =	{{Hierarchical Relative Lempel-Ziv Compression}},
  booktitle =	{21st International Symposium on Experimental Algorithms (SEA 2023)},
  pages =	{18:1--18:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-279-2},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{265},
  editor =	{Georgiadis, Loukas},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2023.18},
  URN =		{urn:nbn:de:0030-drops-183680},
  doi =		{10.4230/LIPIcs.SEA.2023.18},
  annote =	{Keywords: Relative compression, Lempel-Ziv compression, RLZ, LZ77, string collections, compressed representation, data structures, efficient algorithms}
}
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail