DROPS

Document

DOI: 10.4230/LIPIcs.STACS.2024.24

Approximate Circular Pattern Matching Under Edit Distance

Authors: Panagiotis Charalampopoulos, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba

Published in: LIPIcs, Volume 289, 41st International Symposium on Theoretical Aspects of Computer Science (STACS 2024)

Abstract

In the k-Edit Circular Pattern Matching (k-Edit CPM) problem, we are given a length-n text T, a length-m pattern P, and a positive integer threshold k, and we are to report all starting positions of the substrings of T that are at edit distance at most k from some cyclic rotation of P. In the decision version of the problem, we are to check if any such substring exists. Very recently, Charalampopoulos et al. [ESA 2022] presented 𝒪(nk²)-time and 𝒪(nk log³ k)-time solutions for the reporting and decision versions of k-Edit CPM, respectively. Here, we show that the reporting and decision versions of k-Edit CPM can be solved in 𝒪(n+(n/m) k⁶) time and 𝒪(n+(n/m) k⁵ log³ k) time, respectively, thus obtaining the first algorithms with a complexity of the type 𝒪(n+(n/m) poly(k)) for this problem. Notably, our algorithms run in 𝒪(n) time when m = Ω(k⁶) and are superior to the previous respective solutions when m = ω(k⁴). We provide a meta-algorithm that yields efficient algorithms in several other interesting settings, such as when the strings are given in a compressed form (as straight-line programs), when the strings are dynamic, or when we have a quantum computer. We obtain our solutions by exploiting the structure of approximate circular occurrences of P in T, when T is relatively short w.r.t. P. Roughly speaking, either the starting positions of approximate occurrences of rotations of P form 𝒪(k⁴) intervals that can be computed efficiently, or some rotation of P is almost periodic (is at a small edit distance from a string with small period). Dealing with the almost periodic case is the most technically demanding part of this work; we tackle it using properties of locked fragments (originating from [Cole and Hariharan, SICOMP 2002]).

Cite as

Panagiotis Charalampopoulos, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba. Approximate Circular Pattern Matching Under Edit Distance. In 41st International Symposium on Theoretical Aspects of Computer Science (STACS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 289, pp. 24:1-24:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{charalampopoulos_et_al:LIPIcs.STACS.2024.24,
  author =	{Charalampopoulos, Panagiotis and Pissis, Solon P. and Radoszewski, Jakub and Rytter, Wojciech and Wale\'{n}, Tomasz and Zuba, Wiktor},
  title =	{{Approximate Circular Pattern Matching Under Edit Distance}},
  booktitle =	{41st International Symposium on Theoretical Aspects of Computer Science (STACS 2024)},
  pages =	{24:1--24:22},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-311-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{289},
  editor =	{Beyersdorff, Olaf and Kant\'{e}, Mamadou Moustapha and Kupferman, Orna and Lokshtanov, Daniel},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2024.24},
  URN =		{urn:nbn:de:0030-drops-197346},
  doi =		{10.4230/LIPIcs.STACS.2024.24},
  annote =	{Keywords: circular pattern matching, approximate pattern matching, edit distance}
}

@InProceedings{charalampopoulos_et_al:LIPIcs.STACS.2024.24,
  author =	{Charalampopoulos, Panagiotis and Pissis, Solon P. and Radoszewski, Jakub and Rytter, Wojciech and Wale\'{n}, Tomasz and Zuba, Wiktor},
  title =	{{Approximate Circular Pattern Matching Under Edit Distance}},
  booktitle =	{41st International Symposium on Theoretical Aspects of Computer Science (STACS 2024)},
  pages =	{24:1--24:22},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-311-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{289},
  editor =	{Beyersdorff, Olaf and Kant\'{e}, Mamadou Moustapha and Kupferman, Orna and Lokshtanov, Daniel},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2024.24},
  URN =		{urn:nbn:de:0030-drops-197346},
  doi =		{10.4230/LIPIcs.STACS.2024.24},
  annote =	{Keywords: circular pattern matching, approximate pattern matching, edit distance}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2023.10

Small-Space Algorithms for the Online Language Distance Problem for Palindromes and Squares

Authors: Gabriel Bathie, Tomasz Kociumaka, and Tatiana Starikovskaya

Published in: LIPIcs, Volume 283, 34th International Symposium on Algorithms and Computation (ISAAC 2023)

Abstract

We study the online variant of the language distance problem for two classical formal languages, the language of palindromes and the language of squares, and for the two most fundamental distances, the Hamming distance and the edit (Levenshtein) distance. In this problem, defined for a fixed formal language L, we are given a string T of length n, and the task is to compute the minimal distance to L from every prefix of T. We focus on the low-distance regime, where one must compute only the distances smaller than a given threshold k. In this work, our contribution is twofold: 1) First, we show streaming algorithms, which access the input string T only through a single left-to-right scan. Both for palindromes and squares, our algorithms use O(k polylog n) space and time per character in the Hamming-distance case and O(k² polylog n) space and time per character in the edit-distance case. These algorithms are randomised by necessity, and they err with probability inverse-polynomial in n. 2) Second, we show deterministic read-only online algorithms, which are also provided with read-only random access to the already processed characters of T. Both for palindromes and squares, our algorithms use O(k polylog n) space and time per character in the Hamming-distance case and O(k⁴ polylog n) space and amortised time per character in the edit-distance case.

Cite as

Gabriel Bathie, Tomasz Kociumaka, and Tatiana Starikovskaya. Small-Space Algorithms for the Online Language Distance Problem for Palindromes and Squares. In 34th International Symposium on Algorithms and Computation (ISAAC 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 283, pp. 10:1-10:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{bathie_et_al:LIPIcs.ISAAC.2023.10,
  author =	{Bathie, Gabriel and Kociumaka, Tomasz and Starikovskaya, Tatiana},
  title =	{{Small-Space Algorithms for the Online Language Distance Problem for Palindromes and Squares}},
  booktitle =	{34th International Symposium on Algorithms and Computation (ISAAC 2023)},
  pages =	{10:1--10:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-289-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{283},
  editor =	{Iwata, Satoru and Kakimura, Naonori},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2023.10},
  URN =		{urn:nbn:de:0030-drops-193124},
  doi =		{10.4230/LIPIcs.ISAAC.2023.10},
  annote =	{Keywords: Approximate pattern matching, streaming algorithms, palindromes, squares}
}

Document

DOI: 10.4230/LIPIcs.ESA.2023.72

Bellman-Ford Is Optimal for Shortest Hop-Bounded Paths

Authors: Tomasz Kociumaka and Adam Polak

Published in: LIPIcs, Volume 274, 31st Annual European Symposium on Algorithms (ESA 2023)

Abstract

This paper is about the problem of finding a shortest s-t path using at most h edges in edge-weighted graphs. The Bellman-Ford algorithm solves this problem in O(hm) time, where m is the number of edges. We show that this running time is optimal, up to subpolynomial factors, under popular fine-grained complexity assumptions. More specifically, we show that under the APSP Hypothesis the problem cannot be solved faster already in undirected graphs with nonnegative edge weights. This lower bound holds even restricted to graphs of arbitrary density and for arbitrary h ∈ O(√m). Moreover, under a stronger assumption, namely the Min-Plus Convolution Hypothesis, we can eliminate the restriction h ∈ O(√m). In other words, the O(hm) bound is tight for the entire space of parameters h, m, and n, where n is the number of nodes. Our lower bounds can be contrasted with the recent near-linear time algorithm for the negative-weight Single-Source Shortest Paths problem, which is the textbook application of the Bellman-Ford algorithm.

Cite as

Tomasz Kociumaka and Adam Polak. Bellman-Ford Is Optimal for Shortest Hop-Bounded Paths. In 31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 72:1-72:10, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{kociumaka_et_al:LIPIcs.ESA.2023.72,
  author =	{Kociumaka, Tomasz and Polak, Adam},
  title =	{{Bellman-Ford Is Optimal for Shortest Hop-Bounded Paths}},
  booktitle =	{31st Annual European Symposium on Algorithms (ESA 2023)},
  pages =	{72:1--72:10},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-295-2},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{274},
  editor =	{G{\o}rtz, Inge Li and Farach-Colton, Martin and Puglisi, Simon J. and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2023.72},
  URN =		{urn:nbn:de:0030-drops-187257},
  doi =		{10.4230/LIPIcs.ESA.2023.72},
  annote =	{Keywords: Fine-grained complexity, graph algorithms, lower bounds, shortest paths}
}

Document

Track A: Algorithms, Complexity and Games

DOI: 10.4230/LIPIcs.ICALP.2023.22

Streaming k-Edit Approximate Pattern Matching via String Decomposition

Authors: Sudatta Bhattacharya and Michal Koucký

Published in: LIPIcs, Volume 261, 50th International Colloquium on Automata, Languages, and Programming (ICALP 2023)

Abstract

In this paper we give an algorithm for streaming k-edit approximate pattern matching which uses space Õ(k²) and time Õ(k²) per arriving symbol. This improves substantially on the recent algorithm of Kociumaka, Porat and Starikovskaya [Kociumaka et al., 2022] which uses space Õ(k⁵) and time Õ(k⁸) per arriving symbol. In the k-edit approximate pattern matching problem we get a pattern P and text T and we want to identify all substrings of the text T that are at edit distance at most k from P. In the streaming version of this problem both the pattern and the text arrive in a streaming fashion symbol by symbol and after each symbol of the text we need to report whether there is a current suffix of the text with edit distance at most k from P. We measure the total space needed by the algorithm and time needed per arriving symbol.

Cite as

Sudatta Bhattacharya and Michal Koucký. Streaming k-Edit Approximate Pattern Matching via String Decomposition. In 50th International Colloquium on Automata, Languages, and Programming (ICALP 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 261, pp. 22:1-22:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{bhattacharya_et_al:LIPIcs.ICALP.2023.22,
  author =	{Bhattacharya, Sudatta and Kouck\'{y}, Michal},
  title =	{{Streaming k-Edit Approximate Pattern Matching via String Decomposition}},
  booktitle =	{50th International Colloquium on Automata, Languages, and Programming (ICALP 2023)},
  pages =	{22:1--22:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-278-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{261},
  editor =	{Etessami, Kousha and Feige, Uriel and Puppis, Gabriele},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2023.22},
  URN =		{urn:nbn:de:0030-drops-180741},
  doi =		{10.4230/LIPIcs.ICALP.2023.22},
  annote =	{Keywords: Approximate pattern matching, edit distance, streaming algorithms}
}

Document

DOI: 10.4230/LIPIcs.CPM.2023.15

Linear-Time Computation of Cyclic Roots and Cyclic Covers of a String

Authors: Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

Abstract

Cyclic versions of covers and roots of a string are considered in this paper. A prefix V of a string S is a cyclic root of S if S is a concatenation of cyclic rotations of V. A prefix V of S is a cyclic cover of S if the occurrences of the cyclic rotations of V cover all positions of S. We present 𝒪(n)-time algorithms computing all cyclic roots (using number-theoretic tools) and all cyclic covers (using tools related to seeds) of a length-n string over an integer alphabet. Our results improve upon 𝒪(n log log n) and 𝒪(n log n) time complexities of recent algorithms of Grossi et al. (WALCOM 2023) for the respective problems and provide novel approaches to the problems. As a by-product, we obtain an optimal data structure for Internal Circular Pattern Matching queries that generalize Internal Pattern Matching and Cyclic Equivalence queries of Kociumaka et al. (SODA 2015).

Cite as

Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba. Linear-Time Computation of Cyclic Roots and Cyclic Covers of a String. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 15:1-15:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{iliopoulos_et_al:LIPIcs.CPM.2023.15,
  author =	{Iliopoulos, Costas S. and Kociumaka, Tomasz and Radoszewski, Jakub and Rytter, Wojciech and Wale\'{n}, Tomasz and Zuba, Wiktor},
  title =	{{Linear-Time Computation of Cyclic Roots and Cyclic Covers of a String}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{15:1--15:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.15},
  URN =		{urn:nbn:de:0030-drops-179697},
  doi =		{10.4230/LIPIcs.CPM.2023.15},
  annote =	{Keywords: cyclic cover, cyclic root, circular pattern matching, internal pattern matching}
}

Document

DOI: 10.4230/LIPIcs.ITCS.2023.58

An Algorithmic Bridge Between Hamming and Levenshtein Distances

Authors: Elazar Goldenberg, Tomasz Kociumaka, Robert Krauthgamer, and Barna Saha

Published in: LIPIcs, Volume 251, 14th Innovations in Theoretical Computer Science Conference (ITCS 2023)

Abstract

The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we consider substitutions being cheaper than indels, with cost 1/a for a parameter a ≥ 1. This basic variant, denoted ED_a, bridges classical edit distance (a = 1) with Hamming distance (a → ∞), leading to interesting algorithmic challenges: Does the time complexity of computing ED_a interpolate between that of Hamming distance (linear time) and edit distance (quadratic time)? What about approximating ED_a? We first present a simple deterministic exact algorithm for ED_a and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture. Our main result is a randomized algorithm computing a (1+ε)-approximation of ED_a(X,Y), given strings X,Y of total length n and a bound k ≥ ED_a(X,Y). For simplicity, let us focus on k ≥ 1 and a constant ε > 0; then, our algorithm takes Õ(n/a + ak³) time. Unless a = Õ(1), in which case ED_a resembles the standard edit distance, and for the most interesting regime of small enough k, this running time is sublinear in n. We also consider a very natural version that asks to find a (k_I, k_S)-alignment, i.e., an alignment with at most k_I indels and k_S substitutions. In this setting, we give an exact algorithm and, more importantly, an Õ((nk_I)/k_S + k_S k_I³)-time (1,1+ε)-bicriteria approximation algorithm. The latter solution is based on the techniques we develop for ED_a for a = Θ(k_S/k_I), and its running time is again sublinear in n whenever k_I ≪ k_S and the overall distance is small enough. These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving (1+ε)-approximation in sublinear time, even for a favorable choice of k.

Cite as

Elazar Goldenberg, Tomasz Kociumaka, Robert Krauthgamer, and Barna Saha. An Algorithmic Bridge Between Hamming and Levenshtein Distances. In 14th Innovations in Theoretical Computer Science Conference (ITCS 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 251, pp. 58:1-58:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{goldenberg_et_al:LIPIcs.ITCS.2023.58,
  author =	{Goldenberg, Elazar and Kociumaka, Tomasz and Krauthgamer, Robert and Saha, Barna},
  title =	{{An Algorithmic Bridge Between Hamming and Levenshtein Distances}},
  booktitle =	{14th Innovations in Theoretical Computer Science Conference (ITCS 2023)},
  pages =	{58:1--58:23},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-263-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{251},
  editor =	{Tauman Kalai, Yael},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ITCS.2023.58},
  URN =		{urn:nbn:de:0030-drops-175615},
  doi =		{10.4230/LIPIcs.ITCS.2023.58},
  annote =	{Keywords: edit distance, Hamming distance, Longest Common Extension queries}
}

Document

DOI: 10.4230/LIPIcs.ESA.2022.35

Approximate Circular Pattern Matching

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, Solon P. Pissis, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba

Published in: LIPIcs, Volume 244, 30th Annual European Symposium on Algorithms (ESA 2022)

Abstract

We investigate the complexity of approximate circular pattern matching (CPM, in short) under the Hamming and edit distance. Under each of these two basic metrics, we are given a length-n text T, a length-m pattern P, and a positive integer threshold k, and we are to report all starting positions (called occurrences) of fragments of T that are at distance at most k from some cyclic rotation of P. In the decision version of the problem, we are to check if there is any such occurrence. All previous results for approximate CPM were either average-case upper bounds or heuristics, with the exception of the work of Charalampopoulos et al. [CKP^+, JCSS'21], who considered only the Hamming distance. For the reporting version of the approximate CPM problem, under the Hamming distance we improve upon the main algorithm of [CKP^+, JCSS'21] from 𝒪(n+(n/m) ⋅ k⁴) to 𝒪(n+(n/m) ⋅ k³ log log k) time; for the edit distance, we give an 𝒪(nk²)-time algorithm. Notably, for the decision versions and wide parameter-ranges, we give algorithms whose complexities are almost identical to the state-of-the-art for standard (i.e., non-circular) approximate pattern matching: - For the decision version of the approximate CPM problem under the Hamming distance, we obtain an 𝒪(n+(n/m) ⋅ k² log k / log log k)-time algorithm, which works in 𝒪(n) time whenever k = 𝒪(√{m log log m / log m}). In comparison, the fastest algorithm for the standard counterpart of the problem, by Chan et al. [CGKKP, STOC’20], runs in 𝒪(n) time only for k = 𝒪(√m). We achieve this result via a reduction to a geometric problem by building on ideas from [CKP^+, JCSS'21] and Charalampopoulos et al. [CKW, FOCS'20]. - For the decision version of the approximate CPM problem under the edit distance, the 𝒪(nklog³ k) runtime of our algorithm near matches the 𝒪(nk) runtime of the Landau-Vishkin algorithm [LV, J. Algorithms'89] for approximate pattern matching under edit distance; the latter algorithm remains the fastest known for k = Ω(m^{2/5}). As a stepping stone, we propose an 𝒪(nklog³ k)-time algorithm for solving the Longest Prefix k'-Approximate Match problem, proposed by Landau et al. [LMS, SICOMP'98], for all k' ∈ {1,…,k}. Our algorithm is based on Tiskin’s theory of seaweeds [Tiskin, Math. Comput. Sci.'08], with recent advancements (see Charalampopoulos et al. [CKW, FOCS'22]), and on exploiting the seaweeds' relation to Monge matrices. In contrast, we obtain a conditional lower bound that suggests a polynomial separation between approximate CPM under the Hamming distance over the binary alphabet and its non-circular counterpart. We also show that a strongly subquadratic-time algorithm for the decision version of approximate CPM under edit distance would refute the Strong Exponential Time Hypothesis.

Cite as

Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, Solon P. Pissis, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba. Approximate Circular Pattern Matching. In 30th Annual European Symposium on Algorithms (ESA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 244, pp. 35:1-35:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{charalampopoulos_et_al:LIPIcs.ESA.2022.35,
  author =	{Charalampopoulos, Panagiotis and Kociumaka, Tomasz and Radoszewski, Jakub and Pissis, Solon P. and Rytter, Wojciech and Wale\'{n}, Tomasz and Zuba, Wiktor},
  title =	{{Approximate Circular Pattern Matching}},
  booktitle =	{30th Annual European Symposium on Algorithms (ESA 2022)},
  pages =	{35:1--35:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-247-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{244},
  editor =	{Chechik, Shiri and Navarro, Gonzalo and Rotenberg, Eva and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2022.35},
  URN =		{urn:nbn:de:0030-drops-169738},
  doi =		{10.4230/LIPIcs.ESA.2022.35},
  annote =	{Keywords: approximate circular pattern matching, Hamming distance, edit distance}
}

Document

Track A: Algorithms, Complexity and Games

DOI: 10.4230/LIPIcs.ICALP.2022.49

Improved Approximation Algorithms for Dyck Edit Distance and RNA Folding

Authors: Debarati Das, Tomasz Kociumaka, and Barna Saha

Published in: LIPIcs, Volume 229, 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)

Abstract

The Dyck language, which consists of well-balanced sequences of parentheses, is one of the most fundamental context-free languages. The Dyck edit distance quantifies the number of edits (character insertions, deletions, and substitutions) required to make a given length-n parenthesis sequence well-balanced. RNA Folding involves a similar problem, where a closing parenthesis can match an opening parenthesis of the same type irrespective of their ordering. For example, in RNA Folding, both () and )( are valid matches, whereas the Dyck language only allows () as a match. Both of these problems have been studied extensively in the literature. Using fast matrix multiplication, it is possible to compute their exact solutions in time O(n^2.687) (Chi, Duan, Xie, Zhang, STOC'22), and a (1+ε)-multiplicative approximation is known with a running time of Ω(n^2.372). The impracticality of fast matrix multiplication often makes combinatorial algorithms much more desirable. Unfortunately, it is known that the problems of (exactly) computing the Dyck edit distance and the folding distance are at least as hard as Boolean matrix multiplication. Thereby, they are unlikely to admit truly subcubic-time combinatorial algorithms. In terms of fast approximation algorithms that are combinatorial in nature, the state of the art for Dyck edit distance is an O(log n)-factor approximation algorithm that runs in near-linear time (Saha, FOCS'14), whereas for RNA Folding only an ε n-additive approximation in Õ(n²/ε) time (Saha, FOCS'17) is known. In this paper, we make substantial improvements to the state of the art for Dyck edit distance (with any number of parenthesis types). We design a constant-factor approximation algorithm that runs in Õ(n^1.971) time (the first constant-factor approximation in subquadratic time). Moreover, we develop a (1+ε)-factor approximation algorithm running in Õ(n²/ε) time, which improves upon the earlier additive approximation. Finally, we design a (3+ε)-approximation that takes Õ(nd/ε) time, where d ≥ 1 is an upper bound on the sought distance. As for RNA folding, for any s ≥ 1, we design a factor-s approximation algorithm that runs in O(n+(n/s)³) time. To the best of our knowledge, this is the first nontrivial approximation algorithm for RNA Folding that can go below the n² barrier. All our algorithms are combinatorial in nature.

Cite as

Debarati Das, Tomasz Kociumaka, and Barna Saha. Improved Approximation Algorithms for Dyck Edit Distance and RNA Folding. In 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 229, pp. 49:1-49:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{das_et_al:LIPIcs.ICALP.2022.49,
  author =	{Das, Debarati and Kociumaka, Tomasz and Saha, Barna},
  title =	{{Improved Approximation Algorithms for Dyck Edit Distance and RNA Folding}},
  booktitle =	{49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)},
  pages =	{49:1--49:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-235-8},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{229},
  editor =	{Boja\'{n}czyk, Miko{\l}aj and Merelli, Emanuela and Woodruff, David P.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2022.49},
  URN =		{urn:nbn:de:0030-drops-163902},
  doi =		{10.4230/LIPIcs.ICALP.2022.49},
  annote =	{Keywords: Dyck Edit Distance, RNA Folding, String Algorithms}
}

Document

DOI: 10.4230/LIPIcs.CPM.2022.18

The Dynamic k-Mismatch Problem

Authors: Raphaël Clifford, Paweł Gawrychowski, Tomasz Kociumaka, Daniel P. Martin, and Przemysław Uznański

Published in: LIPIcs, Volume 223, 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)

Abstract

The text-to-pattern Hamming distances problem asks to compute the Hamming distances between a given pattern of length m and all length-m substrings of a given text of length n ≥ m. We focus on the well-studied k-mismatch version of the problem, where a distance needs to be returned only if it does not exceed a threshold k. Moreover, we assume n ≤ 2m (in general, one can partition the text into overlapping blocks). In this work, we develop data structures for the dynamic version of the k-mismatch problem supporting two operations: An update performs a single-letter substitution in the pattern or the text, whereas a query, given an index i, returns the Hamming distance between the pattern and the text substring starting at position i, or reports that the distance exceeds k. First, we describe a simple data structure with 𝒪̃(1) update time and 𝒪̃(k) query time. Through considerably more sophisticated techniques, we show that 𝒪̃(k) update time and 𝒪̃(1) query time is also achievable. These two solutions likely provide an essentially optimal trade-off for the dynamic k-mismatch problem with m^{Ω(1)} ≤ k ≤ √m: we prove that, in that case, conditioned on the 3SUM conjecture, one cannot simultaneously achieve k^{1-Ω(1)} time for all operations (updates and queries) after n^{𝒪(1)}-time initialization. For k ≥ √m, the same lower bound excludes achieving m^{1/2-Ω(1)} time per operation. This is known to be essentially tight for constant-sized alphabets: already Clifford et al. (STACS 2018) achieved 𝒪̃(√m) time per operation in that case, but their solution for large alphabets costs 𝒪̃(m^{3/4}) time per operation. We improve and extend the latter result by developing a trade-off algorithm that, given a parameter 1 ≤ x ≤ k, achieves update time 𝒪̃(m/k +√{mk/x}) and query time 𝒪̃(x). In particular, for k ≥ √m, an appropriate choice of x yields 𝒪̃(∛{mk}) time per operation, which is 𝒪̃(m^{2/3}) when only the trivial threshold k = m is provided.

Cite as

Raphaël Clifford, Paweł Gawrychowski, Tomasz Kociumaka, Daniel P. Martin, and Przemysław Uznański. The Dynamic k-Mismatch Problem. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 18:1-18:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{clifford_et_al:LIPIcs.CPM.2022.18,
  author =	{Clifford, Rapha\"{e}l and Gawrychowski, Pawe{\l} and Kociumaka, Tomasz and Martin, Daniel P. and Uzna\'{n}ski, Przemys{\l}aw},
  title =	{{The Dynamic k-Mismatch Problem}},
  booktitle =	{33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
  pages =	{18:1--18:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-234-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{223},
  editor =	{Bannai, Hideo and Holub, Jan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2022.18},
  URN =		{urn:nbn:de:0030-drops-161454},
  doi =		{10.4230/LIPIcs.CPM.2022.18},
  annote =	{Keywords: Pattern matching, Hamming distance, dynamic algorithms}
}

Document

DOI: 10.4230/LIPIcs.ESA.2021.30

Faster Algorithms for Longest Common Substring

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, and Jakub Radoszewski

Published in: LIPIcs, Volume 204, 29th Annual European Symposium on Algorithms (ESA 2021)

Abstract

In the classic longest common substring (LCS) problem, we are given two strings S and T, each of length at most n, over an alphabet of size σ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an 𝒪(n log σ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an 𝒪(n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in 𝒪(n log σ/log n) space and read in 𝒪(n log σ/log n) time. We show that, in this model, we can compute an LCS in time 𝒪(n log σ / √{log n}), which is sublinear in n if σ = 2^{o(√{log n})} (in particular, if σ = 𝒪(1)), using optimal space 𝒪(n log σ/log n). We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in 𝒪(n log n) time [IPL 2015]. Thankachan et al. extended this result to computing a k-mismatch LCS in 𝒪(n log^k n) time for k = 𝒪(1) [J. Comput. Biol. 2016]. We show an 𝒪(n log^{k-1/2} n)-time algorithm, for any constant integer k > 0 and irrespective of the alphabet size, using 𝒪(n) space as the previous approaches. We thus notably break through the well-known n log^k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors.

Cite as

Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, and Jakub Radoszewski. Faster Algorithms for Longest Common Substring. In 29th Annual European Symposium on Algorithms (ESA 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 204, pp. 30:1-30:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{charalampopoulos_et_al:LIPIcs.ESA.2021.30,
  author =	{Charalampopoulos, Panagiotis and Kociumaka, Tomasz and Pissis, Solon P. and Radoszewski, Jakub},
  title =	{{Faster Algorithms for Longest Common Substring}},
  booktitle =	{29th Annual European Symposium on Algorithms (ESA 2021)},
  pages =	{30:1--30:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-204-4},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{204},
  editor =	{Mutzel, Petra and Pagh, Rasmus and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2021.30},
  URN =		{urn:nbn:de:0030-drops-146114},
  doi =		{10.4230/LIPIcs.ESA.2021.30},
  annote =	{Keywords: longest common substring, k mismatches, wavelet tree}
}

Document

DOI: 10.4230/LIPIcs.ESA.2021.64

Bidirectional String Anchors: A New String Sampling Mechanism

Authors: Grigorios Loukides and Solon P. Pissis

Published in: LIPIcs, Volume 204, 29th Annual European Symposium on Algorithms (ESA 2021)

Abstract

The minimizers sampling mechanism is a popular mechanism for string sampling introduced independently by Schleimer et al. [SIGMOD 2003] and by Roberts et al. [Bioinf. 2004]. Given two positive integers w and k, it selects the lexicographically smallest length-k substring in every fragment of w consecutive length-k substrings (in every sliding window of length w+k-1). Minimizers samples are approximately uniform, locally consistent, and computable in linear time. Although they do not have good worst-case guarantees on their size, they are often small in practice. They thus have been successfully employed in several string processing applications. Two main disadvantages of minimizers sampling mechanisms are: first, they also do not have good guarantees on the expected size of their samples for every combination of w and k; and, second, indexes that are constructed over their samples do not have good worst-case guarantees for on-line pattern searches. To alleviate these disadvantages, we introduce bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given a positive integer 𝓁, our mechanism selects the lexicographically smallest rotation in every length-𝓁 fragment (in every sliding window of length 𝓁). We show that bd-anchors samples are also approximately uniform, locally consistent, and computable in linear time. In addition, our experiments using several datasets demonstrate that the bd-anchors sample sizes decrease proportionally to 𝓁; and that these sizes are competitive to or smaller than the minimizers sample sizes using the analogous sampling parameters. We provide theoretical justification for these results by analyzing the expected size of bd-anchors samples. We also show that by using any bd-anchors sample, we can construct, in near-linear time, an index which requires linear (extra) space in the size of the sample and answers on-line pattern searches in near-optimal time. We further show, using several datasets, that a simple implementation of our index is consistently faster for on-line pattern searches than an analogous implementation of a minimizers-based index [Grabowski and Raniszewski, Softw. Pract. Exp. 2017]. Finally, we highlight the applicability of bd-anchors by developing an efficient and effective heuristic for top-K similarity search under edit distance. We show, using synthetic datasets, that our heuristic is more accurate and more than one order of magnitude faster in top-K similarity searches than the state-of-the-art tool for the same purpose [Zhang and Zhang, KDD 2020].

Cite as

Grigorios Loukides and Solon P. Pissis. Bidirectional String Anchors: A New String Sampling Mechanism. In 29th Annual European Symposium on Algorithms (ESA 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 204, pp. 64:1-64:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{loukides_et_al:LIPIcs.ESA.2021.64,
  author =	{Loukides, Grigorios and Pissis, Solon P.},
  title =	{{Bidirectional String Anchors: A New String Sampling Mechanism}},
  booktitle =	{29th Annual European Symposium on Algorithms (ESA 2021)},
  pages =	{64:1--64:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-204-4},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{204},
  editor =	{Mutzel, Petra and Pagh, Rasmus and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2021.64},
  URN =		{urn:nbn:de:0030-drops-146456},
  doi =		{10.4230/LIPIcs.ESA.2021.64},
  annote =	{Keywords: string algorithms, string sampling, text indexing, top-K similarity search}
}

Document

DOI: 10.4230/LIPIcs.ESA.2021.77

Hardness of Detecting Abelian and Additive Square Factors in Strings

Authors: Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, and Wiktor Zuba

Published in: LIPIcs, Volume 204, 29th Annual European Symposium on Algorithms (ESA 2021)

Abstract

We prove 3SUM-hardness (no strongly subquadratic-time algorithm, assuming the 3SUM conjecture) of several problems related to finding Abelian square and additive square factors in a string. In particular, we conclude conditional optimality of the state-of-the-art algorithms for finding such factors. Overall, we show 3SUM-hardness of (a) detecting an Abelian square factor of an odd half-length, (b) computing centers of all Abelian square factors, (c) detecting an additive square factor in a length-n string of integers of magnitude n^{𝒪(1)}, and (d) a problem of computing a double 3-term arithmetic progression (i.e., finding indices i ≠ j such that (x_i+x_j)/2 = x_{(i+j)/2}) in a sequence of integers x₁,… ,x_n of magnitude n^{𝒪(1)}. Problem (d) is essentially a convolution version of the AVERAGE problem that was proposed in a manuscript of Erickson. We obtain a conditional lower bound for it with the aid of techniques recently developed by Dudek et al. [STOC 2020]. Problem (d) immediately reduces to problem (c) and is a step in reductions to problems (a) and (b). In conditional lower bounds for problems (a) and (b) we apply an encoding of Amir et al. [ICALP 2014] and extend it using several string gadgets that include arbitrarily long Abelian-square-free strings. Our reductions also imply conditional lower bounds for detecting Abelian squares in strings over a constant-sized alphabet. We also show a subquadratic upper bound in this case, applying a result of Chan and Lewenstein [STOC 2015].

Cite as

Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, and Wiktor Zuba. Hardness of Detecting Abelian and Additive Square Factors in Strings. In 29th Annual European Symposium on Algorithms (ESA 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 204, pp. 77:1-77:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{radoszewski_et_al:LIPIcs.ESA.2021.77,
  author =	{Radoszewski, Jakub and Rytter, Wojciech and Straszy\'{n}ski, Juliusz and Wale\'{n}, Tomasz and Zuba, Wiktor},
  title =	{{Hardness of Detecting Abelian and Additive Square Factors in Strings}},
  booktitle =	{29th Annual European Symposium on Algorithms (ESA 2021)},
  pages =	{77:1--77:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-204-4},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{204},
  editor =	{Mutzel, Petra and Pagh, Rasmus and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2021.77},
  URN =		{urn:nbn:de:0030-drops-146581},
  doi =		{10.4230/LIPIcs.ESA.2021.77},
  annote =	{Keywords: Abelian square, additive square, 3SUM problem}
}

Document

Invited Talk

DOI: 10.4230/LIPIcs.MFCS.2021.5

Sublinear Algorithms for Edit Distance (Invited Talk)

Authors: Barna Saha

Published in: LIPIcs, Volume 202, 46th International Symposium on Mathematical Foundations of Computer Science (MFCS 2021)

Abstract

The edit distance is a way of quantifying how similar two strings are to one another by counting the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. A simple dynamic programming computes the edit distance between two strings of length n in O(n²) time, and a more sophisticated algorithm runs in time O(n+t²) where t is the distance (Landau, Myers and Schmidt, SICOMP 1998). In pursuit of obtaining faster running time, the last couple of decades have seen a flurry of research on approximating edit distance, including polylogarithmic approximation in near-linear time (Andoni, Krauthgamer and Onak, FOCS 2010), and a constant-factor approximation in subquadratic time (Chakrabarty, Das, Goldenberg, Koucký and Saks, FOCS 2018). In this talk, we will discuss recent progress that goes beyond linear time, and studies sublinear time algorithms for edit distance. We will also discuss the role preprocessing might play in designing fast algorithms. This is a joint work with Elazar Goldenberg, Tomasz Kociumaka, Robert Krauthgamer, and Aviad Rubinstein.

Cite as

Barna Saha. Sublinear Algorithms for Edit Distance (Invited Talk). In 46th International Symposium on Mathematical Foundations of Computer Science (MFCS 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 202, p. 5:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{saha:LIPIcs.MFCS.2021.5,
  author =	{Saha, Barna},
  title =	{{Sublinear Algorithms for Edit Distance}},
  booktitle =	{46th International Symposium on Mathematical Foundations of Computer Science (MFCS 2021)},
  pages =	{5:1--5:1},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-201-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{202},
  editor =	{Bonchi, Filippo and Puglisi, Simon J.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2021.5},
  URN =		{urn:nbn:de:0030-drops-144452},
  doi =		{10.4230/LIPIcs.MFCS.2021.5},
  annote =	{Keywords: Edit distance, sublinear algorithms, string processing}
}

Document

DOI: 10.4230/LIPIcs.CPM.2021.5

The k-Mappability Problem Revisited

Authors: Amihood Amir, Itai Boneh, and Eitan Kondratovsky

Published in: LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

Abstract

The k-mappability problem has two integers parameters m and k. For every subword of size m in a text S, we wish to report the number of indices in S in which the word occurs with at most k mismatches. The problem was lately tackled by Alzamel et al. [Mai Alzamel et al., 2018]. For a text with constant alphabet Σ and k ∈ O(1), they present an algorithm with linear space and O(nlog^{k+1}n) time. For the case in which k = 1 and a constant size alphabet, a faster algorithm with linear space and O(nlog(n)log log(n)) time was presented in [Mai Alzamel et al., 2020]. In this work, we enhance the techniques of [Mai Alzamel et al., 2020] to obtain an algorithm with linear space and O(n log(n)) time for k = 1. Our algorithm removes the constraint of the alphabet being of constant size. We also present linear algorithms for the case of k = 1, |Σ| ∈ O(1) and m = Ω(√n).

Cite as

Amihood Amir, Itai Boneh, and Eitan Kondratovsky. The k-Mappability Problem Revisited. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 5:1-5:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{amir_et_al:LIPIcs.CPM.2021.5,
  author =	{Amir, Amihood and Boneh, Itai and Kondratovsky, Eitan},
  title =	{{The k-Mappability Problem Revisited}},
  booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
  pages =	{5:1--5:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-186-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{191},
  editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.5},
  URN =		{urn:nbn:de:0030-drops-139566},
  doi =		{10.4230/LIPIcs.CPM.2021.5},
  annote =	{Keywords: Pattern Matching, Hamming Distance, Suffix Tree, Suffix Array}
}

Document

DOI: 10.4230/LIPIcs.ESA.2020.39

Practical Performance of Space Efficient Data Structures for Longest Common Extensions

Authors: Patrick Dinklage, Johannes Fischer, Alexander Herlez, Tomasz Kociumaka, and Florian Kurpicz

Published in: LIPIcs, Volume 173, 28th Annual European Symposium on Algorithms (ESA 2020)

Abstract

For a text T[1,n], a Longest Common Extension (LCE) query lce_T(i,j) asks for the length of the longest common prefix of the suffixes T[i,n] and T[j,n] identified by their starting positions 1 ≤ i,j ≤ n. A classic problem in stringology asks to preprocess a static text T[1,n] over an alphabet of size σ so that LCE queries can be efficiently answered on-line. Since its introduction in the 1980’s, this problem has found numerous applications: in suffix sorting, edit distance computation, approximate pattern matching, regularities finding, string mining, and many more. Text-book solutions offer O(n) preprocessing time and O(1) query time, but they employ memory-heavy data structures, such as suffix arrays, in practice several times bigger than the text itself. Very recently, more space efficient solutions using O(nlogσ) bits of total space or even only O(log n) bits of extra space have been proposed: string synchronizing sets [Kempa and Kociumaka, STOC'19, and Birenzwige et al., SODA'20] and in-place fingerprinting [Prezza, SODA'18]. The goal of this article is to present well-engineered implementations of these new solutions and study their practicality on a commonly agreed text corpus. We show that both perform extremely well in practice, with space consumption of only around 10% of the input size for string synchronizing sets (around 20% for highly repetitive texts), and essentially no extra space for fingerprinting. Interestingly, our experiments also show that both solutions become much faster than naive scanning even for finding common prefixes of moderate length, contradicting a common belief that sophisticated data structures for LCE queries are not competitive with naive approaches [Ilie and Tinta, SPIRE'09].

Cite as

Patrick Dinklage, Johannes Fischer, Alexander Herlez, Tomasz Kociumaka, and Florian Kurpicz. Practical Performance of Space Efficient Data Structures for Longest Common Extensions. In 28th Annual European Symposium on Algorithms (ESA 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 173, pp. 39:1-39:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{dinklage_et_al:LIPIcs.ESA.2020.39,
  author =	{Dinklage, Patrick and Fischer, Johannes and Herlez, Alexander and Kociumaka, Tomasz and Kurpicz, Florian},
  title =	{{Practical Performance of Space Efficient Data Structures for Longest Common Extensions}},
  booktitle =	{28th Annual European Symposium on Algorithms (ESA 2020)},
  pages =	{39:1--39:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-162-7},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{173},
  editor =	{Grandoni, Fabrizio and Herman, Grzegorz and Sanders, Peter},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2020.39},
  URN =		{urn:nbn:de:0030-drops-129050},
  doi =		{10.4230/LIPIcs.ESA.2020.39},
  annote =	{Keywords: text indexing, longest common prefix, space efficient data structures}
}

37 Search Results for "Kociumaka, Tomasz"

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Thanks for your feedback!

Could not send message