DROPS

Document

DOI: 10.4230/LIPIcs.CPM.2024.4

Internal Pattern Matching in Small Space and Applications

Authors: Gabriel Bathie, Panagiotis Charalampopoulos, and Tatiana Starikovskaya

Published in: LIPIcs, Volume 296, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)

Abstract

In this work, we consider pattern matching variants in small space, that is, in the read-only setting, where we want to bound the space usage on top of storing the strings. Our main contribution is a space-time trade-off for the Internal Pattern Matching (IPM) problem, where the goal is to construct a data structure over a string S of length n that allows one to answer the following type of queries: Compute the occurrences of a fragment P of S inside another fragment T of S, provided that |T| < 2|P|. For any τ ∈ [1 . . n/log² n], we present a nearly-optimal Õ(n/τ)-size data structure that can be built in Õ(n) time using Õ(n/τ) extra space, and answers IPM queries in O(τ+log n log³ log n) time. IPM queries have been identified as a crucial primitive operation for the analysis of algorithms on strings. In particular, the complexities of several recent algorithms for approximate pattern matching are expressed with regards to the number of calls to a small set of primitive operations that include IPM queries; our data structure allows us to port these results to the small-space setting. We further showcase the applicability of our IPM data structure by using it to obtain space-time trade-offs for the longest common substring and circular pattern matching problems in the asymmetric streaming setting.

Cite as

Gabriel Bathie, Panagiotis Charalampopoulos, and Tatiana Starikovskaya. Internal Pattern Matching in Small Space and Applications. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 4:1-4:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{bathie_et_al:LIPIcs.CPM.2024.4,
  author =	{Bathie, Gabriel and Charalampopoulos, Panagiotis and Starikovskaya, Tatiana},
  title =	{{Internal Pattern Matching in Small Space and Applications}},
  booktitle =	{35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)},
  pages =	{4:1--4:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-326-3},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{296},
  editor =	{Inenaga, Shunsuke and Puglisi, Simon J.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.4},
  URN =		{urn:nbn:de:0030-drops-201148},
  doi =		{10.4230/LIPIcs.CPM.2024.4},
  annote =	{Keywords: internal pattern matching, longest common substring, small-space algorithms}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2023.10

Small-Space Algorithms for the Online Language Distance Problem for Palindromes and Squares

Authors: Gabriel Bathie, Tomasz Kociumaka, and Tatiana Starikovskaya

Published in: LIPIcs, Volume 283, 34th International Symposium on Algorithms and Computation (ISAAC 2023)

Abstract

We study the online variant of the language distance problem for two classical formal languages, the language of palindromes and the language of squares, and for the two most fundamental distances, the Hamming distance and the edit (Levenshtein) distance. In this problem, defined for a fixed formal language L, we are given a string T of length n, and the task is to compute the minimal distance to L from every prefix of T. We focus on the low-distance regime, where one must compute only the distances smaller than a given threshold k. In this work, our contribution is twofold: 1) First, we show streaming algorithms, which access the input string T only through a single left-to-right scan. Both for palindromes and squares, our algorithms use O(k polylog n) space and time per character in the Hamming-distance case and O(k² polylog n) space and time per character in the edit-distance case. These algorithms are randomised by necessity, and they err with probability inverse-polynomial in n. 2) Second, we show deterministic read-only online algorithms, which are also provided with read-only random access to the already processed characters of T. Both for palindromes and squares, our algorithms use O(k polylog n) space and time per character in the Hamming-distance case and O(k⁴ polylog n) space and amortised time per character in the edit-distance case.

Cite as

Gabriel Bathie, Tomasz Kociumaka, and Tatiana Starikovskaya. Small-Space Algorithms for the Online Language Distance Problem for Palindromes and Squares. In 34th International Symposium on Algorithms and Computation (ISAAC 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 283, pp. 10:1-10:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{bathie_et_al:LIPIcs.ISAAC.2023.10,
  author =	{Bathie, Gabriel and Kociumaka, Tomasz and Starikovskaya, Tatiana},
  title =	{{Small-Space Algorithms for the Online Language Distance Problem for Palindromes and Squares}},
  booktitle =	{34th International Symposium on Algorithms and Computation (ISAAC 2023)},
  pages =	{10:1--10:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-289-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{283},
  editor =	{Iwata, Satoru and Kakimura, Naonori},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2023.10},
  URN =		{urn:nbn:de:0030-drops-193124},
  doi =		{10.4230/LIPIcs.ISAAC.2023.10},
  annote =	{Keywords: Approximate pattern matching, streaming algorithms, palindromes, squares}
}

Document

DOI: 10.4230/LIPIcs.CPM.2023.12

Compressed Indexing for Consecutive Occurrences

Authors: Paweł Gawrychowski, Garance Gourdel, Tatiana Starikovskaya, and Teresa Anna Steiner

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

Abstract

The fundamental question considered in algorithms on strings is that of indexing, that is, preprocessing a given string for specific queries. By now we have a number of efficient solutions for this problem when the queries ask for an exact occurrence of a given pattern P. However, practical applications motivate the necessity of considering more complex queries, for example concerning near occurrences of two patterns. Recently, Bille et al. [CPM 2021] introduced a variant of such queries, called gapped consecutive occurrences, in which a query consists of two patterns P₁ and P₂ and a range [a,b], and one must find all consecutive occurrences (q₁,q₂) of P₁ and P₂ such that q₂-q₁ ∈ [a,b]. By their results, we cannot hope for a very efficient indexing structure for such queries, even if a = 0 is fixed (although at the same time they provided a non-trivial upper bound). Motivated by this, we focus on a text given as a straight-line program (SLP) and design an index taking space polynomial in the size of the grammar that answers such queries in time optimal up to polylog factors.

Cite as

Paweł Gawrychowski, Garance Gourdel, Tatiana Starikovskaya, and Teresa Anna Steiner. Compressed Indexing for Consecutive Occurrences. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 12:1-12:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{gawrychowski_et_al:LIPIcs.CPM.2023.12,
  author =	{Gawrychowski, Pawe{\l} and Gourdel, Garance and Starikovskaya, Tatiana and Steiner, Teresa Anna},
  title =	{{Compressed Indexing for Consecutive Occurrences}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{12:1--12:22},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.12},
  URN =		{urn:nbn:de:0030-drops-179666},
  doi =		{10.4230/LIPIcs.CPM.2023.12},
  annote =	{Keywords: Compressed indexing, two patterns, consecutive occurrences}
}

Document

Invited Talk

DOI: 10.4230/LIPIcs.ISAAC.2021.1

Streaming Pattern Matching (Invited Talk)

Authors: Tatiana Starikovskaya

Published in: LIPIcs, Volume 212, 32nd International Symposium on Algorithms and Computation (ISAAC 2021)

Abstract

Many classical algorithms for string processing assume that the input can be accessed in full via constant-time random access, which poses a serious limitation in the modern era of data deluge. In this talk, we will focus on the streaming model of computation that allows to overcome this issue. In this model of computation, we assume that the input arrives as a stream, one character at a time, which captures a situation when the data are sequential measurements or an output of an algorithm. The space complexity is defined as all the space used, including the space used to store any information about the input, which allows to develop ultra-efficient algorithms. The first streaming algorithm for pattern matching was presented in the seminal paper of Porat and Porat in FOCS 2009. For a pattern of length m, the algorithm uses only O(log m) space, while any classical algorithm requires Ω(m) space. This result served as a foundation of the area of streaming algorithms for pattern matching. After a brief survey of the area, we will discuss two questions in more details: the k-mismatch problem and the pattern matching with k-edits problem. In the k-mismatch problem, one is given a pattern and a text, and the task is to find all substrings of the text that have at most k mismatches with the pattern. The current best algorithm for this problem was given by Clifford, Kociumaka, and Porat in SODA 2019, and for a pattern of length m it uses O(k log m) space and Õ(√k) time per character of the text. In the pattern matching with k-edits problem, the task is similar, but one must find substrings that can be transformed into the pattern by at most k edits, i.e. substitutions, insertions, and deletions of a character. For this problem, the first streaming algorithm was presented by Kociumaka, Porat, and Starikovskaya in FOCS 2021. The algorithm takes Õ(poly(k)) space and Õ(poly(k)) time per character of the text.

Cite as

Tatiana Starikovskaya. Streaming Pattern Matching (Invited Talk). In 32nd International Symposium on Algorithms and Computation (ISAAC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 212, p. 1:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{starikovskaya:LIPIcs.ISAAC.2021.1,
  author =	{Starikovskaya, Tatiana},
  title =	{{Streaming Pattern Matching}},
  booktitle =	{32nd International Symposium on Algorithms and Computation (ISAAC 2021)},
  pages =	{1:1--1:1},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-214-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{212},
  editor =	{Ahn, Hee-Kap and Sadakane, Kunihiko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2021.1},
  URN =		{urn:nbn:de:0030-drops-154345},
  doi =		{10.4230/LIPIcs.ISAAC.2021.1},
  annote =	{Keywords: Streaming algorithms, Pattern matching, Hamming distance, Edit distance}
}

Document

Track B: Automata, Logic, Semantics, and Theory of Programming

DOI: 10.4230/LIPIcs.ICALP.2021.119

Property Testing of Regular Languages with Applications to Streaming Property Testing of Visibly Pushdown Languages

Authors: Gabriel Bathie and Tatiana Starikovskaya

Published in: LIPIcs, Volume 198, 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)

Abstract

In this work, we revisit the problem of testing membership in regular languages, first studied by Alon et al. [Alon et al., 2001]. We develop a one-sided error property tester for regular languages under weighted edit distance that makes 𝒪(ε^{-1} log(1/ε)) non-adaptive queries, assuming that the language is described by an automaton of constant size. Moreover, we show a matching lower bound, essentially closing the problem for the edit distance. As an application, we improve the space bound of the current best streaming property testing algorithm for visibly pushdown languages from 𝒪(ε^{-4} log⁶ n) to 𝒪(ε^{-3} log⁵ n log log n), where n is the size of the input. Finally, we provide a Ω(max(ε^{-1}, log n)) lower bound on the memory necessary to test visibly pushdown languages in the streaming model, significantly narrowing the gap between the known bounds.

Cite as

Gabriel Bathie and Tatiana Starikovskaya. Property Testing of Regular Languages with Applications to Streaming Property Testing of Visibly Pushdown Languages. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 119:1-119:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{bathie_et_al:LIPIcs.ICALP.2021.119,
  author =	{Bathie, Gabriel and Starikovskaya, Tatiana},
  title =	{{Property Testing of Regular Languages with Applications to Streaming Property Testing of Visibly Pushdown Languages}},
  booktitle =	{48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)},
  pages =	{119:1--119:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-195-5},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{198},
  editor =	{Bansal, Nikhil and Merelli, Emanuela and Worrell, James},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2021.119},
  URN =		{urn:nbn:de:0030-drops-141881},
  doi =		{10.4230/LIPIcs.ICALP.2021.119},
  annote =	{Keywords: property testing, regular languages, streaming algorithms, visibly pushdown languages}
}

Document

Complete Volume

DOI: 10.4230/LIPIcs.CPM.2021

LIPIcs, Volume 191, CPM 2021, Complete Volume

Authors: Paweł Gawrychowski and Tatiana Starikovskaya

Published in: LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

Abstract

LIPIcs, Volume 191, CPM 2021, Complete Volume

Cite as

32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 1-404, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@Proceedings{gawrychowski_et_al:LIPIcs.CPM.2021,
  title =	{{LIPIcs, Volume 191, CPM 2021, Complete Volume}},
  booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
  pages =	{1--404},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-186-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{191},
  editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021},
  URN =		{urn:nbn:de:0030-drops-139509},
  doi =		{10.4230/LIPIcs.CPM.2021},
  annote =	{Keywords: LIPIcs, Volume 191, CPM 2021, Complete Volume}
}

Document

Front Matter

DOI: 10.4230/LIPIcs.CPM.2021.0

Front Matter, Table of Contents, Preface, Conference Organization

Authors: Paweł Gawrychowski and Tatiana Starikovskaya

Published in: LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

Abstract

Front Matter, Table of Contents, Preface, Conference Organization

Cite as

32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 0:i-0:xiv, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{gawrychowski_et_al:LIPIcs.CPM.2021.0,
  author =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  title =	{{Front Matter, Table of Contents, Preface, Conference Organization}},
  booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
  pages =	{0:i--0:xiv},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-186-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{191},
  editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.0},
  URN =		{urn:nbn:de:0030-drops-139512},
  doi =		{10.4230/LIPIcs.CPM.2021.0},
  annote =	{Keywords: Front Matter, Table of Contents, Preface, Conference Organization}
}

Document

APPROX

DOI: 10.4230/LIPIcs.APPROX/RANDOM.2020.35

L_p Pattern Matching in a Stream

Authors: Tatiana Starikovskaya, Michal Svagerka, and Przemysław Uznański

Published in: LIPIcs, Volume 176, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020)

Abstract

We consider the problem of computing distance between a pattern of length n and all n-length subwords of a text in the streaming model. In the streaming setting, only the Hamming distance (L₀) has been studied. It is known that computing the exact Hamming distance between a pattern and a streaming text requires Ω(n) space (folklore). Therefore, to develop sublinear-space solutions, one must relax their requirements. One possibility to do so is to compute only the distances bounded by a threshold k, see [SODA'19, Clifford, Kociumaka, Porat] and references therein. The motivation for this variant of this problem is that we are interested in subwords of the text that are similar to the pattern, i.e. in subwords such that the distance between them and the pattern is relatively small. On the other hand, the main application of the streaming setting is processing large-scale data, such as biological data. Recent advances in hardware technology allow generating such data at a very high speed, but unfortunately, the produced data may contain about 10% of noise [Biol. Direct.'07, Klebanov and Yakovlev]. To analyse such data, it is not sufficient to consider small distances only. A possible workaround for this issue is the (1±ε)-approximation. This line of research was initiated in [ICALP'16, Clifford and Starikovskaya] who gave a (1±ε)-approximation algorithm with space 𝒪~(ε^{-5}√n). In this work, we show a suite of new streaming algorithms for computing the Hamming, L₁, L₂ and general L_p (0 < p < 2) distances between the pattern and the text. Our results significantly extend over the previous result in this setting. In particular, for the Hamming distance and for the L_p distance when 0 < p ≤ 1 we show a streaming algorithm that uses 𝒪~(ε^{-2}√n) space for polynomial-size alphabets.

Cite as

Tatiana Starikovskaya, Michal Svagerka, and Przemysław Uznański. L_p Pattern Matching in a Stream. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 176, pp. 35:1-35:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{starikovskaya_et_al:LIPIcs.APPROX/RANDOM.2020.35,
  author =	{Starikovskaya, Tatiana and Svagerka, Michal and Uzna\'{n}ski, Przemys{\l}aw},
  title =	{{L\underlinep Pattern Matching in a Stream}},
  booktitle =	{Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020)},
  pages =	{35:1--35:23},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-164-1},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{176},
  editor =	{Byrka, Jaros{\l}aw and Meka, Raghu},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.APPROX/RANDOM.2020.35},
  URN =		{urn:nbn:de:0030-drops-126381},
  doi =		{10.4230/LIPIcs.APPROX/RANDOM.2020.35},
  annote =	{Keywords: streaming algorithms, approximate pattern matching}
}

Document

DOI: 10.4230/LIPIcs.CPM.2020.16

Approximating Longest Common Substring with k mismatches: Theory and Practice

Authors: Garance Gourdel, Tomasz Kociumaka, Jakub Radoszewski, and Tatiana Starikovskaya

Published in: LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

Abstract

In the problem of the longest common substring with k mismatches we are given two strings X, Y and must find the maximal length 𝓁 such that there is a length-𝓁 substring of X and a length-𝓁 substring of Y that differ in at most k positions. The length 𝓁 can be used as a robust measure of similarity between X, Y. In this work, we develop new approximation algorithms for computing 𝓁 that are significantly more efficient that previously known solutions from the theoretical point of view. Our approach is simple and practical, which we confirm via an experimental evaluation, and is probably close to optimal as we demonstrate via a conditional lower bound.

Cite as

Garance Gourdel, Tomasz Kociumaka, Jakub Radoszewski, and Tatiana Starikovskaya. Approximating Longest Common Substring with k mismatches: Theory and Practice. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 16:1-16:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{gourdel_et_al:LIPIcs.CPM.2020.16,
  author =	{Gourdel, Garance and Kociumaka, Tomasz and Radoszewski, Jakub and Starikovskaya, Tatiana},
  title =	{{Approximating Longest Common Substring with k mismatches: Theory and Practice}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{16:1--16:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.16},
  URN =		{urn:nbn:de:0030-drops-121410},
  doi =		{10.4230/LIPIcs.CPM.2020.16},
  annote =	{Keywords: approximation algorithms, string similarity, LSH, conditional lower bounds}
}

Document

DOI: 10.4230/LIPIcs.CPM.2020.23

Compressed Orthogonal Search on Suffix Arrays with Applications to Range LCP

Authors: Kotaro Matsuda, Kunihiko Sadakane, Tatiana Starikovskaya, and Masakazu Tateshita

Published in: LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

Abstract

We propose a space-efficient data structure for orthogonal range search on suffix arrays. For general two-dimensional orthogonal range search problem on a set of n points, there exists an n log n (1+o(1))-bit data structure supporting O(log n)-time counting queries [Mäkinen, Navarro 2007]. The space matches the information-theoretic lower bound. However, if we focus on a point set representing a suffix array, there is a chance to obtain a space efficient data structure. We answer this question affirmatively. Namely, we propose a data structure for orthogonal range search on suffix arrays which uses O(1/(ε) n (H₀+1)) bits where H₀ is the order-0 entropy of the string and answers a counting query in O(n^ε) time for any constant ε>0. As an application, we give an O(1/(ε) n (H₀+1))-bit data structure for the range LCP problem.

Cite as

Kotaro Matsuda, Kunihiko Sadakane, Tatiana Starikovskaya, and Masakazu Tateshita. Compressed Orthogonal Search on Suffix Arrays with Applications to Range LCP. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 23:1-23:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{matsuda_et_al:LIPIcs.CPM.2020.23,
  author =	{Matsuda, Kotaro and Sadakane, Kunihiko and Starikovskaya, Tatiana and Tateshita, Masakazu},
  title =	{{Compressed Orthogonal Search on Suffix Arrays with Applications to Range LCP}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{23:1--23:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.23},
  URN =		{urn:nbn:de:0030-drops-121482},
  doi =		{10.4230/LIPIcs.CPM.2020.23},
  annote =	{Keywords: Orthogonal Range Search, Succinct Data Structure, Suffix Array}
}

Document

DOI: 10.4230/LIPIcs.STACS.2020.18

Generalised Pattern Matching Revisited

Authors: Bartłomiej Dudek, Paweł Gawrychowski, and Tatiana Starikovskaya

Published in: LIPIcs, Volume 154, 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020)

Abstract

In the problem of Generalised Pattern Matching (GPM) [STOC'94, Muthukrishnan and Palem], we are given a text T of length n over an alphabet Σ_T, a pattern P of length m over an alphabet Σ_P, and a matching relationship ⊆ Σ_T × Σ_P, and must return all substrings of T that match P (reporting) or the number of mismatches between each substring of T of length m and P (counting). In this work, we improve over all previously known algorithms for this problem: - For ? being the maximum number of characters that match a fixed character, we show two new Monte Carlo algorithms, a reporting algorithm with time ?(? n log n log m) and a (1-ε)-approximation counting algorithm with time ?(ε^-1 ? n log n log m). We then derive a (1-ε)-approximation deterministic counting algorithm for GPM with ?(ε^-2 ? n log⁶ n) time. - For ? being the number of pairs of matching characters, we demonstrate Monte Carlo algorithms for reporting and (1-ε)-approximate counting with running time ?(√? n log m √{log n}) and ?(√{ε^-1 ?} n log m √{log n}), respectively, as well as a (1-ε)-approximation deterministic algorithm for the counting variant of GPM with ?(ε^-1 √{?} n log^{7/2} n) time. - Finally, for ℐ being the total number of disjoint intervals of characters that match the m characters of the pattern P, we show that both the reporting and the counting variants of GPM can be solved exactly and deterministically in ?(n√{ℐ log m} +n log n) time. At the heart of our new deterministic upper bounds for ? and ? lies a faster construction of superimposed codes, which solves an open problem posed in [FOCS'97, Indyk] and can be of independent interest. To conclude, we demonstrate first lower bounds for GPM. We start by showing that any deterministic or Monte Carlo algorithm for GPM must use Ω(?) time, and then proceed to show higher lower bounds for combinatorial algorithms. These bounds show that our algorithms are almost optimal, unless a radically new approach is developed.

Cite as

Bartłomiej Dudek, Paweł Gawrychowski, and Tatiana Starikovskaya. Generalised Pattern Matching Revisited. In 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 154, pp. 18:1-18:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{dudek_et_al:LIPIcs.STACS.2020.18,
  author =	{Dudek, Bart{\l}omiej and Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  title =	{{Generalised Pattern Matching Revisited}},
  booktitle =	{37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020)},
  pages =	{18:1--18:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-140-5},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{154},
  editor =	{Paul, Christophe and Bl\"{a}ser, Markus},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2020.18},
  URN =		{urn:nbn:de:0030-drops-118798},
  doi =		{10.4230/LIPIcs.STACS.2020.18},
  annote =	{Keywords: pattern matching, superimposed codes, conditional lower bounds}
}

Document

DOI: 10.4230/LIPIcs.ISAAC.2019.6

Sliding Window Property Testing for Regular Languages

Authors: Moses Ganardi, Danny Hucke, Markus Lohrey, and Tatiana Starikovskaya

Published in: LIPIcs, Volume 149, 30th International Symposium on Algorithms and Computation (ISAAC 2019)

Abstract

We study the problem of recognizing regular languages in a variant of the streaming model of computation, called the sliding window model. In this model, we are given a size of the sliding window n and a stream of symbols. At each time instant, we must decide whether the suffix of length n of the current stream ("the active window") belongs to a given regular language. Recent works [Moses Ganardi et al., 2018; Moses Ganardi et al., 2016] showed that the space complexity of an optimal deterministic sliding window algorithm for this problem is either constant, logarithmic or linear in the window size n and provided natural language theoretic characterizations of the space complexity classes. Subsequently, [Moses Ganardi et al., 2018] extended this result to randomized algorithms to show that any such algorithm admits either constant, double logarithmic, logarithmic or linear space complexity. In this work, we make an important step forward and combine the sliding window model with the property testing setting, which results in ultra-efficient algorithms for all regular languages. Informally, a sliding window property tester must accept the active window if it belongs to the language and reject it if it is far from the language. We show that for every regular language, there is a deterministic sliding window property tester that uses logarithmic space and a randomized sliding window property tester with two-sided error that uses constant space.

Cite as

Moses Ganardi, Danny Hucke, Markus Lohrey, and Tatiana Starikovskaya. Sliding Window Property Testing for Regular Languages. In 30th International Symposium on Algorithms and Computation (ISAAC 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 149, pp. 6:1-6:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{ganardi_et_al:LIPIcs.ISAAC.2019.6,
  author =	{Ganardi, Moses and Hucke, Danny and Lohrey, Markus and Starikovskaya, Tatiana},
  title =	{{Sliding Window Property Testing for Regular Languages}},
  booktitle =	{30th International Symposium on Algorithms and Computation (ISAAC 2019)},
  pages =	{6:1--6:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-130-6},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{149},
  editor =	{Lu, Pinyan and Zhang, Guochuan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2019.6},
  URN =		{urn:nbn:de:0030-drops-115023},
  doi =		{10.4230/LIPIcs.ISAAC.2019.6},
  annote =	{Keywords: Streaming algorithms, approximation algorithms, regular languages}
}

Document

DOI: 10.4230/LIPIcs.CPM.2019.21

Streaming Dictionary Matching with Mismatches

Authors: Paweł Gawrychowski and Tatiana Starikovskaya

Published in: LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

Abstract

In the k-mismatch problem we are given a pattern of length m and a text and must find all locations where the Hamming distance between the pattern and the text is at most k. A series of recent breakthroughs have resulted in an ultra-efficient streaming algorithm for this problem that requires only O(k log m/k) space [Clifford, Kociumaka, Porat, SODA 2019]. In this work, we consider a strictly harder problem called dictionary matching with k mismatches, where we are given a dictionary of d patterns of lengths at most m and must find all their k-mismatch occurrences in the text, and show the first streaming algorithm for it. The algorithm uses O(k d log^k d polylog m) space and processes each position of the text in O(k log^k d polylog m + occ) time, where occ is the number of k-mismatch occurrences of the patterns that end at this position. The algorithm is randomised and outputs correct answers with high probability.

Cite as

Paweł Gawrychowski and Tatiana Starikovskaya. Streaming Dictionary Matching with Mismatches. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 21:1-21:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{gawrychowski_et_al:LIPIcs.CPM.2019.21,
  author =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  title =	{{Streaming Dictionary Matching with Mismatches}},
  booktitle =	{30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
  pages =	{21:1--21:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-103-0},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{128},
  editor =	{Pisanti, Nadia and P. Pissis, Solon},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.21},
  URN =		{urn:nbn:de:0030-drops-104925},
  doi =		{10.4230/LIPIcs.CPM.2019.21},
  annote =	{Keywords: Streaming, multiple pattern matching, Hamming distance}
}

Document

DOI: 10.4230/LIPIcs.CPM.2019.22

Quasi-Periodicity in Streams

Authors: Paweł Gawrychowski, Jakub Radoszewski, and Tatiana Starikovskaya

Published in: LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

Abstract

In this work, we show two streaming algorithms for computing the length of the shortest cover of a string of length n. We start by showing a two-pass algorithm that uses O(log^2 n) space and then show a one-pass streaming algorithm that uses O(sqrt{n log n}) space. Both algorithms run in near-linear time. The algorithms are randomized and compute the answer incorrectly with probability inverse-polynomial in n. We also show that there is no sublinear-space streaming algorithm for computing the length of the shortest seed of a string.

Cite as

Paweł Gawrychowski, Jakub Radoszewski, and Tatiana Starikovskaya. Quasi-Periodicity in Streams. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 22:1-22:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{gawrychowski_et_al:LIPIcs.CPM.2019.22,
  author =	{Gawrychowski, Pawe{\l} and Radoszewski, Jakub and Starikovskaya, Tatiana},
  title =	{{Quasi-Periodicity in Streams}},
  booktitle =	{30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
  pages =	{22:1--22:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-103-0},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{128},
  editor =	{Pisanti, Nadia and P. Pissis, Solon},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.22},
  URN =		{urn:nbn:de:0030-drops-104930},
  doi =		{10.4230/LIPIcs.CPM.2019.22},
  annote =	{Keywords: Streaming algorithms, quasi-periodicity, covers, seeds}
}

Document

DOI: 10.4230/LIPIcs.MFCS.2018.66

Fast Entropy-Bounded String Dictionary Look-Up with Mismatches

Authors: Pawel Gawrychowski, Gad M. Landau, and Tatiana Starikovskaya

Published in: LIPIcs, Volume 117, 43rd International Symposium on Mathematical Foundations of Computer Science (MFCS 2018)

Abstract

We revisit the fundamental problem of dictionary look-up with mismatches. Given a set (dictionary) of d strings of length m and an integer k, we must preprocess it into a data structure to answer the following queries: Given a query string Q of length m, find all strings in the dictionary that are at Hamming distance at most k from Q. Chan and Lewenstein (CPM 2015) showed a data structure for k = 1 with optimal query time O(m/w + occ), where w is the size of a machine word and occ is the size of the output. The data structure occupies O(w d log^{1+epsilon} d) extra bits of space (beyond the entropy-bounded space required to store the dictionary strings). In this work we give a solution with similar bounds for a much wider range of values k. Namely, we give a data structure that has O(m/w + log^k d + occ) query time and uses O(w d log^k d) extra bits of space.

Cite as

Pawel Gawrychowski, Gad M. Landau, and Tatiana Starikovskaya. Fast Entropy-Bounded String Dictionary Look-Up with Mismatches. In 43rd International Symposium on Mathematical Foundations of Computer Science (MFCS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 117, pp. 66:1-66:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{gawrychowski_et_al:LIPIcs.MFCS.2018.66,
  author =	{Gawrychowski, Pawel and Landau, Gad M. and Starikovskaya, Tatiana},
  title =	{{Fast Entropy-Bounded String Dictionary Look-Up with Mismatches}},
  booktitle =	{43rd International Symposium on Mathematical Foundations of Computer Science (MFCS 2018)},
  pages =	{66:1--66:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-086-6},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{117},
  editor =	{Potapov, Igor and Spirakis, Paul and Worrell, James},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2018.66},
  URN =		{urn:nbn:de:0030-drops-96486},
  doi =		{10.4230/LIPIcs.MFCS.2018.66},
  annote =	{Keywords: Dictionary look-up, Hamming distance, compact data structures}
}

Document

DOI: 10.4230/LIPIcs.STACS.2018.22

Upper and Lower Bounds for Dynamic Data Structures on Strings

Authors: Raphaël Clifford, Allan Grønlund, Kasper Green Larsen, and Tatiana Starikovskaya

Published in: LIPIcs, Volume 96, 35th Symposium on Theoretical Aspects of Computer Science (STACS 2018)

Abstract

We consider a range of simply stated dynamic data structure problems on strings. An update changes one symbol in the input and a query asks us to compute some function of the pattern of length m and a substring of a longer text. We give both conditional and unconditional lower bounds for variants of exact matching with wildcards, inner product, and Hamming distance computation via a sequence of reductions. As an example, we show that there does not exist an O(m^{1/2-epsilon}) time algorithm for a large range of these problems unless the online Boolean matrix-vector multiplication conjecture is false. We also provide nearly matching upper bounds for most of the problems we consider.

Cite as

Raphaël Clifford, Allan Grønlund, Kasper Green Larsen, and Tatiana Starikovskaya. Upper and Lower Bounds for Dynamic Data Structures on Strings. In 35th Symposium on Theoretical Aspects of Computer Science (STACS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 96, pp. 22:1-22:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{clifford_et_al:LIPIcs.STACS.2018.22,
  author =	{Clifford, Rapha\"{e}l and Gr{\o}nlund, Allan and Larsen, Kasper Green and Starikovskaya, Tatiana},
  title =	{{Upper and Lower Bounds for Dynamic Data Structures on Strings}},
  booktitle =	{35th Symposium on Theoretical Aspects of Computer Science (STACS 2018)},
  pages =	{22:1--22:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-062-0},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{96},
  editor =	{Niedermeier, Rolf and Vall\'{e}e, Brigitte},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2018.22},
  URN =		{urn:nbn:de:0030-drops-85088},
  doi =		{10.4230/LIPIcs.STACS.2018.22},
  annote =	{Keywords: exact pattern matching with wildcards, hamming distance, inner product, conditional lower bounds}
}

Document

DOI: 10.4230/LIPIcs.CPM.2017.13

Communication and Streaming Complexity of Approximate Pattern Matching

Authors: Tatiana Starikovskaya

Published in: LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Abstract

We consider the approximate pattern matching problem. Given a text T of length 2n and a pattern P of length n, the task is to decide for each prefix T[1, j] of T if it ends with a string that is at the edit distance at most k from P. If this is the case, we must output the edit distance and the corresponding edit operations. We first show the communication complexity of the problem for the case when Alice and Bob both share the pattern and Alice holds the first half of the text and Bob the second half, and for the case when Alice holds the first half of the text, Bob the second half of the text, and Charlie the pattern. We then develop the first sublinear-space streaming algorithm for the problem. The algorithm is randomised with error probability at most 1/poly(n).

Cite as

Tatiana Starikovskaya. Communication and Streaming Complexity of Approximate Pattern Matching. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 13:1-13:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{starikovskaya:LIPIcs.CPM.2017.13,
  author =	{Starikovskaya, Tatiana},
  title =	{{Communication and Streaming Complexity of Approximate Pattern Matching}},
  booktitle =	{28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)},
  pages =	{13:1--13:11},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-039-2},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{78},
  editor =	{K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.13},
  URN =		{urn:nbn:de:0030-drops-73206},
  doi =		{10.4230/LIPIcs.CPM.2017.13},
  annote =	{Keywords: approximate pattern matching, edit distance, randomised algorithms, streaming algorithms, communication complexity}
}

Document

DOI: 10.4230/LIPIcs.ICALP.2016.20

Approximate Hamming Distance in a Stream

Authors: Raphaël Clifford and Tatiana Starikovskaya

Published in: LIPIcs, Volume 55, 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016)

Abstract

We consider the problem of computing a (1+epsilon)-approximation of the Hamming distance between a pattern of length n and successive substrings of a stream. We first look at the one-way randomised communication complexity of this problem. We show the following: - If Alice and Bob both share the pattern and Alice has the first half of the stream and Bob the second half, then there is an O(epsilon^{-4}*log^2(n)) bit randomised one-way communication protocol. - If Alice has the pattern, Bob the first half of the stream and Charlie the second half, then there is an O(epsilon^{-2}*sqrt(n)*log(n)) bit randomised one-way communication protocol. We then go on to develop small space streaming algorithms for (1 + epsilon)-approximate Hamming distance which give worst case running time guarantees per arriving symbol. - For binary input alphabets there is an O(epsilon^{-3}*sqrt(n)*log^2(n)) space and O(epsilon^{-2}*log(n)) time streaming (1 + epsilon)-approximate Hamming distance algorithm. - For general input alphabets there is an O(epsilon^{-5}*sqrt(n)*log^4(n)) space and O(epsilon^{-4}*log^3(n)) time streaming (1 + epsilon)-approximate Hamming distance algorithm.

Cite as

Raphaël Clifford and Tatiana Starikovskaya. Approximate Hamming Distance in a Stream. In 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 55, pp. 20:1-20:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{clifford_et_al:LIPIcs.ICALP.2016.20,
  author =	{Clifford, Rapha\"{e}l and Starikovskaya, Tatiana},
  title =	{{Approximate Hamming Distance in a Stream}},
  booktitle =	{43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016)},
  pages =	{20:1--20:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-013-2},
  ISSN =	{1868-8969},
  year =	{2016},
  volume =	{55},
  editor =	{Chatzigiannakis, Ioannis and Mitzenmacher, Michael and Rabani, Yuval and Sangiorgi, Davide},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2016.20},
  URN =		{urn:nbn:de:0030-drops-62992},
  doi =		{10.4230/LIPIcs.ICALP.2016.20},
  annote =	{Keywords: Hamming distance, communication complexity, data stream model}
}

Document

DOI: 10.4230/LIPIcs.CPM.2016.21

Longest Common Substring with Approximately k Mismatches

Authors: Tatiana Starikovskaya

Published in: LIPIcs, Volume 54, 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)

Abstract

In the longest common substring problem we are given two strings of length n and must find a substring of maximal length that occurs in both strings. It is well-known that the problem can be solved in linear time, but the solution is not robust and can vary greatly when the input strings are changed even by one letter. To circumvent this, Leimester and Morgenstern introduced the problem of the longest common substring with k mismatches. Lately, this problem has received a lot of attention in the literature, and several algorithms have been suggested. The running time of these algorithms is n^{2-o(1)}, and unfortunately, conditional lower bounds have been shown which imply that there is little hope to improve this bound. In this paper we study a different but closely related problem of the longest common substring with approximately k mismatches and use computational geometry techniques to show that it admits a randomised solution with strongly subquadratic running time.

Cite as

Tatiana Starikovskaya. Longest Common Substring with Approximately k Mismatches. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 54, pp. 21:1-21:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{starikovskaya:LIPIcs.CPM.2016.21,
  author =	{Starikovskaya, Tatiana},
  title =	{{Longest Common Substring with Approximately k Mismatches}},
  booktitle =	{27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)},
  pages =	{21:1--21:11},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-012-5},
  ISSN =	{1868-8969},
  year =	{2016},
  volume =	{54},
  editor =	{Grossi, Roberto and Lewenstein, Moshe},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2016.21},
  URN =		{urn:nbn:de:0030-drops-60720},
  doi =		{10.4230/LIPIcs.CPM.2016.21},
  annote =	{Keywords: Randomised algorithms, string similarity measures, longest common substring, sketching, locality-sensitive hashing}
}

Search Results

Documents authored by Starikovskaya, Tatiana

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as