LIPIcs, Volume 161

31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)




Event

CPM 2020, June 17-19, 2020, Copenhagen, Denmark

Editors

Inge Li Gørtz
  • Technical University of Denmark, DTU Compute, Lyngby, Denmark
Oren Weimann
  • University of Haifa, Israel

Publication Details

  • published at: 2020-06-09
  • Publisher: Schloss Dagstuhl – Leibniz-Zentrum für Informatik
  • ISBN: 978-3-95977-149-8
  • DBLP: db/conf/cpm/cpm2020

Documents

Document
Complete Volume
LIPIcs, Volume 161, CPM 2020, Complete Volume

Authors: Inge Li Gørtz and Oren Weimann


Abstract
LIPIcs, Volume 161, CPM 2020, Complete Volume

Cite as

31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 1-418, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@Proceedings{grtz_et_al:LIPIcs.CPM.2020,
  title =	{{LIPIcs, Volume 161, CPM 2020, Complete Volume}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{1--418},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020},
  URN =		{urn:nbn:de:0030-drops-121245},
  doi =		{10.4230/LIPIcs.CPM.2020},
  annote =	{Keywords: LIPIcs, Volume 161, CPM 2020, Complete Volume}
}
Document
Front Matter
Front Matter, Table of Contents, Preface, Conference Organization

Authors: Inge Li Gørtz and Oren Weimann


Abstract
Front Matter, Table of Contents, Preface, Conference Organization

Cite as

31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 0:i-0:xvi, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{grtz_et_al:LIPIcs.CPM.2020.0,
  author =	{G{\o}rtz, Inge Li and Weimann, Oren},
  title =	{{Front Matter, Table of Contents, Preface, Conference Organization}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{0:i--0:xvi},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.0},
  URN =		{urn:nbn:de:0030-drops-121252},
  doi =		{10.4230/LIPIcs.CPM.2020.0},
  annote =	{Keywords: Front Matter, Table of Contents, Preface, Conference Organization}
}
Document
Invited Talk
Algebraic Algorithms for Finding Patterns in Graphs (Invited Talk)

Authors: Thore Husfeldt


Abstract
I will give a gentle introduction to algebraic graph algorithms by showing how to determine if a given graph contains a simple path of length k. This is a famous problem admitting a beautiful and widely-known algorithm, namely the colour-coding method of Alon, Yuster and Zwick (1995). Starting from this entirely combinatorial approach, I will carefully develop an algebraic perspective on the same problem. First, I will explain how the colour-coding algorithm can be understood as the evaluation of a well-known expression (sometimes called the "walk-sum" of the graph) in a commutative algebra called the zeon algebra. From there, I will introduce the exterior algebra and present the algebraic framework recently developed with Brand and Dell (2018). The presentation is aimed at a combinatorially-minded audience largely innocent of abstract algebra.
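
As a companion to this summary, here is a minimal Python sketch of the colour-coding idea (not taken from the talk; the graph representation, the function names, and the convention that a "path of length k" has k vertices are assumptions made for illustration): colour the vertices uniformly at random with k colours, search for a colourful path by dynamic programming over colour subsets, and repeat the random colouring to boost the success probability.

import random

def has_colorful_path(adj, colors, k):
    # reachable[v] = set of colour subsets (bitmasks) of colourful walks ending
    # at v; requiring distinct colours forces every such walk to be a simple path.
    n = len(adj)
    reachable = [{1 << colors[v]} for v in range(n)]
    for _ in range(k - 1):
        nxt = [set() for _ in range(n)]
        for u in range(n):
            for mask in reachable[u]:
                for v in adj[u]:
                    if not mask & (1 << colors[v]):
                        nxt[v].add(mask | (1 << colors[v]))
        reachable = nxt
    full = (1 << k) - 1
    return any(full in masks for masks in reachable)

def contains_k_vertex_path(adj, k, trials=100):
    # Colour-coding: a fixed simple path on k vertices becomes colourful with
    # probability k!/k^k per trial, so enough independent trials find it w.h.p.
    for _ in range(trials):
        colors = [random.randrange(k) for _ in range(len(adj))]
        if has_colorful_path(adj, colors, k):
            return True
    return False

# adj is an adjacency list, e.g. adj = [[1], [0, 2], [1, 3], [2]] is a path on 4 vertices.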

Cite as

Thore Husfeldt. Algebraic Algorithms for Finding Patterns in Graphs (Invited Talk). In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, p. 1:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{husfeldt:LIPIcs.CPM.2020.1,
  author =	{Husfeldt, Thore},
  title =	{{Algebraic Algorithms for Finding Patterns in Graphs}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{1:1--1:1},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.1},
  URN =		{urn:nbn:de:0030-drops-121261},
  doi =		{10.4230/LIPIcs.CPM.2020.1},
  annote =	{Keywords: paths, exterior algebra, wedge product, color-coding, parameterized complexity}
}
Document
Finding the Anticover of a String

Authors: Mai Alzamel, Alessio Conte, Shuhei Denzumi, Roberto Grossi, Costas S. Iliopoulos, Kazuhiro Kurita, and Kunihiro Wasa


Abstract
A k-anticover of a string x is a set of pairwise distinct factors of x of equal length k, such that every symbol of x is contained in an occurrence of at least one of those factors. The existence of a k-anticover can be seen as a notion of non-redundancy, which has applications in computational biology, where anticovers are associated with various non-regulatory mechanisms. In this paper we address the complexity of the problem of finding a k-anticover of a string x if it exists, showing that the decision problem is NP-complete on general strings for k ≥ 3. We also show that the problem admits a polynomial-time solution for k = 2. For unbounded k, we provide an exact exponential algorithm to find a k-anticover of a string of length n (or determine that none exists), which runs in O*(min{3^{(n-k)/3}, ((k(k+1))/2)^{n/(k+1)}}) time using polynomial space.
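
A direct transcription of the definition into a checker may help fix ideas; the following is an illustrative Python sketch (not from the paper; the function name is made up) that verifies a candidate set of factors is a k-anticover of x.

def is_k_anticover(x, factors, k):
    # Pairwise distinct length-k factors of x whose occurrences together
    # cover every position of x.
    factors = list(factors)
    if len(set(factors)) != len(factors):
        return False
    if any(len(f) != k or f not in x for f in factors):
        return False
    covered = [False] * len(x)
    for f in factors:
        start = x.find(f)
        while start != -1:
            for i in range(start, start + k):
                covered[i] = True
            start = x.find(f, start + 1)
    return all(covered)

# e.g. is_k_anticover("abaab", {"ab", "ba", "aa"}, 2) == True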

Cite as

Mai Alzamel, Alessio Conte, Shuhei Denzumi, Roberto Grossi, Costas S. Iliopoulos, Kazuhiro Kurita, and Kunihiro Wasa. Finding the Anticover of a String. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 2:1-2:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{alzamel_et_al:LIPIcs.CPM.2020.2,
  author =	{Alzamel, Mai and Conte, Alessio and Denzumi, Shuhei and Grossi, Roberto and Iliopoulos, Costas S. and Kurita, Kazuhiro and Wasa, Kunihiro},
  title =	{{Finding the Anticover of a String}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{2:1--2:11},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.2},
  URN =		{urn:nbn:de:0030-drops-121270},
  doi =		{10.4230/LIPIcs.CPM.2020.2},
  annote =	{Keywords: Anticover, String algorithms, Stringology, NP-complete}
}
Document
Double String Tandem Repeats

Authors: Amihood Amir, Ayelet Butman, Gad M. Landau, Shoshana Marcus, and Dina Sokol


Abstract
A tandem repeat is an occurrence of two adjacent identical substrings. In this paper, we introduce the notion of a double string, which consists of two parallel strings, and we study the problem of locating all tandem repeats in a double string. The problem introduced here has applications beyond actual double strings, as we illustrate by solving two different problems with the algorithm for the double string tandem repeats problem. The first problem is that of finding all corner-sharing tandems in a 2-dimensional text, defined by Apostolico and Brimkov. The second problem is that of finding all scaled tandem repeats in a 1-dimensional text, where a scaled tandem repeat is defined as a string UU' such that U' is a discrete scaling of U. In addition to the algorithms for exact tandem repeats, we also present algorithms that solve the problem in the inexact sense, allowing up to k mismatches. We believe that this framework will open a new perspective for other problems in the future.
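
For orientation, tandem repeats in an ordinary (single) string can already be listed by a naive quadratic scan; the Python sketch below is purely illustrative and is not the algorithm of the paper.

def tandem_repeats(s):
    # All occurrences UU in s, reported as (start, |U|, U).
    n, out = len(s), []
    for i in range(n):
        for l in range(1, (n - i) // 2 + 1):
            if s[i:i + l] == s[i + l:i + 2 * l]:
                out.append((i, l, s[i:i + l]))
    return out

# e.g. tandem_repeats("aabaab") reports (0, 1, 'a'), (0, 3, 'aab') and (3, 1, 'a').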

Cite as

Amihood Amir, Ayelet Butman, Gad M. Landau, Shoshana Marcus, and Dina Sokol. Double String Tandem Repeats. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 3:1-3:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{amir_et_al:LIPIcs.CPM.2020.3,
  author =	{Amir, Amihood and Butman, Ayelet and Landau, Gad M. and Marcus, Shoshana and Sokol, Dina},
  title =	{{Double String Tandem Repeats}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{3:1--3:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.3},
  URN =		{urn:nbn:de:0030-drops-121283},
  doi =		{10.4230/LIPIcs.CPM.2020.3},
  annote =	{Keywords: double string, tandem repeat, 2-dimensional, scale}
}
Document
Efficient Tree-Structured Categorical Retrieval

Authors: Djamal Belazzougui and Gregory Kucherov


Abstract
We study a document retrieval problem in the new framework where D text documents are organized in a category tree with a pre-defined number h of categories. This situation occurs e.g. with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern p and a category (level in the category tree), we wish to efficiently retrieve the t categorical units containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses n(log σ(1+o(1)) + log D + O(h)) + O(Δ) bits of space and O(|p|+t) query time, where n is the total length of the documents, σ is the size of the alphabet used in the documents, and Δ is the total number of nodes in the category tree. Another solution uses n(log σ(1+o(1)) + O(log D)) + O(Δ) + O(D log n) bits of space and O(|p| + t log D) query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.

Cite as

Djamal Belazzougui and Gregory Kucherov. Efficient Tree-Structured Categorical Retrieval. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 4:1-4:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{belazzougui_et_al:LIPIcs.CPM.2020.4,
  author =	{Belazzougui, Djamal and Kucherov, Gregory},
  title =	{{Efficient Tree-Structured Categorical Retrieval}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{4:1--4:11},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.4},
  URN =		{urn:nbn:de:0030-drops-121299},
  doi =		{10.4230/LIPIcs.CPM.2020.4},
  annote =	{Keywords: pattern matching, document retrieval, category tree, space-efficient data structures}
}
Document
Time-Space Tradeoffs for Finding a Long Common Substring

Authors: Stav Ben-Nun, Shay Golan, Tomasz Kociumaka, and Matan Kraus


Abstract
We consider the problem of finding, given two documents of total length n, a longest string occurring as a substring of both documents. This problem, known as the Longest Common Substring (LCS) problem, has a classic 𝒪(n)-time solution dating back to the discovery of suffix trees (Weiner, 1973) and their efficient construction for integer alphabets (Farach-Colton, 1997). However, these solutions require Θ(n) space, which is prohibitive in many applications. To address this issue, Starikovskaya and Vildhøj (CPM 2013) showed that for n^{2/3} ≤ s ≤ n, the LCS problem can be solved in 𝒪(s) space and 𝒪̃(n²/s) time. Kociumaka et al. (ESA 2014) generalized this tradeoff to 1 ≤ s ≤ n, thus providing a smooth time-space tradeoff from constant to linear space. In this paper, we obtain a significant speed-up for instances where the length L of the sought LCS is large. For 1 ≤ s ≤ n, we show that the LCS problem can be solved in 𝒪(s) space and 𝒪̃(n²/(L⋅s) +n) time. The result is based on techniques originating from the LCS with Mismatches problem (Flouri et al., 2015; Charalampopoulos et al., CPM 2018), on space-efficient locally consistent parsing (Birenzwige et al., SODA 2020), and on the structure of maximal repetitions (runs) in the input documents.
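
As a point of reference for the tradeoff, the textbook dynamic program below (an illustrative Python sketch, not the paper's algorithm) computes a longest common substring in Θ(n²) time while keeping only one DP row, i.e. linear space.

def longest_common_substring(a, b):
    # DP value at (i, j): length of the longest common suffix of a[:i] and b[:j];
    # only the previous row is kept, so space is O(|b|) and time O(|a|*|b|).
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], j
        prev = cur
    return b[best_end - best_len:best_end]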

Cite as

Stav Ben-Nun, Shay Golan, Tomasz Kociumaka, and Matan Kraus. Time-Space Tradeoffs for Finding a Long Common Substring. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 5:1-5:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{bennun_et_al:LIPIcs.CPM.2020.5,
  author =	{Ben-Nun, Stav and Golan, Shay and Kociumaka, Tomasz and Kraus, Matan},
  title =	{{Time-Space Tradeoffs for Finding a Long Common Substring}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{5:1--5:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.5},
  URN =		{urn:nbn:de:0030-drops-121302},
  doi =		{10.4230/LIPIcs.CPM.2020.5},
  annote =	{Keywords: longest common substring, time-space tradeoff, local consistency, periodicity}
}
Document
On Two Measures of Distance Between Fully-Labelled Trees

Authors: Giulia Bernardini, Paola Bonizzoni, and Paweł Gawrychowski


Abstract
The last decade brought a significant increase in the amount of data and a variety of new inference methods for reconstructing the detailed evolutionary history of various cancers. This brings the need for designing efficient procedures for comparing rooted trees representing the evolution of mutations in tumor phylogenies. Bernardini et al. [CPM 2019] recently introduced a notion of the rearrangement distance for fully-labelled trees motivated by this necessity. This notion originates from two operations: one that permutes the labels of the nodes, the other that affects the topology of the tree. Each operation alone defines a distance that can be computed in polynomial time, while the actual rearrangement distance, which combines the two, was proven to be NP-hard. We answer two open questions left open by the previous work. First, what is the complexity of computing the permutation distance? Second, is there a constant-factor approximation algorithm for estimating the rearrangement distance between two arbitrary trees? We answer the first one by showing, via a two-way reduction, that calculating the permutation distance between two trees on n nodes is equivalent, up to polylogarithmic factors, to finding the largest cardinality matching in a sparse bipartite graph. In particular, by plugging in the algorithm of Liu and Sidford [ArXiv 2020], we obtain an 𝒪̃(n^{4/3+o(1)}) time algorithm for computing the permutation distance between two trees on n nodes. Then we answer the second question positively, and design a linear-time constant-factor approximation algorithm that does not need any assumption on the trees.

Cite as

Giulia Bernardini, Paola Bonizzoni, and Paweł Gawrychowski. On Two Measures of Distance Between Fully-Labelled Trees. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 6:1-6:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{bernardini_et_al:LIPIcs.CPM.2020.6,
  author =	{Bernardini, Giulia and Bonizzoni, Paola and Gawrychowski, Pawe{\l}},
  title =	{{On Two Measures of Distance Between Fully-Labelled Trees}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{6:1--6:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.6},
  URN =		{urn:nbn:de:0030-drops-121318},
  doi =		{10.4230/LIPIcs.CPM.2020.6},
  annote =	{Keywords: Tree distance, Cancer progression, Approximation algorithms, Fine-grained complexity}
}
Document
String Sanitization Under Edit Distance

Authors: Giulia Bernardini, Huiping Chen, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Leen Stougie, and Michelle Sweering


Abstract
Let W be a string of length n over an alphabet Σ, k be a positive integer, and 𝒮 be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of 𝒮 occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual’s data and 𝒮 represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in 𝒪(kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in 𝒪(n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Künnemann, FOCS 2015], to ETFS.
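
The two combinatorial constraints and the objective can be spelled out with a small illustrative Python sketch (not the paper's O(kn²) algorithm; the function names, 0-based indexing and the explicit alphabet parameter sigma are assumptions): a feasibility check for a candidate X, plus the classic edit-distance dynamic program, which is the quantity to be minimized over feasible candidates and the DP that the paper's algorithm modifies.

def edit_distance(a, b):
    # Classic O(|a|*|b|) edit-distance DP, kept row by row.
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return prev[len(b)]

def spectrum(s, k, sensitive, sigma):
    # Non-sensitive length-k substrings over the alphabet sigma, in text order.
    return [s[i:i + k] for i in range(len(s) - k + 1)
            if set(s[i:i + k]) <= sigma and s[i:i + k] not in sensitive]

def satisfies_etfs_constraints(W, X, k, sensitive, sigma):
    # (i) no sensitive length-k substring occurs in X;
    # (ii) the remaining length-k substrings over sigma appear in the same order.
    if any(X[i:i + k] in sensitive for i in range(len(X) - k + 1)):
        return False
    return spectrum(W, k, sensitive, sigma) == spectrum(X, k, sensitive, sigma)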

Cite as

Giulia Bernardini, Huiping Chen, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Leen Stougie, and Michelle Sweering. String Sanitization Under Edit Distance. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{bernardini_et_al:LIPIcs.CPM.2020.7,
  author =	{Bernardini, Giulia and Chen, Huiping and Loukides, Grigorios and Pisanti, Nadia and Pissis, Solon P. and Stougie, Leen and Sweering, Michelle},
  title =	{{String Sanitization Under Edit Distance}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{7:1--7:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.7},
  URN =		{urn:nbn:de:0030-drops-121324},
  doi =		{10.4230/LIPIcs.CPM.2020.7},
  annote =	{Keywords: String algorithms, data sanitization, edit distance, dynamic programming, conditional lower bound}
}
Document
Counting Distinct Patterns in Internal Dictionary Matching

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, and Wiktor Zuba


Abstract
We consider the problem of preprocessing a text T of length n and a dictionary 𝒟 in order to be able to efficiently answer queries CountDistinct(i,j), that is, given i and j return the number of patterns from 𝒟 that occur in the fragment T[i..j]. The dictionary is internal in the sense that each pattern in 𝒟 is given as a fragment of T. This way, the dictionary takes space proportional to the number of patterns d=|𝒟| rather than their total length, which could be Θ(n⋅ d). An 𝒪̃(n+d)-size data structure that answers CountDistinct(i,j) queries 𝒪(log n)-approximately in 𝒪̃(1) time was recently proposed in a work that introduced internal dictionary matching [ISAAC 2019]. Here we present an 𝒪̃(n+d)-size data structure that answers CountDistinct(i,j) queries 2-approximately in 𝒪̃(1) time. Using range queries, for any m, we give an 𝒪̃(min(nd/m,n²/m²)+d)-size data structure that answers CountDistinct(i,j) queries exactly in 𝒪̃(m) time. We also consider the special case when the dictionary consists of all square factors of the string. We design an 𝒪(n log² n)-size data structure that allows us to count distinct squares in a text fragment T[i..j] in 𝒪(log n) time.
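
A naive baseline makes the query semantics concrete; the Python sketch below is illustrative only (it is not the paper's data structure, does no preprocessing, and simply scans the dictionary at query time, with fragments given as 0-based half-open intervals).

def count_distinct(T, dictionary, i, j):
    # dictionary: iterable of (start, end) fragments of T; count the distinct
    # pattern strings among them that occur in T[i:j].
    fragment = T[i:j]
    patterns = {T[s:e] for (s, e) in dictionary}
    return sum(1 for p in patterns if p and p in fragment)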

Cite as

Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, and Wiktor Zuba. Counting Distinct Patterns in Internal Dictionary Matching. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 8:1-8:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{charalampopoulos_et_al:LIPIcs.CPM.2020.8,
  author =	{Charalampopoulos, Panagiotis and Kociumaka, Tomasz and Mohamed, Manal and Radoszewski, Jakub and Rytter, Wojciech and Straszy\'{n}ski, Juliusz and Wale\'{n}, Tomasz and Zuba, Wiktor},
  title =	{{Counting Distinct Patterns in Internal Dictionary Matching}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{8:1--8:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.8},
  URN =		{urn:nbn:de:0030-drops-121336},
  doi =		{10.4230/LIPIcs.CPM.2020.8},
  annote =	{Keywords: dictionary matching, internal pattern matching, squares}
}
Document
Dynamic String Alignment

Authors: Panagiotis Charalampopoulos, Tomasz Kociumaka, and Shay Mozes


Abstract
We consider the problem of dynamically maintaining an optimal alignment of two strings, each of length at most n, as they undergo insertions, deletions, and substitutions of letters. The string alignment problem generalizes the longest common subsequence (LCS) problem and the edit distance problem (also with non-unit costs, as long as insertions and deletions cost the same). The conditional lower bound of Backurs and Indyk [J. Comput. 2018] for computing the LCS in the static case implies that strongly sublinear update time for the dynamic string alignment problem would refute the Strong Exponential Time Hypothesis. We essentially match this lower bound when the alignment weights are constants, by showing how to process each update in 𝒪̃(n) time. When the weights are integers bounded in absolute value by some w=n^{𝒪(1)}, we can maintain the alignment in 𝒪̃(n ⋅ min {√ n,w}) time per update. For the 𝒪̃(nw)-time algorithm, we heavily rely on Tiskin’s work on semi-local LCS, and in particular, in an implicit way, on his algorithm for computing the (min,+)-product of two simple unit-Monge matrices [Algorithmica 2015]. As for the 𝒪̃(n√n)-time algorithm, we employ efficient data structures for computing distances in planar graphs.
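
For reference, the static objective being maintained under updates is the classic alignment dynamic program; below is an illustrative Python sketch with integer weights (the parameter names are assumptions, and the sketch simply recomputes from scratch rather than handling updates).

def alignment_score(A, B, match=1, mismatch=0, indel=0):
    # Maximum total weight of an alignment: aligning equal letters scores
    # `match`, aligning unequal letters scores `mismatch`, and leaving a
    # letter unaligned scores `indel` (classic O(|A|*|B|) DP, row by row).
    prev = [j * indel for j in range(len(B) + 1)]
    for i in range(1, len(A) + 1):
        cur = [i * indel] + [0] * len(B)
        for j in range(1, len(B) + 1):
            pair = match if A[i - 1] == B[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + pair, prev[j] + indel, cur[j - 1] + indel)
        prev = cur
    return prev[len(B)]

# With match=1, mismatch=0, indel=0 this is the LCS length; with match=0,
# mismatch=-1, indel=-1 it equals minus the unit-cost edit distance.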

Cite as

Panagiotis Charalampopoulos, Tomasz Kociumaka, and Shay Mozes. Dynamic String Alignment. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 9:1-9:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{charalampopoulos_et_al:LIPIcs.CPM.2020.9,
  author =	{Charalampopoulos, Panagiotis and Kociumaka, Tomasz and Mozes, Shay},
  title =	{{Dynamic String Alignment}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{9:1--9:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.9},
  URN =		{urn:nbn:de:0030-drops-121344},
  doi =		{10.4230/LIPIcs.CPM.2020.9},
  annote =	{Keywords: string alignment, edit distance, longest common subsequence, (unit-)Monge matrices, (min,+)-product}
}
Document
Unary Words Have the Smallest Levenshtein k-Neighbourhoods

Authors: Panagiotis Charalampopoulos, Solon P. Pissis, Jakub Radoszewski, Tomasz Waleń, and Wiktor Zuba


Abstract
The edit distance (a.k.a. the Levenshtein distance) between two words is defined as the minimum number of insertions, deletions or substitutions of letters needed to transform one word into another. The Levenshtein k-neighbourhood of a word w is the set of words that are at edit distance at most k from w. This is perhaps the most important concept underlying BLAST, a widely-used tool for comparing biological sequences. A natural combinatorial question is to ask for upper and lower bounds on the size of this set. The answer to this question has important algorithmic implications as well. Myers notes that "such bounds would give a tighter characterisation of the running time of the algorithm" behind BLAST. We show that the size of the Levenshtein k-neighbourhood of any word of length n over an arbitrary alphabet is not smaller than the size of the Levenshtein k-neighbourhood of a unary word of length n, thus providing a tight lower bound on the size of the Levenshtein k-neighbourhood. We remark that this result was posed as a conjecture by Dufresne at WCTA 2019.
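
On tiny instances the statement can be verified by brute force; the illustrative Python sketch below (not from the paper) enumerates all words of length between n-k and n+k over the alphabet and counts those within edit distance k.

from itertools import product

def edit_distance(a, b):
    # Classic row-by-row edit-distance DP.
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return prev[len(b)]

def neighbourhood_size(w, k, alphabet):
    # |{u : edit_distance(u, w) <= k}|; any such u has length within k of |w|.
    count = 0
    for m in range(max(0, len(w) - k), len(w) + k + 1):
        for letters in product(alphabet, repeat=m):
            if edit_distance("".join(letters), w) <= k:
                count += 1
    return count

# The theorem predicts, e.g., neighbourhood_size("aaaa", 1, "ab") <= neighbourhood_size("abab", 1, "ab").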

Cite as

Panagiotis Charalampopoulos, Solon P. Pissis, Jakub Radoszewski, Tomasz Waleń, and Wiktor Zuba. Unary Words Have the Smallest Levenshtein k-Neighbourhoods. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 10:1-10:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{charalampopoulos_et_al:LIPIcs.CPM.2020.10,
  author =	{Charalampopoulos, Panagiotis and Pissis, Solon P. and Radoszewski, Jakub and Wale\'{n}, Tomasz and Zuba, Wiktor},
  title =	{{Unary Words Have the Smallest Levenshtein k-Neighbourhoods}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{10:1--10:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.10},
  URN =		{urn:nbn:de:0030-drops-121359},
  doi =		{10.4230/LIPIcs.CPM.2020.10},
  annote =	{Keywords: combinatorics on words, Levenshtein distance, edit distance}
}
Document
Summarizing Diverging String Sequences, with Applications to Chain-Letter Petitions

Authors: Patty Commins, David Liben-Nowell, Tina Liu, and Kiran Tomlinson


Abstract
Algorithms to find optimal alignments among strings, or to find a parsimonious summary of a collection of strings, are well studied in a variety of contexts, addressing a wide range of interesting applications. In this paper, we consider chain letters, which contain a growing sequence of signatories added as the letter propagates. The unusual constellation of features exhibited by chain letters (one-ended growth, divergence, and mutation) make their propagation, and thus the corresponding reconstruction problem, both distinctive and rich. Here, inspired by these chain letters, we formally define the problem of computing an optimal summary of a set of diverging string sequences. From a collection of these sequences of names, with each sequence noisily corresponding to a branch of the unknown tree T representing the letter’s true dissemination, can we efficiently and accurately reconstruct a tree T' ≈ T? In this paper, we give efficient exact algorithms for this summarization problem when the number of sequences is small; for larger sets of sequences, we prove hardness and provide an efficient heuristic algorithm. We evaluate this heuristic on synthetic data sets chosen to emulate real chain letters, showing that our algorithm is competitive with or better than previous approaches, and that it also comes close to finding the true trees in these synthetic datasets.

Cite as

Patty Commins, David Liben-Nowell, Tina Liu, and Kiran Tomlinson. Summarizing Diverging String Sequences, with Applications to Chain-Letter Petitions. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 11:1-11:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{commins_et_al:LIPIcs.CPM.2020.11,
  author =	{Commins, Patty and Liben-Nowell, David and Liu, Tina and Tomlinson, Kiran},
  title =	{{Summarizing Diverging String Sequences, with Applications to Chain-Letter Petitions}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{11:1--11:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.11},
  URN =		{urn:nbn:de:0030-drops-121367},
  doi =		{10.4230/LIPIcs.CPM.2020.11},
  annote =	{Keywords: edit distance, tree reconstruction, information propagation, chain letters}
}
Document
Detecting k-(Sub-)Cadences and Equidistant Subsequence Occurrences

Authors: Mitsuru Funakoshi, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda, and Ayumi Shinohara


Abstract
The equidistant subsequence pattern matching problem is considered. Given a pattern string P and a text string T, we say that P is an equidistant subsequence of T if P is a subsequence of the text such that consecutive symbols of P in the occurrence are equally spaced. We can consider the problem of equidistant subsequences as a generalization of (sub-)cadences. We give bit-parallel algorithms that yield o(n²) time algorithms for finding k-(sub-)cadences and equidistant subsequences. Furthermore, we show O(n log² n) and O(n log n) time algorithms for equidistant and Abelian equidistant matching, respectively, for the case |P| = 3. The algorithms make use of a recently introduced technique that can efficiently compute convolutions with linear constraints.
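
The definition translates directly into a naive search over start positions and gaps; the Python sketch below is illustrative only (it is not the bit-parallel algorithm of the paper and assumes |P| ≥ 2).

def equidistant_occurrences(P, T):
    # All (start, step) with step >= 1 such that T[start + q*step] == P[q]
    # for q = 0..|P|-1, i.e. P occurs as an equidistant subsequence of T.
    n, m = len(T), len(P)
    hits = []
    for start in range(n):
        for step in range(1, (n - 1 - start) // (m - 1) + 1):
            if all(T[start + q * step] == P[q] for q in range(m)):
                hits.append((start, step))
    return hits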

Cite as

Mitsuru Funakoshi, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda, and Ayumi Shinohara. Detecting k-(Sub-)Cadences and Equidistant Subsequence Occurrences. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 12:1-12:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{funakoshi_et_al:LIPIcs.CPM.2020.12,
  author =	{Funakoshi, Mitsuru and Nakashima, Yuto and Inenaga, Shunsuke and Bannai, Hideo and Takeda, Masayuki and Shinohara, Ayumi},
  title =	{{Detecting k-(Sub-)Cadences and Equidistant Subsequence Occurrences}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{12:1--12:11},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.12},
  URN =		{urn:nbn:de:0030-drops-121375},
  doi =		{10.4230/LIPIcs.CPM.2020.12},
  annote =	{Keywords: string algorithms, pattern matching, bit parallelism, subsequences, cadences}
}
Document
FM-Index Reveals the Reverse Suffix Array

Authors: Arnab Ganguly, Daniel Gibney, Sahar Hooshmand, M. Oğuzhan Külekci, and Sharma V. Thankachan


Abstract
Given a text T[1,n] over an alphabet Σ of size σ, the suffix array of T stores the lexicographic order of the suffixes of T. The suffix array needs Θ(n log n) bits of space compared to the n log σ bits needed to store T itself. A major breakthrough [FM-Index, FOCS '00] in the last two decades has been encoding the suffix array in a near-optimal number of bits (≈ log σ bits per character). One can decode a suffix array value using the FM-Index in log^{O(1)} n time. We study an extension of the problem in which we have to also decode the suffix array values of the reverse text. This problem has numerous applications such as in approximate pattern matching [Lam et al., BIBM '09]. Known approaches maintain the FM-Index of both the forward and the reverse text, which drives up the space occupancy to 2n log σ bits (plus lower order terms). This brings in the natural question of whether we can decode the suffix array values of both the forward and the reverse text, but by using n log σ bits (plus lower order terms). We answer this question positively, and show that given the FM-Index of the forward text, we can decode the suffix array value of the reverse text in near logarithmic average time. Additionally, our experimental results are competitive when compared to the standard approach of maintaining the FM-Index for both the forward and the reverse text. We believe that applications that require both the forward and reverse text will benefit from our approach.
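
As a naive point of comparison (not the paper's solution), both suffix arrays can of course be built explicitly; the paper's point is precisely to avoid storing a second index. An illustrative Python sketch:

def suffix_array(T):
    # Naive suffix array: starting positions of the suffixes of T in
    # lexicographic order (slow due to full string comparisons).
    return sorted(range(len(T)), key=lambda i: T[i:])

T = "banana"
sa_forward = suffix_array(T)         # suffix array of the forward text
sa_reverse = suffix_array(T[::-1])   # suffix array of the reverse text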

Cite as

Arnab Ganguly, Daniel Gibney, Sahar Hooshmand, M. Oğuzhan Külekci, and Sharma V. Thankachan. FM-Index Reveals the Reverse Suffix Array. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 13:1-13:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{ganguly_et_al:LIPIcs.CPM.2020.13,
  author =	{Ganguly, Arnab and Gibney, Daniel and Hooshmand, Sahar and K\"{u}lekci, M. O\u{g}uzhan and Thankachan, Sharma V.},
  title =	{{FM-Index Reveals the Reverse Suffix Array}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{13:1--13:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.13},
  URN =		{urn:nbn:de:0030-drops-121388},
  doi =		{10.4230/LIPIcs.CPM.2020.13},
  annote =	{Keywords: Data Structures, Suffix Trees, String Algorithms, Compression, Burrows - Wheeler transform, FM-Index}
}
Document
On Indeterminate Strings Matching

Authors: Paweł Gawrychowski, Samah Ghazawi, and Gad M. Landau


Abstract
Given two indeterminate equal-length strings p and t with a set of characters per position in both strings, we obtain a determinate string p_w from p and a determinate string t_w from t by choosing one character per position. Then, we say that p and t match when p_w and t_w match for some choice of the characters. While the most standard notion of a match for determinate strings is that they are simply identical, in certain applications it is more appropriate to use other definitions, with the prime examples being parameterized matching, order-preserving matching, and the recently introduced Cartesian tree matching. We provide a systematic study of the complexity of string matching for indeterminate equal-length strings, for different notions of matching. We use n to denote the length of both strings, and r to be an upper bound on the number of uncertain characters per position. First, we provide the first polynomial time algorithm for the Cartesian tree version that runs in deterministic 𝒪(n log² n) and expected 𝒪(n log n log log n) time using 𝒪(n log n) space, for constant r. Second, we establish NP-hardness of the order-preserving version for r=2, thus solving a question explicitly stated by Henriques et al. [CPM 2018], who showed hardness for r=3. Third, we establish NP-hardness of the parameterized version for r=2. As both parameterized and order-preserving indeterminate matching reduce to the standard determinate matching for r=1, this provides a complete classification for these three variants.
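
For the standard notion of a match the condition boils down to a per-position intersection test; the Python sketch below is illustrative (it does not cover the Cartesian tree, order-preserving or parameterized variants studied in the paper) and represents an indeterminate string as a list of character sets.

def indeterminate_match(p, t):
    # p and t match in the standard sense iff one can pick a common character
    # at every position, i.e. the character sets intersect position by position.
    return len(p) == len(t) and all(a & b for a, b in zip(p, t))

# e.g. indeterminate_match([{'a', 'b'}, {'c'}], [{'b'}, {'c', 'd'}]) == True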

Cite as

Paweł Gawrychowski, Samah Ghazawi, and Gad M. Landau. On Indeterminate Strings Matching. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 14:1-14:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{gawrychowski_et_al:LIPIcs.CPM.2020.14,
  author =	{Gawrychowski, Pawe{\l} and Ghazawi, Samah and Landau, Gad M.},
  title =	{{On Indeterminate Strings Matching}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{14:1--14:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.14},
  URN =		{urn:nbn:de:0030-drops-121393},
  doi =		{10.4230/LIPIcs.CPM.2020.14},
  annote =	{Keywords: string matching, indeterminate strings, Cartesian trees, order-preserving matching, parameterized matching}
}
Document
The Streaming k-Mismatch Problem: Tradeoffs Between Space and Total Time

Authors: Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, and Ely Porat


Abstract
We revisit the k-mismatch problem in the streaming model on a pattern of length m and a streaming text of length n, both over a size-σ alphabet. The current state-of-the-art algorithm for the streaming k-mismatch problem, by Clifford et al. [SODA 2019], uses Õ(k) space and Õ(√k) worst-case time per character. The space complexity is known to be (unconditionally) optimal, and the worst-case time per character matches a conditional lower bound. However, there is a gap between the total time cost of the algorithm, which is Õ(n√k), and the fastest known offline algorithm, which costs Õ(n + min(nk/√m, σn)) time. Moreover, it is not known whether improvements over the Õ(n√k) total time are possible when using more than O(k) space. We address these gaps by designing a randomized streaming algorithm for the k-mismatch problem that, given an integer parameter k≤s≤m, uses Õ(s) space and costs Õ(n+min(nk²/m, nk/√s, σnm/s)) total time. For s=m, the total runtime becomes Õ(n + min(nk/√m, σn)), which matches the time cost of the fastest offline algorithm. Moreover, the worst-case time cost per character is still Õ(√k).
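
For comparison with the streaming bounds, the naive offline routine below (an illustrative Python sketch, not the paper's algorithm) checks every alignment explicitly in O(nm) time.

def k_mismatch_occurrences(P, T, k):
    # Positions i such that the Hamming distance between P and T[i:i+|P|] is <= k.
    m, out = len(P), []
    for i in range(len(T) - m + 1):
        mismatches = 0
        for a, b in zip(P, T[i:i + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break
        if mismatches <= k:
            out.append(i)
    return out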

Cite as

Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, and Ely Porat. The Streaming k-Mismatch Problem: Tradeoffs Between Space and Total Time. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 15:1-15:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{golan_et_al:LIPIcs.CPM.2020.15,
  author =	{Golan, Shay and Kociumaka, Tomasz and Kopelowitz, Tsvi and Porat, Ely},
  title =	{{The Streaming k-Mismatch Problem: Tradeoffs Between Space and Total Time}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{15:1--15:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.15},
  URN =		{urn:nbn:de:0030-drops-121406},
  doi =		{10.4230/LIPIcs.CPM.2020.15},
  annote =	{Keywords: Streaming pattern matching, Hamming distance, k-mismatch}
}
Document
Approximating Longest Common Substring with k mismatches: Theory and Practice

Authors: Garance Gourdel, Tomasz Kociumaka, Jakub Radoszewski, and Tatiana Starikovskaya


Abstract
In the problem of the longest common substring with k mismatches we are given two strings X, Y and must find the maximal length 𝓁 such that there is a length-𝓁 substring of X and a length-𝓁 substring of Y that differ in at most k positions. The length 𝓁 can be used as a robust measure of similarity between X, Y. In this work, we develop new approximation algorithms for computing 𝓁 that are significantly more efficient than previously known solutions from the theoretical point of view. Our approach is simple and practical, which we confirm via an experimental evaluation, and is probably close to optimal, as we demonstrate via a conditional lower bound.
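
The quantity 𝓁 can be computed exactly by a slow baseline, which is useful for validating approximations on small inputs; the Python sketch below is illustrative and not one of the paper's algorithms.

def lcs_with_k_mismatches(X, Y, k):
    # Maximal l such that some X[i:i+l] and Y[j:j+l] differ in at most k
    # positions; naive extension from every pair of starting positions.
    best = 0
    for i in range(len(X)):
        for j in range(len(Y)):
            mismatches, l = 0, 0
            while i + l < len(X) and j + l < len(Y):
                if X[i + l] != Y[j + l]:
                    mismatches += 1
                    if mismatches > k:
                        break
                l += 1
            best = max(best, l)
    return best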

Cite as

Garance Gourdel, Tomasz Kociumaka, Jakub Radoszewski, and Tatiana Starikovskaya. Approximating Longest Common Substring with k mismatches: Theory and Practice. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 16:1-16:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{gourdel_et_al:LIPIcs.CPM.2020.16,
  author =	{Gourdel, Garance and Kociumaka, Tomasz and Radoszewski, Jakub and Starikovskaya, Tatiana},
  title =	{{Approximating Longest Common Substring with k mismatches: Theory and Practice}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{16:1--16:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.16},
  URN =		{urn:nbn:de:0030-drops-121410},
  doi =		{10.4230/LIPIcs.CPM.2020.16},
  annote =	{Keywords: approximation algorithms, string similarity, LSH, conditional lower bounds}
}
Document
String Factorizations Under Various Collision Constraints

Authors: Niels Grüttemeier, Christian Komusiewicz, Nils Morawietz, and Frank Sommer


Abstract
In the NP-hard Equality-Free String Factorization problem, we are given a string S and ask whether S can be partitioned into k factors that are pairwise distinct. We describe a randomized algorithm for Equality-Free String Factorization with running time 2^k⋅ k^{𝒪(1)}+𝒪(n) improving over previous algorithms with running time k^{𝒪(k)}+𝒪(n) [Schmid, TCS 2016; Mincu and Popa, Proc. SOFSEM 2020]. Our algorithm works for the generalization of Equality-Free String Factorization where equality can be replaced by an arbitrary polynomial-time computable equivalence relation on strings. We also consider two factorization problems to which this algorithm does not apply, namely Prefix-Free String Factorization where we ask for a factorization of size k such that no factor is a prefix of another factor and Substring-Free String Factorization where we ask for a factorization of size k such that no factor is a substring of another factor. We show that these two problems are NP-hard as well. Then, we show that Prefix-Free String Factorization with the prefix-free relation is fixed-parameter tractable with respect to k by providing a polynomial problem kernel. Finally, we show a generic ILP formulation for R-Free String Factorization where R is an arbitrary relation on strings. This formulation improves over a previous one for Equality-Free String Factorization in terms of the number of variables.
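
A small backtracking routine makes the base problem concrete; the Python sketch below is illustrative and exponential in the worst case, not the 2^k·k^O(1)+O(n) algorithm of the paper.

def equality_free_factorization(S, k):
    # Return k pairwise distinct non-empty factors whose concatenation is S,
    # or None if no such factorization exists (naive backtracking).
    def go(pos, remaining, used):
        if remaining == 0:
            return [] if pos == len(S) else None
        for end in range(pos + 1, len(S) - remaining + 2):
            factor = S[pos:end]
            if factor not in used:
                rest = go(end, remaining - 1, used | {factor})
                if rest is not None:
                    return [factor] + rest
        return None
    return go(0, k, frozenset())

# e.g. equality_free_factorization("aab", 2) == ['a', 'ab']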

Cite as

Niels Grüttemeier, Christian Komusiewicz, Nils Morawietz, and Frank Sommer. String Factorizations Under Various Collision Constraints. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 17:1-17:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{gruttemeier_et_al:LIPIcs.CPM.2020.17,
  author =	{Gr\"{u}ttemeier, Niels and Komusiewicz, Christian and Morawietz, Nils and Sommer, Frank},
  title =	{{String Factorizations Under Various Collision Constraints}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{17:1--17:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.17},
  URN =		{urn:nbn:de:0030-drops-121428},
  doi =		{10.4230/LIPIcs.CPM.2020.17},
  annote =	{Keywords: NP-hard problem, fixed-parameter algorithms, collision-aware string partitioning}
}
Document
k-Approximate Quasiperiodicity under Hamming and Edit Distance

Authors: Aleksander Kędzierski and Jakub Radoszewski


Abstract
Quasiperiodicity in strings was introduced almost 30 years ago as an extension of string periodicity. The basic notions of quasiperiodicity are cover and seed. A cover of a text T is a string whose occurrences in T cover all positions of T. A seed of text T is a cover of a superstring of T. In various applications exact quasiperiodicity is still not sufficient due to the presence of errors. We consider approximate notions of quasiperiodicity, for which we allow approximate occurrences in T with a small Hamming, Levenshtein or weighted edit distance. In previous work, Sim et al. (2002) and Christodoulakis et al. (2005) showed that computing approximate covers and seeds, respectively, under weighted edit distance is NP-hard. They therefore considered restricted approximate covers and seeds, which need to be factors of the original string T, and presented polynomial-time algorithms for computing them. Further algorithms, considering approximate occurrences with Hamming distance bounded by k, were given in several contributions by Guth et al. They also studied relaxed approximate quasiperiods that do not need to cover all positions of T. In the case of large data, the exponents in the polynomial time complexity play a crucial role. We present more efficient algorithms for computing restricted approximate covers and seeds. In particular, we improve upon the complexities of many of the aforementioned algorithms, also for relaxed quasiperiods. Our solutions are especially efficient if the number (or total cost) of allowed errors is bounded. We also show NP-hardness of computing non-restricted approximate covers and seeds under Hamming distance. Approximate covers were studied in three recent contributions at CPM over the last three years. However, these works consider a different definition of an approximate cover of T, that is, the shortest exact cover of a string T' with the smallest Hamming distance from T.
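
The exact notion translates into a simple checker, which may help when parsing the approximate variants; the Python sketch below is illustrative and not from the paper (a seed check would additionally allow occurrences overhanging the ends of T).

def is_cover(C, T):
    # C is a cover of T if the occurrences of C in T cover every position of T.
    if not C:
        return False
    covered = [False] * len(T)
    start = T.find(C)
    while start != -1:
        for i in range(start, start + len(C)):
            covered[i] = True
        start = T.find(C, start + 1)
    return all(covered)

# e.g. is_cover("aba", "abababa") == True while is_cover("aba", "ababbaba") == False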

Cite as

Aleksander Kędzierski and Jakub Radoszewski. k-Approximate Quasiperiodicity under Hamming and Edit Distance. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 18:1-18:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{kedzierski_et_al:LIPIcs.CPM.2020.18,
  author =	{K\k{e}dzierski, Aleksander and Radoszewski, Jakub},
  title =	{{k-Approximate Quasiperiodicity under Hamming and Edit Distance}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{18:1--18:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.18},
  URN =		{urn:nbn:de:0030-drops-121437},
  doi =		{10.4230/LIPIcs.CPM.2020.18},
  annote =	{Keywords: approximate cover, approximate seed, enhanced cover, Hamming distance, edit distance}
}
Document
Longest Common Subsequence on Weighted Sequences

Authors: Evangelos Kipouridis and Kostas Tsichlas


Abstract
We consider the general problem of the Longest Common Subsequence (LCS) on weighted sequences. Weighted sequences are an extension of classical strings, where in each position every letter of the alphabet may occur with some probability. Previous results presented a PTAS and noticed that no FPTAS is possible unless P=NP. In this paper we essentially close the gap between upper and lower bounds by improving both. First of all, we provide an EPTAS for bounded alphabets (which is the most natural case), and prove that there does not exist any EPTAS for unbounded alphabets unless FPT=W[1]. Furthermore, under the Exponential Time Hypothesis, we provide a lower bound which shows that no significantly better PTAS can exist for unbounded alphabets. As a side note, we prove that it is sufficient to work with only one threshold in the general variant of the problem.

Cite as

Evangelos Kipouridis and Kostas Tsichlas. Longest Common Subsequence on Weighted Sequences. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 19:1-19:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{kipouridis_et_al:LIPIcs.CPM.2020.19,
  author =	{Kipouridis, Evangelos and Tsichlas, Kostas},
  title =	{{Longest Common Subsequence on Weighted Sequences}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{19:1--19:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.19},
  URN =		{urn:nbn:de:0030-drops-121443},
  doi =		{10.4230/LIPIcs.CPM.2020.19},
  annote =	{Keywords: WLCS, LCS, weighted sequences, approximation algorithms, lower bound}
}
Document
Parameterized Algorithms for Matrix Completion with Radius Constraints

Authors: Tomohiro Koana, Vincent Froese, and Rolf Niedermeier


Abstract
Considering matrices with missing entries, we study NP-hard matrix completion problems where the resulting completed matrix should have limited (local) radius. In the pure radius version, this means that the goal is to fill in the entries such that there exists a "center string" which has Hamming distance to all matrix rows as small as possible. In stringology, this problem is also known as Closest String with Wildcards. In the local radius version, the requested center string must be one of the rows of the completed matrix. Hermelin and Rozenberg [CPM 2014, TCS 2016] performed a parameterized complexity analysis for Closest String with Wildcards. We answer one of their open questions, fix a bug concerning a fixed-parameter tractability result in their work, and improve some running time upper bounds. For the local radius case, we reveal a computational complexity dichotomy. In general, our results indicate that, although being NP-hard as well, this variant often allows for faster (fixed-parameter) algorithms.
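
The two objectives can be stated concretely for an already completed matrix (no missing entries left); the small Python helpers below are illustrative and not from the paper.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def radius_for_center(rows, center):
    # Radius realised by a given center string: its maximum Hamming distance to a row.
    return max(hamming(center, row) for row in rows)

def local_radius(rows):
    # Local radius: the center string must itself be one of the rows.
    return min(radius_for_center(rows, r) for r in rows)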

Cite as

Tomohiro Koana, Vincent Froese, and Rolf Niedermeier. Parameterized Algorithms for Matrix Completion with Radius Constraints. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 20:1-20:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{koana_et_al:LIPIcs.CPM.2020.20,
  author =	{Koana, Tomohiro and Froese, Vincent and Niedermeier, Rolf},
  title =	{{Parameterized Algorithms for Matrix Completion with Radius Constraints}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{20:1--20:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.20},
  URN =		{urn:nbn:de:0030-drops-121456},
  doi =		{10.4230/LIPIcs.CPM.2020.20},
  annote =	{Keywords: fixed-parameter tractability, consensus string problems, Closest String, Closest String with Wildcards}
}
Document
In-Place Bijective Burrows-Wheeler Transforms

Authors: Dominik Köppl, Daiki Hashimoto, Diptarama Hendrian, and Ayumi Shinohara


Abstract
One of the most well-known variants of the Burrows-Wheeler transform (BWT) [Burrows and Wheeler, 1994] is the bijective BWT (BBWT) [Gil and Scott, arXiv 2012], which applies the extended BWT (EBWT) [Mantaci et al., TCS 2007] to the multiset of Lyndon factors of a given text. Since the EBWT is invertible, the BBWT is a bijective transform in the sense that the inverse image of the EBWT restores this multiset of Lyndon factors such that the original text can be obtained by sorting these factors in non-increasing order. In this paper, we present algorithms constructing or inverting the BBWT in-place using quadratic time. We also present conversions from the BBWT to the BWT, or vice versa, either (a) in-place using quadratic time, or (b) in the run-length compressed setting using 𝒪(n lg r / lg lg r) time with 𝒪(r lg n) bits, where r is the sum of character runs in the BWT and the BBWT.

Cite as

Dominik Köppl, Daiki Hashimoto, Diptarama Hendrian, and Ayumi Shinohara. In-Place Bijective Burrows-Wheeler Transforms. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 21:1-21:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


BibTeX

@InProceedings{koppl_et_al:LIPIcs.CPM.2020.21,
  author =	{K\"{o}ppl, Dominik and Hashimoto, Daiki and Hendrian, Diptarama and Shinohara, Ayumi},
  title =	{{In-Place Bijective Burrows-Wheeler Transforms}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{21:1--21:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.21},
  URN =		{urn:nbn:de:0030-drops-121463},
  doi =		{10.4230/LIPIcs.CPM.2020.21},
  annote =	{Keywords: In-Place Algorithms, Burrows-Wheeler transform, Lyndon words}
}
Document
Genomic Problems Involving Copy Number Profiles: Complexity and Algorithms

Authors: Manuel Lafond, Binhai Zhu, and Peng Zou


Abstract
Recently, driven by genomic sequence analysis in several types of cancer, genomic data based on copy number profiles (CNPs for short) have become increasingly popular. A CNP is a vector in which each component is a non-negative integer representing the number of copies of a specific segment of interest. The motivation is that in late-stage cancers of certain types, the genomes evolve rapidly through segmental duplications and deletions, and hence obtaining the exact sequences becomes difficult. Instead, the number of copies of important segments can be predicted from expression analysis and carries important biological information. Therefore, significant research has recently been devoted to the analysis of genomic data represented as CNPs. In this paper, we present two streams of results. The first consists of negative results on two open problems regarding the computational complexity of the Minimum Copy Number Generation (MCNG) problem posed by Qingge et al. in 2018. The MCNG problem is defined as follows: given a string S in which each character represents a gene or segment, and a CNP C, compute a string T from S, using the minimum number of segmental duplications and deletions, such that cnp(T)=C. Qingge et al. showed that the problem is NP-hard if the duplications are tandem, and left open the question of whether the problem remains NP-hard if arbitrary duplications and/or deletions are allowed. We answer this question affirmatively; in fact, we prove that it is NP-hard even to obtain a constant-factor approximation. This is achieved through a general-purpose lemma on set-cover reductions that require an exact cover in one direction, but not the other, which might be of independent interest. We also prove that the corresponding parameterized version is W[1]-hard, answering another open question by Qingge et al. The second stream of results is positive and is based on a new (and more general) problem regarding CNPs. The Copy Number Profile Conforming (CNPC) problem is formally defined as follows: given two CNPs C₁ and C₂, compute two strings S₁ and S₂ with cnp(S₁)=C₁ and cnp(S₂)=C₂ such that the distance between S₁ and S₂, d(S₁,S₂), is minimized. Here, d(S₁,S₂) is a very general term: it can be any genome rearrangement distance (e.g., reversal, transposition, or tandem duplication distance). We take the first step by showing that if d(S₁,S₂) is measured by the breakpoint distance, then the problem is polynomially solvable. We expect that this will trigger related research along this line in the near future.
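
For illustration only (the paper's reductions and algorithms are far more involved), the following Python sketch computes the copy number profile of a string and a simple unsigned breakpoint count; the segment labels, the example strings, and this particular breakpoint definition are assumptions, since breakpoint distance comes in several variants.

from collections import Counter

def cnp(s, universe):
    # Copy number profile: one copy count per segment label in `universe`.
    counts = Counter(s)
    return [counts[x] for x in universe]

def breakpoints(s, t):
    # Adjacencies of s that are not adjacencies of t (one simple,
    # unsigned variant of the breakpoint count).
    adjacencies_t = {(t[i], t[i + 1]) for i in range(len(t) - 1)}
    return sum((s[i], s[i + 1]) not in adjacencies_t for i in range(len(s) - 1))

print(cnp("abcabda", "abcd"))         # [3, 2, 1, 1]
print(breakpoints("abcabda", "abcdab"))  # 2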

Cite as

Manuel Lafond, Binhai Zhu, and Peng Zou. Genomic Problems Involving Copy Number Profiles: Complexity and Algorithms. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 22:1-22:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


Copy BibTex To Clipboard

@InProceedings{lafond_et_al:LIPIcs.CPM.2020.22,
  author =	{Lafond, Manuel and Zhu, Binhai and Zou, Peng},
  title =	{{Genomic Problems Involving Copy Number Profiles: Complexity and Algorithms}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{22:1--22:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.22},
  URN =		{urn:nbn:de:0030-drops-121471},
  doi =		{10.4230/LIPIcs.CPM.2020.22},
  annote =	{Keywords: Computational genomics, cancer genomics, copy number profiles, NP-hardness, approximation algorithms, FPT algorithms}
}
Document
Compressed Orthogonal Search on Suffix Arrays with Applications to Range LCP

Authors: Kotaro Matsuda, Kunihiko Sadakane, Tatiana Starikovskaya, and Masakazu Tateshita


Abstract
We propose a space-efficient data structure for orthogonal range search on suffix arrays. For the general two-dimensional orthogonal range search problem on a set of n points, there exists an n log n (1+o(1))-bit data structure supporting O(log n)-time counting queries [Mäkinen, Navarro 2007], and this space matches the information-theoretic lower bound. However, if we restrict attention to point sets representing suffix arrays, there is a chance to obtain a more space-efficient data structure. We show that this is indeed possible. Namely, we propose a data structure for orthogonal range search on suffix arrays which uses O((1/ε) n (H₀+1)) bits, where H₀ is the order-0 entropy of the string, and answers a counting query in O(n^ε) time for any constant ε>0. As an application, we give an O((1/ε) n (H₀+1))-bit data structure for the range LCP problem.
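
To spell out the query being supported, here is a naive Python baseline (an assumed toy setup, not the proposed compressed structure): the point set is {(k, SA[k])}, and a query counts suffix-array positions whose rank lies in one interval and whose starting position in the text lies in another.

def suffix_array(text):
    # Naive construction by sorting suffixes, sufficient for illustration.
    return sorted(range(len(text)), key=lambda i: text[i:])

def range_count(sa, rank_lo, rank_hi, pos_lo, pos_hi):
    # Count points (k, sa[k]) with rank_lo <= k <= rank_hi and
    # pos_lo <= sa[k] <= pos_hi -- the orthogonal range counting query.
    return sum(pos_lo <= sa[k] <= pos_hi for k in range(rank_lo, rank_hi + 1))

sa = suffix_array("banana")
print(sa)                           # [5, 3, 1, 0, 4, 2]
print(range_count(sa, 1, 4, 0, 3))  # 3: suffixes of rank 1..4 starting in positions 0..3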

Cite as

Kotaro Matsuda, Kunihiko Sadakane, Tatiana Starikovskaya, and Masakazu Tateshita. Compressed Orthogonal Search on Suffix Arrays with Applications to Range LCP. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 23:1-23:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


Copy BibTex To Clipboard

@InProceedings{matsuda_et_al:LIPIcs.CPM.2020.23,
  author =	{Matsuda, Kotaro and Sadakane, Kunihiko and Starikovskaya, Tatiana and Tateshita, Masakazu},
  title =	{{Compressed Orthogonal Search on Suffix Arrays with Applications to Range LCP}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{23:1--23:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.23},
  URN =		{urn:nbn:de:0030-drops-121482},
  doi =		{10.4230/LIPIcs.CPM.2020.23},
  annote =	{Keywords: Orthogonal Range Search, Succinct Data Structure, Suffix Array}
}
Document
Text Indexing and Searching in Sublinear Time

Authors: J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich


Abstract
We introduce the first index that can be built in o(n) time for a text of length n, and can also be queried in o(q) time for a pattern of length q. On an alphabet of size σ, our index uses O(n log σ) bits, is built in O(n log σ / √{log n}) deterministic time, and computes the number of occurrences of the pattern in time O(q/log_σ n + log n log_σ n). Each such occurrence can then be found in O(log n) time. Other trade-offs between the space usage and the cost of reporting occurrences are also possible.
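
For contrast, here is the classical suffix-array baseline in Python against which the construction and query bounds above can be compared: sort all suffixes, then count occurrences with two binary searches. The text, the pattern, and the use of a large sentinel character are illustrative assumptions; this is not the index proposed in the paper.

import bisect

def count_occurrences(text, pattern):
    # Classical baseline: count the suffixes that start with the pattern
    # using binary search. Assumes all text characters compare below '\uffff'.
    suffixes = sorted(text[i:] for i in range(len(text)))
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + "\uffff")
    return hi - lo

print(count_occurrences("abracadabra", "abra"))  # 2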

Cite as

J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Text Indexing and Searching in Sublinear Time. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 24:1-24:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


Copy BibTex To Clipboard

@InProceedings{munro_et_al:LIPIcs.CPM.2020.24,
  author =	{Munro, J. Ian and Navarro, Gonzalo and Nekrich, Yakov},
  title =	{{Text Indexing and Searching in Sublinear Time}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{24:1--24:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.24},
  URN =		{urn:nbn:de:0030-drops-121497},
  doi =		{10.4230/LIPIcs.CPM.2020.24},
  annote =	{Keywords: data structures, string indexes}
}
Document
Chaining with Overlaps Revisited

Authors: Veli Mäkinen and Kristoffer Sahlin


Abstract
Chaining algorithms aim to form a semi-global alignment of two sequences from a set of anchoring local alignments given as input. Depending on the optimization criteria and the exact definition of a chain, there are several O(n log n)-time algorithms that solve this problem optimally, where n is the number of input anchors. In this paper, we focus on a formulation that allows the anchors to overlap in a chain. This formulation was studied by Shibuya and Kurochkin (WABI 2003), but their algorithm comes with no proof of correctness. We revisit and modify their algorithm to consider a strict precedence relation on anchors, adding the derivation required to establish the correctness of the resulting algorithm, which runs in O(n log² n) time on anchors formed by exact matches. With the more relaxed precedence relation considered by Shibuya and Kurochkin, or when anchors are non-nested (such as matches of uniform length, i.e., k-mers), the algorithm takes O(n log n) time. We also establish a connection between chaining with overlaps and the widely studied longest common subsequence problem.
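
As a point of reference (and explicitly not the paper's overlap-aware O(n log² n) algorithm), here is a quadratic Python sketch of the classical non-overlapping co-linear chaining baseline on anchors (x, y, 𝓁), i.e., exact matches of length 𝓁 starting at position x in one sequence and y in the other; the anchor tuples in the example are assumptions.

def chain_nonoverlapping(anchors):
    # Quadratic DP: an anchor (x, y, l) may follow (x', y', l') only if the
    # two anchors are disjoint in both sequences; maximize total anchor length.
    anchors = sorted(anchors)                  # by x-coordinate
    best = [l for (_, _, l) in anchors]
    for i, (xi, yi, li) in enumerate(anchors):
        for j in range(i):
            xj, yj, lj = anchors[j]
            if xj + lj <= xi and yj + lj <= yi:
                best[i] = max(best[i], best[j] + li)
    return max(best, default=0)

# hypothetical anchors (x, y, length)
print(chain_nonoverlapping([(0, 0, 3), (2, 1, 4), (6, 6, 2)]))  # 6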

Cite as

Veli Mäkinen and Kristoffer Sahlin. Chaining with Overlaps Revisited. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 25:1-25:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


Copy BibTex To Clipboard

@InProceedings{makinen_et_al:LIPIcs.CPM.2020.25,
  author =	{M\"{a}kinen, Veli and Sahlin, Kristoffer},
  title =	{{Chaining with Overlaps Revisited}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{25:1--25:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.25},
  URN =		{urn:nbn:de:0030-drops-121500},
  doi =		{10.4230/LIPIcs.CPM.2020.25},
  annote =	{Keywords: Sparse Dynamic Programming, Chaining, Maximal Exact Matches, Longest Common Subsequence}
}
Document
DAWGs for Parameterized Matching: Online Construction and Related Indexing Structures

Authors: Katsuhito Nakashima, Noriki Fujisato, Diptarama Hendrian, Yuto Nakashima, Ryo Yoshinaka, Shunsuke Inenaga, Hideo Bannai, Ayumi Shinohara, and Masayuki Takeda


Abstract
Two strings x and y over Σ ∪ Π of equal length are said to parameterized match (p-match) if there is a renaming bijection f:Σ ∪ Π → Σ ∪ Π that is the identity on Σ and transforms x into y (or vice versa). The p-matching problem is to look for substrings in a text that p-match a given pattern. In this paper, we propose parameterized suffix automata (p-suffix automata) and parameterized directed acyclic word graphs (PDAWGs), which are the p-matching versions of suffix automata and DAWGs. While suffix automata and DAWGs are equivalent for standard strings, we show that p-suffix automata can have Θ(n²) nodes and edges, whereas PDAWGs have only O(n) nodes and edges, where n is the length of the input string. We also give an O(n |Π| log (|Π| + |Σ|))-time, O(n)-space algorithm that builds the PDAWG in a left-to-right online manner. As a byproduct, it is shown that the parameterized suffix tree for the reversed string can also be built within the same time and space bounds, in a right-to-left online manner.
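
The p-match relation itself is easy to check pair-wise; the sketch below (in Python, with an assumed static alphabet and example strings) greedily builds the renaming bijection on the parameter alphabet and is only meant to illustrate the definition, not the indexing structures of the paper.

def p_match(x, y, static):
    # x and y p-match iff some bijection, identity on `static`, maps x to y
    # position-wise; build the partial bijection greedily and check consistency.
    if len(x) != len(y):
        return False
    forward, backward = {}, {}
    for a, b in zip(x, y):
        if (a in static) != (b in static):
            return False
        if a in static:
            if a != b:
                return False
        elif forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True

# static alphabet {'a'}, parameter alphabet {'x', 'y', 'z'}
print(p_match("xaxz", "yayx", static="a"))  # True  (x -> y, z -> x)
print(p_match("xay", "xax", static="a"))    # False (x -> x conflicts with y -> x)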

Cite as

Katsuhito Nakashima, Noriki Fujisato, Diptarama Hendrian, Yuto Nakashima, Ryo Yoshinaka, Shunsuke Inenaga, Hideo Bannai, Ayumi Shinohara, and Masayuki Takeda. DAWGs for Parameterized Matching: Online Construction and Related Indexing Structures. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 26:1-26:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


Copy BibTex To Clipboard

@InProceedings{nakashima_et_al:LIPIcs.CPM.2020.26,
  author =	{Nakashima, Katsuhito and Fujisato, Noriki and Hendrian, Diptarama and Nakashima, Yuto and Yoshinaka, Ryo and Inenaga, Shunsuke and Bannai, Hideo and Shinohara, Ayumi and Takeda, Masayuki},
  title =	{{DAWGs for Parameterized Matching: Online Construction and Related Indexing Structures}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{26:1--26:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.26},
  URN =		{urn:nbn:de:0030-drops-121512},
  doi =		{10.4230/LIPIcs.CPM.2020.26},
  annote =	{Keywords: parameterized matching, suffix trees, DAWGs, suffix automata}
}
Document
On Extensions of Maximal Repeats in Compressed Strings

Authors: Julian Pape-Lange


Abstract
This paper provides upper bounds for several subsets of maximal repeats and maximal pairs in compressed strings and also presents a previously unknown relationship between maximal pairs and the run-length Burrows-Wheeler transform. This relationship is used to obtain a different proof of the Burrows-Wheeler conjecture, which was recently proven by Kempa and Kociumaka in "Resolution of the Burrows-Wheeler Transform Conjecture". More formally, this paper proves that the run-length Burrows-Wheeler transform of a string S with z_S LZ77-factors has at most 73(log₂ |S|)(z_S+2)² runs, and, if S does not contain q-th powers, the number of arcs in the compacted directed acyclic word graph of S is bounded from above by 18q(1+log_q |S|)(z_S+2)².
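
To make the quantities in these bounds tangible, the short Python sketch below computes, for a toy string, the number of runs of a naively built Burrows-Wheeler transform and the number of factors of one simple greedy LZ77-style parse; both the sentinel character and this particular (non-self-referential) LZ77 variant are assumptions, as definitions in the literature differ in details.

def bwt(s):
    # Naive BWT with a unique smallest sentinel appended.
    s += "\0"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rotations)

def run_count(s):
    # Number of maximal runs of equal characters.
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])

def lz77_factor_count(s):
    # Greedy, non-self-referential parse: each factor is the longest prefix
    # of the remaining suffix occurring in the already-parsed prefix,
    # or a single fresh character if there is no such occurrence.
    i, z = 0, 0
    while i < len(s):
        l = 0
        while i + l < len(s) and s[i:i + l + 1] in s[:i]:
            l += 1
        i += max(l, 1)
        z += 1
    return z

s = "banana"
print(run_count(bwt(s)))     # number of runs of the BWT of s
print(lz77_factor_count(s))  # number of LZ77-style factors of s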

Cite as

Julian Pape-Lange. On Extensions of Maximal Repeats in Compressed Strings. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 27:1-27:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


Copy BibTex To Clipboard

@InProceedings{papelange:LIPIcs.CPM.2020.27,
  author =	{Pape-Lange, Julian},
  title =	{{On Extensions of Maximal Repeats in Compressed Strings}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{27:1--27:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.27},
  URN =		{urn:nbn:de:0030-drops-121523},
  doi =		{10.4230/LIPIcs.CPM.2020.27},
  annote =	{Keywords: Maximal repeats, Extensions of maximal repeats, Combinatorics on compressed strings, LZ77, Burrows-Wheeler transform, Burrows-Wheeler transform conjecture, Compact suffix automata, CDAWGs}
}
Document
Faster Binary Mean Computation Under Dynamic Time Warping

Authors: Nathan Schaar, Vincent Froese, and Rolf Niedermeier


Abstract
Many consensus string problems are based on Hamming distance. We replace Hamming distance by dynamic time warping distance, which is more flexible (e.g., it easily copes with input strings of different lengths) and is best known from applications in time series mining. Doing so, we study the problem of finding a mean string that minimizes the sum of (squared) dynamic time warping distances to a given set of input strings. While this problem is known to be NP-hard (even for strings over a three-element alphabet), we address the binary alphabet case, which is known to be polynomial-time solvable. We significantly improve on a previously known algorithm in terms of worst-case running time. Moreover, we also demonstrate the practical usefulness of one of our algorithms in experiments with real-world and synthetic data. Finally, we identify special cases solvable in linear time (e.g., finding a mean of only two binary input strings) and report some empirical findings concerning combinatorial properties of optimal means.
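
For readers unfamiliar with the distance involved, here is the textbook quadratic dynamic program for (squared-cost) dynamic time warping between two sequences in Python; the paper's contribution concerns computing a mean string under this distance, which this sketch does not attempt, and the example sequences are assumptions.

def dtw(a, b):
    # Standard O(|a|*|b|) DP with squared pointwise cost; entry D[i][j] is
    # the cheapest warping of a[:i] onto b[:j].
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([0, 1, 1, 0], [0, 1, 0]))  # 0: warping aligns both 1s of the first string with the single 1 of the second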

Cite as

Nathan Schaar, Vincent Froese, and Rolf Niedermeier. Faster Binary Mean Computation Under Dynamic Time Warping. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 28:1-28:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


Copy BibTex To Clipboard

@InProceedings{schaar_et_al:LIPIcs.CPM.2020.28,
  author =	{Schaar, Nathan and Froese, Vincent and Niedermeier, Rolf},
  title =	{{Faster Binary Mean Computation Under Dynamic Time Warping}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{28:1--28:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.28},
  URN =		{urn:nbn:de:0030-drops-121538},
  doi =		{10.4230/LIPIcs.CPM.2020.28},
  annote =	{Keywords: consensus string problems, time series averaging, minimum 1-separated sum, sparse strings}
}
Document
Approximating Text-To-Pattern Distance via Dimensionality Reduction

Authors: Przemysław Uznański


Abstract
Text-to-pattern distance is a fundamental problem in string matching: given a pattern of length m and a text of length n over an integer alphabet, we are asked to compute the distance between the pattern and the text at every location. The distance function can be, e.g., Hamming distance or 𝓁_p distance for some parameter p > 0. Almost all state-of-the-art exact and approximate algorithms developed over the past ∼40 years use FFT as a black box. In this work we present 𝒪~(n/ε²)-time algorithms for (1±ε)-approximation of 𝓁₂ distances and an 𝒪~(n/ε³)-time algorithm for approximating Hamming and 𝓁₁ distances, all without the use of FFT. This is independent of the very recent development by Chan et al. [STOC 2020], where an 𝒪(n/ε²)-time algorithm for Hamming distances not using FFT was presented; although their algorithm is much more "combinatorial", our techniques apply to norms other than Hamming.
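
For concreteness, the naive quadratic computation that the paper approximates is shown below in Python for the Hamming case; the example text and pattern are assumptions, and the sketch has nothing to do with the dimensionality-reduction techniques of the paper.

def hamming_profile(text, pattern):
    # Naive O(n*m) text-to-pattern Hamming distances: one value per
    # alignment of the pattern against the text.
    m = len(pattern)
    return [sum(text[i + j] != pattern[j] for j in range(m))
            for i in range(len(text) - m + 1)]

print(hamming_profile("abracadabra", "abr"))  # [0, 3, 3, 2, 3, 2, 3, 0, 3]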

Cite as

Przemysław Uznański. Approximating Text-To-Pattern Distance via Dimensionality Reduction. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 29:1-29:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


Copy BibTex To Clipboard

@InProceedings{uznanski:LIPIcs.CPM.2020.29,
  author =	{Uzna\'{n}ski, Przemys{\l}aw},
  title =	{{Approximating Text-To-Pattern Distance via Dimensionality Reduction}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{29:1--29:11},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.29},
  URN =		{urn:nbn:de:0030-drops-121542},
  doi =		{10.4230/LIPIcs.CPM.2020.29},
  annote =	{Keywords: Approximate Pattern Matching, 𝓁₂ Distance, 𝓁₁ Distance, Hamming Distance, Approximation Algorithms, Combinatorial Algorithms}
}
