eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
1
418
10.4230/LIPIcs.CPM.2020
article
LIPIcs, Volume 161, CPM 2020, Complete Volume
Gørtz, Inge Li
1
https://orcid.org/0000-0002-8322-4952
Weimann, Oren
2
https://orcid.org/0000-0002-4510-7552
Technical University of Denmark, DTU Compute, Lyngby, Denmark
University of Haifa, Israel
LIPIcs, Volume 161, CPM 2020, Complete Volume
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020/LIPIcs.CPM.2020.pdf
LIPIcs, Volume 161, CPM 2020, Complete Volume
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
0:i
0:xvi
10.4230/LIPIcs.CPM.2020.0
article
Front Matter, Table of Contents, Preface, Conference Organization
Gørtz, Inge Li
1
https://orcid.org/0000-0002-8322-4952
Weimann, Oren
2
https://orcid.org/0000-0002-4510-7552
Technical University of Denmark, DTU Compute, Lyngby, Denmark
University of Haifa, Israel
Front Matter, Table of Contents, Preface, Conference Organization
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.0/LIPIcs.CPM.2020.0.pdf
Front Matter
Table of Contents
Preface
Conference Organization
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
1:1
1:1
10.4230/LIPIcs.CPM.2020.1
article
Algebraic Algorithms for Finding Patterns in Graphs (Invited Talk)
Husfeldt, Thore
1
2
https://orcid.org/0000-0001-9078-4512
IT University of Copenhagen, Denmark
Lund University, Sweden
I will give a gentle introduction to algebraic graph algorithms by showing how to determine if a given graph contains a simple path of length k. This is a famous problem admitting a beautiful and widely-known algorithm, namely the colour-coding method of Alon, Yuster and Zwick (1995). Starting from this entirely combinatorial approach, I will carefully develop an algebraic perspective on the same problem. First, I will explain how the colour-coding algorithm can be understood as the evaluation of a well-known expression (sometimes called the "walk-sum" of the graph) in a commutative algebra called the zeon algebra. From there, I will introduce the exterior algebra and present the algebraic framework recently developed with Brand and Dell (2018).
The presentation is aimed at a combinatorially-minded audience largely innocent of abstract algebra.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.1/LIPIcs.CPM.2020.1.pdf
paths
exterior algebra
wedge product
color-coding
parameterized complexity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
2:1
2:11
10.4230/LIPIcs.CPM.2020.2
article
Finding the Anticover of a String
Alzamel, Mai
1
2
https://orcid.org/0000-0002-0862-7452
Conte, Alessio
3
https://orcid.org/0000-0003-0770-2235
Denzumi, Shuhei
4
Grossi, Roberto
3
Iliopoulos, Costas S.
1
Kurita, Kazuhiro
5
https://orcid.org/0000-0002-7638-3322
Wasa, Kunihiro
6
Department of Informatics, King’s College London, UK
Department of Computer Science, King Saud University, KSA
Dipartimento di Informatica, Università di Pisa, Italy
Graduate School of Information Science and Technology, The University of Tokyo, Japan
IST, Hokkaido University, Sapporo, Japan
National Institute of Informatics, Tokyo, Japan
A k-anticover of a string x is a set of pairwise distinct factors of x of equal length k, such that every symbol of x is contained in an occurrence of at least one of those factors. The existence of a k-anticover can be seen as a notion of non-redundancy, which has applications in computational biology, where anticovers are associated with various non-regulatory mechanisms. In this paper we address the complexity of the problem of finding a k-anticover of a string x if it exists, showing that the decision problem is NP-complete on general strings for k ≥ 3. We also show that the problem admits a polynomial-time solution for k=2. For unbounded k, we provide an exact exponential algorithm to find a k-anticover of a string of length n (or determine that none exists), which runs in O*(min {3^{(n-k)/3}, (k(k+1)/2)^{n/(k+1)}}) time using polynomial space.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.2/LIPIcs.CPM.2020.2.pdf
Anticover
String algorithms
Stringology
NP-complete
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
3:1
3:13
10.4230/LIPIcs.CPM.2020.3
article
Double String Tandem Repeats
Amir, Amihood
1
Butman, Ayelet
2
Landau, Gad M.
3
4
Marcus, Shoshana
5
Sokol, Dina
6
Department of Computer Science, Bar Ilan University, Ramat-Gan, 52900, Israel
Department of Computer Science, Holon Institute of Technology, Golomb St 52, Holon, 5810201, Israel
Department of Computer Science, University of Haifa, Haifa 31905, Israel
NYU Tandon School of Engineering, New York University, Six MetroTech Center, Brooklyn, NY 11201, USA
Department of Mathematics and Computer Science, Kingsborough Community College of the City University of New York, 2001 Oriental Boulevard, Brooklyn, NY 11235, USA
Department of Computer and Information Science, Brooklyn College and The Graduate Center, City University of New York, Brooklyn, NY, USA
A tandem repeat is an occurrence of two adjacent identical substrings. In this paper, we introduce the notion of a double string, which consists of two parallel strings, and we study the problem of locating all tandem repeats in a double string. The problem introduced here has applications beyond actual double strings, as we illustrate by solving two different problems with the algorithm for the double string tandem repeats problem. The first problem is that of finding all corner-sharing tandems in a 2-dimensional text, defined by Apostolico and Brimkov. The second problem is that of finding all scaled tandem repeats in a 1-dimensional text, where a scaled tandem repeat is defined as a string UU' such that U' is a discrete scale of U. In addition to the algorithms for exact tandem repeats, we also present algorithms that solve the problem in the inexact sense, allowing up to k mismatches. We believe that this framework will open a new perspective for other problems in the future.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.3/LIPIcs.CPM.2020.3.pdf
double string
tandem repeat
2-dimensional
scale
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
4:1
4:11
10.4230/LIPIcs.CPM.2020.4
article
Efficient Tree-Structured Categorical Retrieval
Belazzougui, Djamal
1
Kucherov, Gregory
2
3
https://orcid.org/0000-0001-5899-5424
CAPA, DTISI, Centre de Recherche sur l'Information Scientifique et Technique, Algiers, Algeria
CNRS and LIGM/Univ Gustave Eiffel, Marne-la-Vallée, France
Skolkovo Institute of Science and Technology, Moscow, Russia
We study a document retrieval problem in the new framework where D text documents are organized in a category tree with a pre-defined number h of categories. This situation occurs e.g. with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern p and a category (level in the category tree), we wish to efficiently retrieve the t categorical units containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses n(log σ(1+o(1)) + log D + O(h)) + O(Δ) bits of space and O(|p|+t) query time, where n is the total length of the documents, σ the size of the alphabet used in the documents, and Δ the total number of nodes in the category tree. Another solution uses n(log σ(1+o(1)) + O(log D)) + O(Δ) + O(D log n) bits of space and O(|p|+t log D) query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.4/LIPIcs.CPM.2020.4.pdf
pattern matching
document retrieval
category tree
space-efficient data structures
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
5:1
5:14
10.4230/LIPIcs.CPM.2020.5
article
Time-Space Tradeoffs for Finding a Long Common Substring
Ben-Nun, Stav
1
Golan, Shay
1
https://orcid.org/0000-0001-8357-2802
Kociumaka, Tomasz
1
https://orcid.org/0000-0002-2477-1702
Kraus, Matan
1
Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
We consider the problem of finding, given two documents of total length n, a longest string occurring as a substring of both documents. This problem, known as the Longest Common Substring (LCS) problem, has a classic 𝒪(n)-time solution dating back to the discovery of suffix trees (Weiner, 1973) and their efficient construction for integer alphabets (Farach-Colton, 1997). However, these solutions require Θ(n) space, which is prohibitive in many applications. To address this issue, Starikovskaya and Vildhøj (CPM 2013) showed that for n^{2/3} ≤ s ≤ n, the LCS problem can be solved in 𝒪(s) space and 𝒪̃(n²/s) time. Kociumaka et al. (ESA 2014) generalized this tradeoff to 1 ≤ s ≤ n, thus providing a smooth time-space tradeoff from constant to linear space. In this paper, we obtain a significant speed-up for instances where the length L of the sought LCS is large. For 1 ≤ s ≤ n, we show that the LCS problem can be solved in 𝒪(s) space and 𝒪̃(n²/(L⋅s) +n) time. The result is based on techniques originating from the LCS with Mismatches problem (Flouri et al., 2015; Charalampopoulos et al., CPM 2018), on space-efficient locally consistent parsing (Birenzwige et al., SODA 2020), and on the structure of maximal repetitions (runs) in the input documents.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.5/LIPIcs.CPM.2020.5.pdf
longest common substring
time-space tradeoff
local consistency
periodicity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
6:1
6:16
10.4230/LIPIcs.CPM.2020.6
article
On Two Measures of Distance Between Fully-Labelled Trees
Bernardini, Giulia
1
https://orcid.org/0000-0001-6647-088X
Bonizzoni, Paola
1
Gawrychowski, Paweł
2
University of Milano - Bicocca, Milan, Italy
Institute of Computer Science, University of Wrocław, Poland
The last decade brought a significant increase in the amount of data and a variety of new inference methods for reconstructing the detailed evolutionary history of various cancers. This brings the need to design efficient procedures for comparing rooted trees representing the evolution of mutations in tumor phylogenies. Motivated by this necessity, Bernardini et al. [CPM 2019] recently introduced a notion of rearrangement distance for fully-labelled trees. This notion originates from two operations: one that permutes the labels of the nodes, and one that affects the topology of the tree. Each operation alone defines a distance that can be computed in polynomial time, while the actual rearrangement distance, which combines the two, was proven to be NP-hard.
We answer two questions left open by the previous work. First, what is the complexity of computing the permutation distance? Second, is there a constant-factor approximation algorithm for estimating the rearrangement distance between two arbitrary trees? We answer the first one by showing, via a two-way reduction, that calculating the permutation distance between two trees on n nodes is equivalent, up to polylogarithmic factors, to finding the largest cardinality matching in a sparse bipartite graph. In particular, by plugging in the algorithm of Liu and Sidford [ArXiv 2020], we obtain an 𝒪̃(n^{4/3+o(1)}) time algorithm for computing the permutation distance between two trees on n nodes. Then we answer the second question positively, and design a linear-time constant-factor approximation algorithm that does not need any assumption on the trees.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.6/LIPIcs.CPM.2020.6.pdf
Tree distance
Cancer progression
Approximation algorithms
Fine-grained complexity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
7:1
7:14
10.4230/LIPIcs.CPM.2020.7
article
String Sanitization Under Edit Distance
Bernardini, Giulia
1
https://orcid.org/0000-0001-6647-088X
Chen, Huiping
2
Loukides, Grigorios
2
https://orcid.org/0000-0003-0888-5061
Pisanti, Nadia
3
4
https://orcid.org/0000-0003-3915-7665
Pissis, Solon P.
5
6
4
https://orcid.org/0000-0002-1445-1932
Stougie, Leen
5
6
4
Sweering, Michelle
5
University of Milano - Bicocca, Milan, Italy
King’s College London, UK
University of Pisa, Italy
ERABLE Team, Lyon, France
CWI, Amsterdam, The Netherlands
Vrije Universiteit, Amsterdam, The Netherlands
Let W be a string of length n over an alphabet Σ, k be a positive integer, and 𝒮 be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of 𝒮 occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual’s data and 𝒮 represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in 𝒪(kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in 𝒪(n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Künnemann, FOCS 2015], to ETFS.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.7/LIPIcs.CPM.2020.7.pdf
String algorithms
data sanitization
edit distance
dynamic programming
conditional lower bound
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
8:1
8:15
10.4230/LIPIcs.CPM.2020.8
article
Counting Distinct Patterns in Internal Dictionary Matching
Charalampopoulos, Panagiotis
1
2
https://orcid.org/0000-0002-6024-1557
Kociumaka, Tomasz
3
https://orcid.org/0000-0002-2477-1702
Mohamed, Manal
4
https://orcid.org/0000-0002-1435-5051
Radoszewski, Jakub
2
5
https://orcid.org/0000-0002-0067-6401
Rytter, Wojciech
2
https://orcid.org/0000-0002-9162-6724
Straszyński, Juliusz
2
https://orcid.org/0000-0003-2207-0053
Waleń, Tomasz
2
https://orcid.org/0000-0002-7369-3309
Zuba, Wiktor
2
https://orcid.org/0000-0002-1988-3507
King’s College London, UK
University of Warsaw, Poland
Bar-Ilan University, Ramat Gan, Israel
London, UK
Samsung R&D, Warsaw, Poland
We consider the problem of preprocessing a text T of length n and a dictionary 𝒟 in order to be able to efficiently answer queries CountDistinct(i,j), that is, given i and j return the number of patterns from 𝒟 that occur in the fragment T[i..j]. The dictionary is internal in the sense that each pattern in 𝒟 is given as a fragment of T. This way, the dictionary takes space proportional to the number of patterns d=|𝒟| rather than their total length, which could be Θ(n⋅ d). An 𝒪̃(n+d)-size data structure that answers CountDistinct(i,j) queries 𝒪(log n)-approximately in 𝒪̃(1) time was recently proposed in a work that introduced internal dictionary matching [ISAAC 2019]. Here we present an 𝒪̃(n+d)-size data structure that answers CountDistinct(i,j) queries 2-approximately in 𝒪̃(1) time. Using range queries, for any m, we give an 𝒪̃(min(nd/m,n²/m²)+d)-size data structure that answers CountDistinct(i,j) queries exactly in 𝒪̃(m) time. We also consider the special case when the dictionary consists of all square factors of the string. We design an 𝒪(n log² n)-size data structure that allows us to count distinct squares in a text fragment T[i..j] in 𝒪(log n) time.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.8/LIPIcs.CPM.2020.8.pdf
dictionary matching
internal pattern matching
squares
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
9:1
9:13
10.4230/LIPIcs.CPM.2020.9
article
Dynamic String Alignment
Charalampopoulos, Panagiotis
1
2
https://orcid.org/0000-0002-6024-1557
Kociumaka, Tomasz
3
https://orcid.org/0000-0002-2477-1702
Mozes, Shay
4
https://orcid.org/0000-0001-9262-1821
Department of Informatics, King’s College London, UK
Institute of Informatics, University of Warsaw, Poland
Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
Efi Arazi School of Computer Science, The Interdisciplinary Center Herzliya, Israel
We consider the problem of dynamically maintaining an optimal alignment of two strings, each of length at most n, as they undergo insertions, deletions, and substitutions of letters. The string alignment problem generalizes the longest common subsequence (LCS) problem and the edit distance problem (also with non-unit costs, as long as insertions and deletions cost the same). The conditional lower bound of Backurs and Indyk [J. Comput. 2018] for computing the LCS in the static case implies that strongly sublinear update time for the dynamic string alignment problem would refute the Strong Exponential Time Hypothesis. We essentially match this lower bound when the alignment weights are constants, by showing how to process each update in 𝒪̃(n) time. When the weights are integers bounded in absolute value by some w=n^{𝒪(1)}, we can maintain the alignment in 𝒪̃(n ⋅ min{√n, w}) time per update. For the 𝒪̃(nw)-time algorithm, we heavily rely on Tiskin’s work on semi-local LCS and, in particular, in an implicit way, on his algorithm for computing the (min,+)-product of two simple unit-Monge matrices [Algorithmica 2015]. As for the 𝒪̃(n√n)-time algorithm, we employ efficient data structures for computing distances in planar graphs.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.9/LIPIcs.CPM.2020.9.pdf
string alignment
edit distance
longest common subsequence
(unit-)Monge matrices
(min,+)-product
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
10:1
10:12
10.4230/LIPIcs.CPM.2020.10
article
Unary Words Have the Smallest Levenshtein k-Neighbourhoods
Charalampopoulos, Panagiotis
1
2
https://orcid.org/0000-0002-6024-1557
Pissis, Solon P.
3
4
5
https://orcid.org/0000-0002-1445-1932
Radoszewski, Jakub
2
6
https://orcid.org/0000-0002-0067-6401
Waleń, Tomasz
2
https://orcid.org/0000-0002-7369-3309
Zuba, Wiktor
2
https://orcid.org/0000-0002-1988-3507
Department of Informatics, King’s College London, UK
Institute of Informatics, University of Warsaw, Poland
CWI, Amsterdam, The Netherlands
Vrije Universiteit, Amsterdam, The Netherlands
ERABLE Team, Lyon, France
Samsung R&D, Warsaw, Poland
The edit distance (a.k.a. the Levenshtein distance) between two words is defined as the minimum number of insertions, deletions or substitutions of letters needed to transform one word into another. The Levenshtein k-neighbourhood of a word w is the set of words that are at edit distance at most k from w. This is perhaps the most important concept underlying BLAST, a widely-used tool for comparing biological sequences. A natural combinatorial question is to ask for upper and lower bounds on the size of this set. The answer to this question has important algorithmic implications as well. Myers notes that "such bounds would give a tighter characterisation of the running time of the algorithm" behind BLAST. We show that the size of the Levenshtein k-neighbourhood of any word of length n over an arbitrary alphabet is not smaller than the size of the Levenshtein k-neighbourhood of a unary word of length n, thus providing a tight lower bound on the size of the Levenshtein k-neighbourhood. We remark that this result was posed as a conjecture by Dufresne at WCTA 2019.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.10/LIPIcs.CPM.2020.10.pdf
combinatorics on words
Levenshtein distance
edit distance
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
11:1
11:15
10.4230/LIPIcs.CPM.2020.11
article
Summarizing Diverging String Sequences, with Applications to Chain-Letter Petitions
Commins, Patty
1
2
Liben-Nowell, David
1
Liu, Tina
1
3
Tomlinson, Kiran
1
4
Department of Computer Science, Carleton College, Northfield, MN, USA
Department of Mathematics, University of Minnesota, Minneapolis, MN, USA
Surescripts, Minneapolis, MN, USA
Department of Computer Science, Cornell University, Ithaca, NY, USA
Algorithms to find optimal alignments among strings, or to find a parsimonious summary of a collection of strings, are well studied in a variety of contexts, addressing a wide range of interesting applications. In this paper, we consider chain letters, which contain a growing sequence of signatories added as the letter propagates. The unusual constellation of features exhibited by chain letters (one-ended growth, divergence, and mutation) makes their propagation, and thus the corresponding reconstruction problem, both distinctive and rich. Here, inspired by these chain letters, we formally define the problem of computing an optimal summary of a set of diverging string sequences. From a collection of these sequences of names, with each sequence noisily corresponding to a branch of the unknown tree T representing the letter’s true dissemination, can we efficiently and accurately reconstruct a tree T' ≈ T? In this paper, we give efficient exact algorithms for this summarization problem when the number of sequences is small; for larger sets of sequences, we prove hardness and provide an efficient heuristic algorithm. We evaluate this heuristic on synthetic data sets chosen to emulate real chain letters, showing that our algorithm is competitive with or better than previous approaches, and that it also comes close to finding the true trees in these synthetic datasets.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.11/LIPIcs.CPM.2020.11.pdf
edit distance
tree reconstruction
information propagation
chain letters
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
12:1
12:11
10.4230/LIPIcs.CPM.2020.12
article
Detecting k-(Sub-)Cadences and Equidistant Subsequence Occurrences
Funakoshi, Mitsuru
1
https://orcid.org/0000-0002-2547-1509
Nakashima, Yuto
1
https://orcid.org/0000-0001-6269-9353
Inenaga, Shunsuke
1
2
https://orcid.org/0000-0002-1833-010X
Bannai, Hideo
3
https://orcid.org/0000-0002-6856-5185
Takeda, Masayuki
1
https://orcid.org/0000-0002-6138-1607
Shinohara, Ayumi
4
https://orcid.org/0000-0002-4978-8316
Department of Informatics, Kyushu University, Fukuoka, Japan
PRESTO, Japan Science and Technology Agency, Kawaguchi, Japan
M&D Data Science Center, Tokyo Medical and Dental University, Tokyo, Japan
Graduate School of Information Sciences, Tohoku University, Sendai, Japan
The equidistant subsequence pattern matching problem is considered. Given a pattern string P and a text string T, we say that P is an equidistant subsequence of T if P is a subsequence of T such that consecutive symbols of P in the occurrence are equally spaced. The problem of finding equidistant subsequences can be seen as a generalization of finding (sub-)cadences. We give bit-parallel algorithms that yield o(n²) time algorithms for finding k-(sub-)cadences and equidistant subsequences. Furthermore, O(n log² n) and O(n log n) time algorithms, respectively for equidistant and Abelian equidistant matching for the case |P| = 3, are shown. The algorithms make use of a recently introduced technique for efficiently computing convolutions with linear constraints.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.12/LIPIcs.CPM.2020.12.pdf
string algorithms
pattern matching
bit parallelism
subsequences
cadences
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
13:1
13:14
10.4230/LIPIcs.CPM.2020.13
article
FM-Index Reveals the Reverse Suffix Array
Ganguly, Arnab
1
Gibney, Daniel
2
Hooshmand, Sahar
2
Külekci, M. Oğuzhan
3
Thankachan, Sharma V.
2
Department of Computer Science, University of Wisconsin - Whitewater, WI, USA
Department of Computer Science, University of Central Florida, Orlando, FL, USA
Informatics Institute, Istanbul Technical University, Turkey
Given a text T[1,n] over an alphabet Σ of size σ, the suffix array of T stores the lexicographic order of the suffixes of T. The suffix array needs Θ(n log n) bits of space compared to the n log σ bits needed to store T itself. A major breakthrough [FM-Index, FOCS '00] in the last two decades has been encoding the suffix array in a near-optimal number of bits (≈ log σ bits per character). One can decode a suffix array value using the FM-Index in log^{O(1)} n time.
We study an extension of the problem in which we have to also decode the suffix array values of the reverse text. This problem has numerous applications, such as in approximate pattern matching [Lam et al., BIBM '09]. Known approaches maintain the FM-Index of both the forward and the reverse text, which drives up the space occupancy to 2n log σ bits (plus lower order terms). This raises the natural question of whether we can decode the suffix array values of both the forward and the reverse text using only n log σ bits (plus lower order terms). We answer this question positively, and show that given the FM-Index of the forward text, we can decode the suffix array value of the reverse text in near-logarithmic average time. Additionally, our experimental results are competitive when compared to the standard approach of maintaining the FM-Index for both the forward and the reverse text. We believe that applications that require both the forward and reverse text will benefit from our approach.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.13/LIPIcs.CPM.2020.13.pdf
Data Structures
Suffix Trees
String Algorithms
Compression
Burrows-Wheeler transform
FM-Index
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
14:1
14:14
10.4230/LIPIcs.CPM.2020.14
article
On Indeterminate Strings Matching
Gawrychowski, Paweł
1
Ghazawi, Samah
2
Landau, Gad M.
2
3
Institute of Computer Science, University of Wrocław, Poland
Department of Computer Science, University of Haifa, Israel
Department of Computer Science and Engineering, NYU Tandon School of Engineering, Brooklyn, NY, USA
Given two indeterminate equal-length strings p and t with a set of characters per position in both strings, we obtain a determinate string p_w from p and a determinate string t_w from t by choosing one character per position. Then, we say that p and t match when p_w and t_w match for some choice of the characters. While the most standard notion of a match for determinate strings is that they are simply identical, in certain applications it is more appropriate to use other definitions, with the prime examples being parameterized matching, order-preserving matching, and the recently introduced Cartesian tree matching. We provide a systematic study of the complexity of string matching for indeterminate equal-length strings, for different notions of matching. We use n to denote the length of both strings, and r to denote an upper bound on the number of uncertain characters per position. First, we provide the first polynomial time algorithm for the Cartesian tree version that runs in deterministic 𝒪(n log² n) and expected 𝒪(n log n log log n) time using 𝒪(n log n) space, for constant r. Second, we establish NP-hardness of the order-preserving version for r=2, thus solving a question explicitly stated by Henriques et al. [CPM 2018], who showed hardness for r=3. Third, we establish NP-hardness of the parameterized version for r=2. As both parameterized and order-preserving indeterminate matching reduce to the standard determinate matching for r=1, this provides a complete classification for these three variants.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.14/LIPIcs.CPM.2020.14.pdf
string matching
indeterminate strings
Cartesian trees
order-preserving matching
parameterized matching
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
15:1
15:15
10.4230/LIPIcs.CPM.2020.15
article
The Streaming k-Mismatch Problem: Tradeoffs Between Space and Total Time
Golan, Shay
1
https://orcid.org/0000-0001-8357-2802
Kociumaka, Tomasz
1
https://orcid.org/0000-0002-2477-1702
Kopelowitz, Tsvi
1
https://orcid.org/0000-0002-3525-8314
Porat, Ely
1
https://orcid.org/0000-0001-6912-5766
Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
We revisit the k-mismatch problem in the streaming model on a pattern of length m and a streaming text of length n, both over a size-σ alphabet. The current state-of-the-art algorithm for the streaming k-mismatch problem, by Clifford et al. [SODA 2019], uses Õ(k) space and Õ(√k) worst-case time per character. The space complexity is known to be (unconditionally) optimal, and the worst-case time per character matches a conditional lower bound. However, there is a gap between the total time cost of the algorithm, which is Õ(n√k), and the fastest known offline algorithm, which costs Õ(n + min(nk/√m, σn)) time. Moreover, it is not known whether improvements over the Õ(n√k) total time are possible when using more than O(k) space.
We address these gaps by designing a randomized streaming algorithm for the k-mismatch problem that, given an integer parameter k≤s≤m, uses Õ(s) space and costs Õ(n+min(nk²/m, nk/√s, σnm/s)) total time. For s=m, the total runtime becomes Õ(n + min(nk/√m, σn)), which matches the time cost of the fastest offline algorithm. Moreover, the worst-case time cost per character is still Õ(√k).
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.15/LIPIcs.CPM.2020.15.pdf
Streaming pattern matching
Hamming distance
k-mismatch
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
16:1
16:15
10.4230/LIPIcs.CPM.2020.16
article
Approximating Longest Common Substring with k mismatches: Theory and Practice
Gourdel, Garance
1
Kociumaka, Tomasz
2
https://orcid.org/0000-0002-2477-1702
Radoszewski, Jakub
3
4
https://orcid.org/0000-0002-0067-6401
Starikovskaya, Tatiana
5
ENS Paris Saclay, France
Bar-Ilan University, Ramat Gan, Israel
Institute of Informatics, University of Warsaw, Poland
Samsung R&D Institute, Warsaw, Poland
DIENS, École normale supérieure, PSL Research University, France
In the problem of the longest common substring with k mismatches we are given two strings X, Y and must find the maximal length 𝓁 such that there is a length-𝓁 substring of X and a length-𝓁 substring of Y that differ in at most k positions. The length 𝓁 can be used as a robust measure of similarity between X and Y. In this work, we develop new approximation algorithms for computing 𝓁 that are significantly more efficient than previously known solutions from the theoretical point of view. Our approach is simple and practical, which we confirm via an experimental evaluation, and is probably close to optimal, as we demonstrate via a conditional lower bound.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.16/LIPIcs.CPM.2020.16.pdf
approximation algorithms
string similarity
LSH
conditional lower bounds
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
17:1
17:14
10.4230/LIPIcs.CPM.2020.17
article
String Factorizations Under Various Collision Constraints
Grüttemeier, Niels
1
https://orcid.org/0000-0002-6789-2918
Komusiewicz, Christian
1
https://orcid.org/0000-0003-0829-7032
Morawietz, Nils
1
Sommer, Frank
1
https://orcid.org/0000-0003-4034-525X
Philipps-Universität Marburg, Fachbereich Mathematik und Informatik, Germany
In the NP-hard Equality-Free String Factorization problem, we are given a string S and ask whether S can be partitioned into k factors that are pairwise distinct. We describe a randomized algorithm for Equality-Free String Factorization with running time 2^k⋅ k^{𝒪(1)}+𝒪(n), improving over previous algorithms with running time k^{𝒪(k)}+𝒪(n) [Schmid, TCS 2016; Mincu and Popa, Proc. SOFSEM 2020]. Our algorithm works for the generalization of Equality-Free String Factorization where equality can be replaced by an arbitrary polynomial-time computable equivalence relation on strings. We also consider two factorization problems to which this algorithm does not apply, namely Prefix-Free String Factorization, where we ask for a factorization of size k such that no factor is a prefix of another factor, and Substring-Free String Factorization, where we ask for a factorization of size k such that no factor is a substring of another factor. We show that these two problems are NP-hard as well. Then, we show that Prefix-Free String Factorization with the prefix-free relation is fixed-parameter tractable with respect to k by providing a polynomial problem kernel. Finally, we show a generic ILP formulation for R-Free String Factorization where R is an arbitrary relation on strings. This formulation improves over a previous one for Equality-Free String Factorization in terms of the number of variables.
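The definition can be illustrated by exhaustive search. This Python sketch (exponential time, usable only on tiny inputs, and unrelated to the 2^k ⋅ k^{𝒪(1)} + 𝒪(n) algorithm above) decides whether an equality-free factorization into exactly k factors exists:

```python
def equality_free_factorization(S, k):
    """Brute force: can S be partitioned into exactly k pairwise
    distinct factors? Exponential time -- a reference for tiny inputs."""
    def rec(i, used):
        if len(used) == k:
            return i == len(S)  # all k factors chosen; S must be consumed
        for j in range(i + 1, len(S) + 1):
            factor = S[i:j]
            if factor not in used and rec(j, used | {factor}):
                return True
        return False
    return rec(0, frozenset())
```

For instance, "aaa" admits the factorization "a" · "aa" for k = 2, while "aa" does not.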
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.17/LIPIcs.CPM.2020.17.pdf
NP-hard problem
fixed-parameter algorithms
collision-aware string partitioning
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
18:1
18:15
10.4230/LIPIcs.CPM.2020.18
article
k-Approximate Quasiperiodicity under Hamming and Edit Distance
Kędzierski, Aleksander
1
2
https://orcid.org/0000-0002-0335-2963
Radoszewski, Jakub
1
2
https://orcid.org/0000-0002-0067-6401
Institute of Informatics, University of Warsaw, Poland
Samsung R&D Institute, Warsaw, Poland
Quasiperiodicity in strings was introduced almost 30 years ago as an extension of string periodicity. The basic notions of quasiperiodicity are cover and seed. A cover of a text T is a string whose occurrences in T cover all positions of T. A seed of a text T is a cover of a superstring of T. In various applications, exact quasiperiodicity is not sufficient due to the presence of errors. We consider approximate notions of quasiperiodicity, in which we allow approximate occurrences in T with a small Hamming, Levenshtein or weighted edit distance.
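As an illustration of the exact notion (before errors are allowed): a string C is a cover of T precisely when the occurrences of C cover every position of T. A minimal Python check, for exact matching only:

```python
def is_cover(C, T):
    """Exact-cover check: every position of T must lie inside
    at least one occurrence of C in T."""
    m, n = len(C), len(T)
    covered = [False] * n
    for i in range(n - m + 1):
        if T[i:i + m] == C:            # occurrence of C at position i
            for j in range(i, i + m):
                covered[j] = True
    return all(covered)
```

For example, "aba" covers "ababa" (occurrences at positions 0 and 2), whereas "ab" does not (position 4 stays uncovered).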
In previous work, Sim et al. (2002) and Christodoulakis et al. (2005) showed that computing approximate covers and seeds, respectively, under weighted edit distance is NP-hard. They therefore considered restricted approximate covers and seeds, which need to be factors of the original string T, and presented polynomial-time algorithms for computing them. Further algorithms, considering approximate occurrences with Hamming distance bounded by k, were given in several contributions by Guth et al. They also studied relaxed approximate quasiperiods that do not need to cover all positions of T.
In the case of large data, the exponents in the polynomial time complexity play a crucial role. We present more efficient algorithms for computing restricted approximate covers and seeds. In particular, we improve upon the complexities of many of the aforementioned algorithms, also for relaxed quasiperiods. Our solutions are especially efficient if the number (or total cost) of allowed errors is bounded. We also show NP-hardness of computing non-restricted approximate covers and seeds under Hamming distance.
Approximate covers were studied in three recent contributions at CPM over the last three years. However, these works consider a different definition of an approximate cover of T, that is, the shortest exact cover of a string T' with the smallest Hamming distance from T.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.18/LIPIcs.CPM.2020.18.pdf
approximate cover
approximate seed
enhanced cover
Hamming distance
edit distance
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
19:1
19:15
10.4230/LIPIcs.CPM.2020.19
article
Longest Common Subsequence on Weighted Sequences
Kipouridis, Evangelos
1
https://orcid.org/0000-0002-5830-5830
Tsichlas, Kostas
2
Basic Algorithms Research Copenhagen (BARC), University of Copenhagen, Denmark
Computer Engineering and Informatics Department, University of Patras, Greece
We consider the general problem of the Longest Common Subsequence (LCS) on weighted sequences. Weighted sequences are an extension of classical strings, where in each position every letter of the alphabet may occur with some probability. Previous results presented a PTAS and noticed that no FPTAS is possible unless P=NP. In this paper we essentially close the gap between upper and lower bounds by improving both. First of all, we provide an EPTAS for bounded alphabets (which is the most natural case), and prove that there does not exist any EPTAS for unbounded alphabets unless FPT=W[1]. Furthermore, under the Exponential Time Hypothesis, we provide a lower bound which shows that no significantly better PTAS can exist for unbounded alphabets. As a side note, we prove that it is sufficient to work with only one threshold in the general variant of the problem.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.19/LIPIcs.CPM.2020.19.pdf
WLCS
LCS
weighted sequences
approximation algorithms
lower bound
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
20:1
20:14
10.4230/LIPIcs.CPM.2020.20
article
Parameterized Algorithms for Matrix Completion with Radius Constraints
Koana, Tomohiro
1
https://orcid.org/0000-0002-8684-0611
Froese, Vincent
1
Niedermeier, Rolf
1
https://orcid.org/0000-0003-1703-1236
Technische Universität Berlin, Faculty IV, Algorithmics and Computational Complexity, Germany
Considering matrices with missing entries, we study NP-hard matrix completion problems where the resulting completed matrix should have limited (local) radius. In the pure radius version, this means that the goal is to fill in the entries such that there exists a "center string" which has Hamming distance to all matrix rows as small as possible. In stringology, this problem is also known as Closest String with Wildcards. In the local radius version, the requested center string must be one of the rows of the completed matrix.
Hermelin and Rozenberg [CPM 2014, TCS 2016] performed a parameterized complexity analysis for Closest String with Wildcards. We answer one of their open questions, fix a bug concerning a fixed-parameter tractability result in their work, and improve some running time upper bounds. For the local radius case, we reveal a computational complexity dichotomy. In general, our results indicate that, although being NP-hard as well, this variant often allows for faster (fixed-parameter) algorithms.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.20/LIPIcs.CPM.2020.20.pdf
fixed-parameter tractability
consensus string problems
Closest String
Closest String with Wildcards
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
21:1
21:15
10.4230/LIPIcs.CPM.2020.21
article
In-Place Bijective Burrows-Wheeler Transforms
Köppl, Dominik
1
2
https://orcid.org/0000-0002-8721-4444
Hashimoto, Daiki
3
Hendrian, Diptarama
3
https://orcid.org/0000-0002-8168-7312
Shinohara, Ayumi
3
https://orcid.org/0000-0002-4978-8316
Department of Informatics, Kyushu University, Fukuoka, Japan
Japan Society for Promotion of Science (JSPS), Tokyo, Japan
Graduate School of Information Sciences, Tohoku University, Sendai, Japan
One of the most well-known variants of the Burrows-Wheeler transform (BWT) [Burrows and Wheeler, 1994] is the bijective BWT (BBWT) [Gil and Scott, arXiv 2012], which applies the extended BWT (EBWT) [Mantaci et al., TCS 2007] to the multiset of Lyndon factors of a given text. Since the EBWT is invertible, the BBWT is a bijective transform in the sense that the inverse image of the EBWT restores this multiset of Lyndon factors such that the original text can be obtained by sorting these factors in non-increasing order.
In this paper, we present algorithms that construct or invert the BBWT in-place using quadratic time. We also present conversions from the BBWT to the BWT, or vice versa, either (a) in-place using quadratic time, or (b) in the run-length compressed setting using 𝒪(n lg r / lg lg r) time with 𝒪(r lg n) bits of space, where r is the sum of character runs in the BWT and the BBWT.
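The BBWT above is built from the multiset of Lyndon factors of the text. These factors can be computed in linear time with Duval's algorithm; the sketch below is standard background, not the in-place construction of the paper:

```python
def lyndon_factors(s):
    """Duval's algorithm: factorise s into a non-increasing sequence
    of Lyndon words in O(|s|) time."""
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:                 # emit copies of the found Lyndon word
            factors.append(s[i:i + j - k])
            i += j - k
    return factors
```

For example, "banana" factorises as "b" ≥ "an" ≥ "an" ≥ "a"; concatenating the factors restores the input.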
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.21/LIPIcs.CPM.2020.21.pdf
In-Place Algorithms
Burrows-Wheeler transform
Lyndon words
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
22:1
22:15
10.4230/LIPIcs.CPM.2020.22
article
Genomic Problems Involving Copy Number Profiles: Complexity and Algorithms
Lafond, Manuel
1
Zhu, Binhai
2
Zou, Peng
2
Department of Computer Science, Université de Sherbrooke, Québec J1K 2R1, Canada
Gianforte School of Computing, Montana State University, Bozeman, MT 59717, USA
Recently, due to genomic sequence analysis in several types of cancer, genomic data based on copy number profiles (CNPs for short) have become more and more popular. A CNP is a vector in which each component is a non-negative integer representing the number of copies of a specific segment of interest. The motivation is that in the late stage of certain types of cancer, the genomes progress rapidly through segmental duplications and deletions, and hence obtaining the exact sequences becomes difficult. Instead, the number of copies of important segments can be predicted from expression analysis and carries important biological information. Therefore, significant research has recently been devoted to the analysis of genomic data represented as CNPs.
In this paper, we present two streams of results. The first is a set of negative results on two open problems regarding the computational complexity of the Minimum Copy Number Generation (MCNG) problem posed by Qingge et al. in 2018. The MCNG problem is defined as follows: given a string S, in which each character represents a gene or segment, and a CNP C, compute a string T from S, with the minimum number of segmental duplications and deletions, such that cnp(T)=C. It was shown by Qingge et al. that the problem is NP-hard if the duplications are tandem, and they left open the question of whether the problem remains NP-hard if arbitrary duplications and/or deletions are used. We answer this question affirmatively; in fact, we prove that it is NP-hard to even obtain a constant-factor approximation. This is achieved through a general-purpose lemma on set-cover reductions that require an exact cover in one direction, but not the other, which may be of independent interest. We also prove that the corresponding parameterized version is W[1]-hard, answering another open question by Qingge et al.
The other result is positive and is based on a new (and more general) problem regarding CNPs. The Copy Number Profile Conforming (CNPC) problem is formally defined as follows: given two CNPs C₁ and C₂, compute two strings S₁ and S₂ with cnp(S₁)=C₁ and cnp(S₂)=C₂ such that the distance between S₁ and S₂, d(S₁,S₂), is minimized. Here, d(S₁,S₂) is a very general term: it can be any genome rearrangement distance (reversal, transposition, tandem duplication, etc.). We take the first step by showing that if d(S₁,S₂) is measured by the breakpoint distance, then the problem is polynomially solvable. We expect this to trigger related research along this line in the near future.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.22/LIPIcs.CPM.2020.22.pdf
Computational genomics
cancer genomics
copy number profiles
NP-hardness
approximation algorithms
FPT algorithms
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
23:1
23:13
10.4230/LIPIcs.CPM.2020.23
article
Compressed Orthogonal Search on Suffix Arrays with Applications to Range LCP
Matsuda, Kotaro
1
Sadakane, Kunihiko
1
https://orcid.org/0000-0002-8212-3682
Starikovskaya, Tatiana
2
Tateshita, Masakazu
1
Graduate School of Information Science and Technology, The University of Tokyo, Japan
DIENS, École normale supérieure, PSL Research University, Paris, France
We propose a space-efficient data structure for orthogonal range search on suffix arrays. For the general two-dimensional orthogonal range search problem on a set of n points, there exists an n log n (1+o(1))-bit data structure supporting O(log n)-time counting queries [Mäkinen, Navarro 2007]. The space matches the information-theoretic lower bound. However, if we focus on a point set representing a suffix array, there is a chance to obtain a more space-efficient data structure. We show that this is indeed the case. Namely, we propose a data structure for orthogonal range search on suffix arrays which uses O((1/ε) n (H₀+1)) bits, where H₀ is the order-0 entropy of the string, and answers a counting query in O(n^ε) time for any constant ε>0. As an application, we give an O((1/ε) n (H₀+1))-bit data structure for the range LCP problem.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.23/LIPIcs.CPM.2020.23.pdf
Orthogonal Range Search
Succinct Data Structure
Suffix Array
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
24:1
24:15
10.4230/LIPIcs.CPM.2020.24
article
Text Indexing and Searching in Sublinear Time
Munro, J. Ian
1
Navarro, Gonzalo
2
Nekrich, Yakov
3
Cheriton School of Computer Science, University of Waterloo, Canada
CeBiB - Center of Biotechnology and Bioengineering, Department of Computer Science, University of Chile, Santiago, Chile
Department of Computer Science, Michigan Technological University, Houghton, MI, USA
We introduce the first index that can be built in o(n) time for a text of length n, and can also be queried in o(q) time for a pattern of length q. On an alphabet of size σ, our index uses O(n log σ) bits, is built in O(n log σ / √{log n}) deterministic time, and computes the number of occurrences of the pattern in time O(q/log_σ n + log n log_σ n). Each such occurrence can then be found in O(log n) time. Other trade-offs between the space usage and the cost of reporting occurrences are also possible.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.24/LIPIcs.CPM.2020.24.pdf
data structures
string indexes
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
25:1
25:12
10.4230/LIPIcs.CPM.2020.25
article
Chaining with Overlaps Revisited
Mäkinen, Veli
1
https://orcid.org/0000-0003-4454-1493
Sahlin, Kristoffer
2
https://orcid.org/0000-0001-7378-2320
Department of Computer Science, University of Helsinki, Finland
Department of Mathematics, Science for Life Laboratory, Stockholm University, Sweden
Chaining algorithms aim to form a semi-global alignment of two sequences based on a set of anchoring local alignments as input. Depending on the optimization criteria and the exact definition of a chain, there are several O(n log n) time algorithms to solve this problem optimally, where n is the number of input anchors.
In this paper, we focus on a formulation allowing the anchors to overlap in a chain. This formulation was studied by Shibuya and Kurochkin (WABI 2003), but their algorithm comes with no proof of correctness. We revisit and modify their algorithm under a strict definition of the precedence relation on anchors, adding the derivation required to establish the correctness of the resulting algorithm, which runs in O(n log² n) time on anchors formed by exact matches. With the more relaxed definition of the precedence relation considered by Shibuya and Kurochkin, or when anchors are non-nested, such as matches of uniform length (k-mers), the algorithm takes O(n log n) time.
We also establish a connection between chaining with overlaps and the widely studied longest common subsequence problem.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.25/LIPIcs.CPM.2020.25.pdf
Sparse Dynamic Programming
Chaining
Maximal Exact Matches
Longest Common Subsequence
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
26:1
26:14
10.4230/LIPIcs.CPM.2020.26
article
DAWGs for Parameterized Matching: Online Construction and Related Indexing Structures
Nakashima, Katsuhito
1
Fujisato, Noriki
2
Hendrian, Diptarama
1
https://orcid.org/0000-0002-8168-7312
Nakashima, Yuto
2
https://orcid.org/0000-0001-6269-9353
Yoshinaka, Ryo
1
https://orcid.org/0000-0002-5175-465X
Inenaga, Shunsuke
2
3
https://orcid.org/0000-0002-1833-010X
Bannai, Hideo
4
https://orcid.org/0000-0002-6856-5185
Shinohara, Ayumi
1
https://orcid.org/0000-0002-4978-8316
Takeda, Masayuki
2
https://orcid.org/0000-0002-6138-1607
Graduate School of Information Sciences, Tohoku University, Sendai, Japan
Department of Informatics, Kyushu University, Fukuoka, Japan
PRESTO, Japan Science and Technology Agency, Kawaguchi, Japan
M&D Data Science Center, Tokyo Medical and Dental University, Tokyo, Japan
Two strings x and y over Σ ∪ Π of equal length are said to parameterized match (p-match) if there is a renaming bijection f:Σ ∪ Π → Σ ∪ Π that is the identity on Σ and transforms x to y (or vice versa). The p-matching problem is to look for substrings in a text that p-match a given pattern. In this paper, we propose parameterized suffix automata (p-suffix automata) and parameterized directed acyclic word graphs (PDAWGs), which are the p-matching versions of suffix automata and DAWGs. While suffix automata and DAWGs are equivalent for standard strings, we show that p-suffix automata can have Θ(n²) nodes and edges but PDAWGs have only O(n) nodes and edges, where n is the length of an input string. We also give an O(n |Π| log (|Π| + |Σ|))-time, O(n)-space algorithm that builds the PDAWG in a left-to-right online manner. As a byproduct, we show that the parameterized suffix tree for the reversed string can also be built within the same time and space bounds, in a right-to-left online manner.
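The p-match relation can be tested position-wise by maintaining a bijection on the parameter symbols. The following Python sketch checks the definition directly (it is not the PDAWG machinery of the paper; representing Σ as a set argument is our choice):

```python
def p_match(x, y, sigma):
    """Do x and y parameterized-match? `sigma` is the set of static
    symbols (Σ); all other symbols are parameters (Π) and must be
    related by one consistent bijection."""
    if len(x) != len(y):
        return False
    fwd, bwd = {}, {}
    for a, b in zip(x, y):
        if (a in sigma) != (b in sigma):   # static vs parameter clash
            return False
        if a in sigma:
            if a != b:                     # static symbols match literally
                return False
        elif fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
            return False                   # bijection violated
    return True
```

With Σ = {"a"}, the strings "xaxy" and "zazw" p-match (x↦z, y↦w), while "xay" and "xax" do not.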
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.26/LIPIcs.CPM.2020.26.pdf
parameterized matching
suffix trees
DAWGs
suffix automata
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
27:1
27:13
10.4230/LIPIcs.CPM.2020.27
article
On Extensions of Maximal Repeats in Compressed Strings
Pape-Lange, Julian
1
https://orcid.org/0000-0001-6621-8369
Technische Universität Chemnitz, Straße der Nationen 62, 09111 Chemnitz, Germany
This paper provides upper bounds for several subsets of maximal repeats and maximal pairs in compressed strings and also presents a formerly unknown relationship between maximal pairs and the run-length Burrows-Wheeler transform.
This relationship is used to obtain a different proof for the Burrows-Wheeler conjecture which has recently been proven by Kempa and Kociumaka in "Resolution of the Burrows-Wheeler Transform Conjecture".
More formally, this paper proves that the run-length Burrows-Wheeler transform of a string S with z_S LZ77-factors has at most 73(log₂ |S|)(z_S+2)² runs, and if S does not contain q-th powers, the number of arcs in the compacted directed acyclic word graph of S is bounded from above by 18q(1+log_q |S|)(z_S+2)².
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.27/LIPIcs.CPM.2020.27.pdf
Maximal repeats
Extensions of maximal repeats
Combinatorics on compressed strings
LZ77
Burrows-Wheeler transform
Burrows-Wheeler transform conjecture
Compact suffix automata
CDAWGs
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
28:1
28:13
10.4230/LIPIcs.CPM.2020.28
article
Faster Binary Mean Computation Under Dynamic Time Warping
Schaar, Nathan
1
Froese, Vincent
1
Niedermeier, Rolf
1
Technische Universität Berlin, Faculty IV, Algorithmics and Computational Complexity, Germany
Many consensus string problems are based on Hamming distance. We replace Hamming distance by the more flexible (e.g., easily coping with different input string lengths) dynamic time warping distance, best known from applications in time series mining. Doing so, we study the problem of finding a mean string that minimizes the sum of (squared) dynamic time warping distances to a given set of input strings. While this problem is known to be NP-hard (even for strings over a three-element alphabet), we address the binary alphabet case which is known to be polynomial-time solvable. We significantly improve on a previously known algorithm in terms of worst-case running time. Moreover, we also show the practical usefulness of one of our algorithms in experiments with real-world and synthetic data. Finally, we identify special cases solvable in linear time (e.g., finding a mean of only two binary input strings) and report some empirical findings concerning combinatorial properties of optimal means.
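The dynamic time warping distance minimized here admits the classic quadratic-time dynamic program. The sketch below uses the absolute-difference local cost on numeric (e.g. binary) sequences and is background only, not the improved algorithm of the paper:

```python
import math

def dtw(s, t):
    """Classic O(|s|*|t|) dynamic program for the dynamic time warping
    distance with absolute-difference local cost."""
    n, m = len(s), len(t)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = abs(s[i - 1] - t[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Note how DTW copes with different input lengths: dtw([0, 1, 1, 0], [0, 1, 0]) is 0, since the two 1-runs can be warped onto each other.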
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.28/LIPIcs.CPM.2020.28.pdf
consensus string problems
time series averaging
minimum 1-separated sum
sparse strings
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2020-06-09
161
29:1
29:11
10.4230/LIPIcs.CPM.2020.29
article
Approximating Text-To-Pattern Distance via Dimensionality Reduction
Uznański, Przemysław
1
https://orcid.org/0000-0002-8652-0490
Institute of Computer Science, University of Wrocław, Poland
Text-to-pattern distance is a fundamental problem in string matching, where given a pattern of length m and a text of length n, over an integer alphabet, we are asked to compute the distance between the pattern and the text at every location. The distance function can be, e.g., Hamming distance or 𝓁_p distance for some parameter p > 0. Almost all state-of-the-art exact and approximate algorithms developed in the past ∼40 years have used FFT as a black box. In this work we present 𝒪~(n/ε²)-time algorithms for (1±ε)-approximation of 𝓁₂ distances, and an 𝒪~(n/ε³)-time algorithm for approximation of Hamming and 𝓁₁ distances, all without the use of FFT. This is independent of the very recent development by Chan et al. [STOC 2020], where an 𝒪(n/ε²)-time algorithm for Hamming distances not using FFT was presented; although their algorithm is much more "combinatorial", our techniques apply to norms other than Hamming.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol161-cpm2020/LIPIcs.CPM.2020.29/LIPIcs.CPM.2020.29.pdf
Approximate Pattern Matching
𝓁₂ Distance
𝓁₁ Distance
Hamming Distance
Approximation Algorithms
Combinatorial Algorithms