eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
1
404
10.4230/LIPIcs.CPM.2021
article
LIPIcs, Volume 191, CPM 2021, Complete Volume
Gawrychowski, Paweł
1
https://orcid.org/0000-0002-6993-5440
Starikovskaya, Tatiana
2
University of Wrocław, Poland
École normale supérieure, France
LIPIcs, Volume 191, CPM 2021, Complete Volume
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021/LIPIcs.CPM.2021.pdf
LIPIcs, Volume 191, CPM 2021, Complete Volume
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
0:i
0:xiv
10.4230/LIPIcs.CPM.2021.0
article
Front Matter, Table of Contents, Preface, Conference Organization
Gawrychowski, Paweł
1
https://orcid.org/0000-0002-6993-5440
Starikovskaya, Tatiana
2
University of Wrocław, Poland
École normale supérieure, France
Front Matter, Table of Contents, Preface, Conference Organization
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.0/LIPIcs.CPM.2021.0.pdf
Front Matter
Table of Contents
Preface
Conference Organization
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
1:1
1:1
10.4230/LIPIcs.CPM.2021.1
article
Repetitions in Strings: A "Constant" Problem (Invited Talk)
Bannai, Hideo
1
https://orcid.org/0000-0002-6856-5185
M&D Data Science Center, Tokyo Medical and Dental University, Japan
Repeating structures are among the most fundamental characteristics of strings, and have been an important topic in the fields of combinatorics on words and combinatorial pattern matching since their beginnings. In this talk, I will focus on squares and maximal repetitions and review the "runs" theorem [Hideo Bannai et al., 2017] as well as related results (e.g. [Aviezri S. Fraenkel and Jamie Simpson, 1998; Yuta Fujishige et al., 2017; Ryo Sugahara et al., 2019; Philip Bille et al., 2020; Hideo Bannai et al., 2020; Jonas Ellert and Johannes Fischer, 2021]) which address the two main questions: how many of them can be contained in a string of given length, and how they can be computed.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.1/LIPIcs.CPM.2021.1.pdf
Maximal repetitions
Squares
Lyndon words
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
2:1
2:1
10.4230/LIPIcs.CPM.2021.2
article
Computing Edit Distance (Invited Talk)
Koucký, Michal
1
https://orcid.org/0000-0003-0808-2269
Computer Science Institute of Charles University, Prague, Czech Republic
The edit distance (or Levenshtein distance) between two strings x, y is the minimum number of character insertions, deletions, and substitutions needed to convert x into y. It has numerous applications in various fields from text processing to bioinformatics, so algorithms for edit distance computation attract a lot of attention. In this talk I will survey recent progress on computational aspects of edit distance in several contexts: computing edit distance approximately, sketching and computing it in the streaming model, exchanging strings in the communication complexity model, and building error-correcting codes for edit distance. I will point out many problems that are still open in those areas.
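As an illustration of the definition above, the following is a minimal sketch of the classical quadratic-time dynamic program for exact edit distance. It is the textbook baseline, not one of the approximate, sketching, or streaming algorithms surveyed in the talk; the function name `edit_distance` is chosen here for illustration only.

```python
def edit_distance(x: str, y: str) -> int:
    """Classical O(|x| * |y|) dynamic program for Levenshtein distance."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3 (two substitutions and one insertion).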
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.2/LIPIcs.CPM.2021.2.pdf
edit distance
streaming algorithms
approximation algorithms
sketching
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
3:1
3:2
10.4230/LIPIcs.CPM.2021.3
article
On-Line Pattern Matching on D-Texts (Invited Talk)
Pisanti, Nadia
1
https://orcid.org/0000-0003-3915-7665
University of Pisa, Italy
The Elastic Degenerate String Matching (EDSM) problem is defined as that of finding an occurrence of a pattern P of length m in an ED-text T. A D-text (Degenerate text) is a string that actually represents a set of similar and aligned strings (e.g. a pan-genome [The Computational Pan-Genomics Consortium, 2018]) by collapsing common fragments into a standard string, and representing variants with sets of alternative substrings. When such substrings are not bound to have the same size, then we talk about elastic D-strings (ED-strings). In [R.Grossi et al., 2017] we gave an O(nm²+N) time on-line algorithm for EDSM, where n is the length of T and N is its size, defined as the total number of letters. A fundamental tool of our algorithm is the O(m²+N) time solution of what was later called the Active Prefixes problem (AP). In [K.Aoyama et al., 2018], an O(m^{1.5} √{log m}+N) solution for AP was shown, leading to an O(nm^{1.5} √{log m}+N) time solution for EDSM. The natural open problem was thus whether the 1.5 exponent could be decreased further. In [G.Bernardini et al., 2019], we prove several properties that answer this and other questions: we give a conditional O(nm^{1.5}+N) lower bound for EDSM, proving that a combinatorial algorithm solving EDSM in O(nm^{1.5-ε} +N) time would break the Boolean Matrix Multiplication (BMM) conjecture; we use this result as a hint to devise a non-combinatorial algorithm that solves EDSM in O(nm^{1.381}+N) time; we do so by successfully combining Fast Fourier Transform and properties of string periodicity. In my talk I will give an overview of the results above, as well as some interesting side results: the extension to a dictionary rather than a single pattern [S.P.Pissis and A.Retha, 2018], the introduction of errors [G.Bernardini et al., 2020], and a notion of matching among D-strings with its linear time solution [M.Alzamel et al., 2020].
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.3/LIPIcs.CPM.2021.3.pdf
pattern matching
elastic-degenerate string
matrix multiplication
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
4:1
4:17
10.4230/LIPIcs.CPM.2021.4
article
Ranking Bracelets in Polynomial Time
Adamson, Duncan
1
Gusev, Vladimir V.
1
Potapov, Igor
2
Deligkas, Argyrios
3
Leverhulme Research Centre for Functional Materials Design, Department of Computer Science, University of Liverpool, UK
Department of Computer Science, University of Liverpool, UK
Department of Computer Science, Royal Holloway University of London, UK
The main result of the paper is the first polynomial-time algorithm for ranking bracelets. The time-complexity of the algorithm is O(k²⋅ n⁴), where k is the size of the alphabet and n is the length of the considered bracelets. The key part of the algorithm is to compute the rank of any word with respect to the set of bracelets by finding three other ranks: the rank over all necklaces, the rank over palindromic necklaces, and the rank over enclosing apalindromic necklaces. The last two concepts are introduced in this paper. These ranks are key components to our algorithm in order to decompose the problem into parts. Additionally, this ranking procedure is used to build a polynomial-time unranking algorithm.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.4/LIPIcs.CPM.2021.4.pdf
Bracelets
Ranking
Necklaces
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
5:1
5:20
10.4230/LIPIcs.CPM.2021.5
article
The k-Mappability Problem Revisited
Amir, Amihood
1
2
Boneh, Itai
1
Kondratovsky, Eitan
3
4
Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
Georgia Tech, Atlanta, GA, USA
Department of Computer Science, Bar Ilan University, Ramat Gan, Israel
Cheriton School of Computer Science, Waterloo University, Waterloo, Canada
The k-mappability problem has two integer parameters m and k. For every subword of size m in a text S, we wish to report the number of indices in S in which the word occurs with at most k mismatches.
The problem was recently tackled by Alzamel et al. [Mai Alzamel et al., 2018]. For a text with constant alphabet Σ and k ∈ O(1), they present an algorithm with linear space and O(nlog^{k+1}n) time. For the case in which k = 1 and a constant size alphabet, a faster algorithm with linear space and O(nlog(n)log log(n)) time was presented in [Mai Alzamel et al., 2020].
In this work, we enhance the techniques of [Mai Alzamel et al., 2020] to obtain an algorithm with linear space and O(n log(n)) time for k = 1. Our algorithm removes the constraint of the alphabet being of constant size. We also present linear algorithms for the case of k = 1, |Σ| ∈ O(1) and m = Ω(√n).
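To make the problem statement concrete, here is a naive reference sketch that evaluates the definition directly in O(n²m) time. It is not the O(n log n) algorithm of the paper. One assumption is made explicit: the occurrence of a subword at its own starting position is not counted, which is one common reading of the definition. The name `k_mappability` is hypothetical.

```python
def k_mappability(s: str, m: int, k: int) -> list[int]:
    """Naive k-mappability: for every length-m subword s[i:i+m], count the
    other starting positions j != i (an assumption of this sketch) where
    s[j:j+m] is within Hamming distance k of it."""
    starts = range(len(s) - m + 1)
    result = []
    for i in starts:
        count = 0
        for j in starts:
            if j == i:
                continue  # assumption: the trivial self-match is excluded
            mismatches = sum(a != b for a, b in zip(s[i:i + m], s[j:j + m]))
            if mismatches <= k:
                count += 1
        result.append(count)
    return result
```

On the text `"aabaa"` with m = 2 and k = 1, this sketch returns `[3, 2, 2, 3]`.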
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.5/LIPIcs.CPM.2021.5.pdf
Pattern Matching
Hamming Distance
Suffix Tree
Suffix Array
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
6:1
6:18
10.4230/LIPIcs.CPM.2021.6
article
Internal Shortest Absent Word Queries
Badkobeh, Golnaz
1
https://orcid.org/0000-0001-5550-7149
Charalampopoulos, Panagiotis
2
https://orcid.org/0000-0002-6024-1557
Pissis, Solon P.
3
4
https://orcid.org/0000-0002-1445-1932
Department of Computing, Goldsmiths University of London, UK
Efi Arazi School of Computer Science, The Interdisciplinary Center Herzliya, Israel
CWI, Amsterdam, The Netherlands
Vrije Universiteit, Amsterdam, The Netherlands
Given a string T of length n over an alphabet Σ ⊂ {1,2,…,n^{𝒪(1)}} of size σ, we are to preprocess T so that given a range [i,j], we can return a representation of a shortest string over Σ that is absent in the fragment T[i]⋯ T[j] of T. For any positive integer k ∈ [1,log log_σ n], we present an 𝒪((n/k)⋅ log log_σ n)-size data structure, which can be constructed in 𝒪(nlog_σ n) time, and answers queries in time 𝒪(log log_σ k).
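The query itself can be illustrated with a brute-force sketch that inspects the fragment directly, with no preprocessing. This is only a specification of what a query returns, not the data structure described above. Indices are 0-based and inclusive, and ties are broken by returning the lexicographically smallest shortest absent word; both conventions, and the name `shortest_absent_word`, are assumptions of this sketch.

```python
from itertools import product


def shortest_absent_word(t: str, i: int, j: int, alphabet: str) -> str:
    """Brute force: a shortest word over `alphabet` that does not occur in
    the fragment t[i..j] (0-based, inclusive)."""
    fragment = t[i:j + 1]
    letters = sorted(set(alphabet))
    length = 1
    while True:
        # All distinct substrings of the fragment of the current length.
        present = {fragment[p:p + length]
                   for p in range(len(fragment) - length + 1)}
        if len(present) < len(letters) ** length:
            # Some word of this length is missing; return the smallest one.
            for candidate in product(letters, repeat=length):
                word = "".join(candidate)
                if word not in present:
                    return word
        length += 1
```

The loop terminates because a fragment of length ℓ contains at most ℓ distinct substrings of any fixed length, while the number of candidate words does not shrink as the length grows.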
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.6/LIPIcs.CPM.2021.6.pdf
string algorithms
internal queries
shortest absent word
bit parallelism
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
7:1
7:16
10.4230/LIPIcs.CPM.2021.7
article
Constructing the Bijective and the Extended Burrows-Wheeler Transform in Linear Time
Bannai, Hideo
1
https://orcid.org/0000-0002-6856-5185
Kärkkäinen, Juha
2
Köppl, Dominik
1
https://orcid.org/0000-0002-8721-4444
Piątkowski, Marcin
3
https://orcid.org/0000-0001-5636-9497
M&D Data Science Center, Tokyo Medical and Dental University, Tokyo, Japan
Helsinki Institute of Information Technology (HIIT), Finland
Nicolaus Copernicus University, Toruń, Poland
The Burrows-Wheeler transform (BWT) is a permutation whose applications are prevalent in data compression and text indexing. The bijective BWT (BBWT) is a bijective variant of it. Although it is known that the BWT can be constructed in linear time for integer alphabets by using a linear time suffix array construction algorithm, it was up to now only conjectured that the BBWT can also be constructed in linear time. We confirm this conjecture in the word RAM model by proposing a construction algorithm that is based on SAIS, improving the best known result of O(n lg n / lg lg n) time to linear. Since we can reduce the problem of constructing the extended BWT to constructing the BBWT in linear time, we obtain a linear-time algorithm computing the extended BWT at the same time.
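For readers unfamiliar with the BBWT, the sketch below computes it naively from its usual definition: factorize the text into Lyndon words (Duval's algorithm), collect all rotations of all factors, sort them by comparing their infinite repetitions, and output the last character of each. It relies on comparison sorting and is far from linear time; it is not the SAIS-based construction of the paper. The function names are illustrative.

```python
from functools import cmp_to_key


def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm: factor s into a non-increasing sequence of Lyndon words."""
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def omega_compare(u: str, v: str) -> int:
    """Compare the infinite words u^inf and v^inf; the prefixes of length
    |u| * |v| are enough to decide the order."""
    a, b = u * len(v), v * len(u)
    return (a > b) - (a < b)


def bbwt(s: str) -> str:
    """Naive bijective BWT: last characters of the sorted rotations of all
    Lyndon factors of s."""
    rotations = [f[r:] + f[:r]
                 for f in lyndon_factorization(s)
                 for r in range(len(f))]
    rotations.sort(key=cmp_to_key(omega_compare))
    return "".join(rot[-1] for rot in rotations)
```

For instance, `"banana"` factors into `b`, `an`, `an`, `a`, and `bbwt("banana")` yields `"annbaa"`.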
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.7/LIPIcs.CPM.2021.7.pdf
Burrows-Wheeler Transform
Lyndon words
Circular Suffix Array
Suffix Array Construction Algorithm
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
8:1
8:15
10.4230/LIPIcs.CPM.2021.8
article
Weighted Ancestors in Suffix Trees Revisited
Belazzougui, Djamal
1
Kosolobov, Dmitry
2
https://orcid.org/0000-0002-2909-2952
Puglisi, Simon J.
3
https://orcid.org/0000-0001-7668-7636
Raman, Rajeev
4
https://orcid.org/0000-0001-9942-8290
Centre de Recherche sur l'Information Scientifique et Technique, Algiers, Algeria
Institute of Natural Sciences and Mathematics, Ural Federal University, Ekaterinburg, Russia
Department of Computer Science, University of Helsinki, Helsinki, Finland
Department of Informatics, University of Leicester, Leicester, United Kingdom
The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require Ω(log log n) time for queries provided 𝒪(n polylog n) space is available and weights are from [0..n], where n is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an 𝒪(n)-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string s, we need a data structure that can locate in the tree any substring s[p..q] of s in 𝒪(1) time (as if one descended from the root reading s[p..q] along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.8/LIPIcs.CPM.2021.8.pdf
suffix tree
weighted ancestors
irreducible LCP
deterministic substring hashing
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
9:1
9:18
10.4230/LIPIcs.CPM.2021.9
article
Constructing Strings Avoiding Forbidden Substrings
Bernardini, Giulia
1
https://orcid.org/0000-0001-6647-088X
Marchetti-Spaccamela, Alberto
2
3
Pissis, Solon P.
1
4
3
https://orcid.org/0000-0002-1445-1932
Stougie, Leen
1
4
3
Sweering, Michelle
1
CWI, Amsterdam, The Netherlands
Dept. of Computer, Automatic and Management Engineering, Sapienza University of Rome, Italy
ERABLE Team, Lyon, France
Vrije Universiteit, Amsterdam, The Netherlands
We consider the problem of constructing strings over an alphabet Σ that start with a given prefix u, end with a given suffix v, and avoid occurrences of a given set of forbidden substrings. In the decision version of the problem, given a set S_k of forbidden substrings, each of length k, over Σ, we are asked to decide whether there exists a string x over Σ such that u is a prefix of x, v is a suffix of x, and no s ∈ S_k occurs in x. Our first result is an 𝒪(|u|+|v|+k|S_k|)-time algorithm to decide this problem. In the more general optimization version of the problem, given a set S of forbidden arbitrary-length substrings over Σ, we are asked to construct a shortest string x over Σ such that u is a prefix of x, v is a suffix of x, and no s ∈ S occurs in x. Our second result is an 𝒪(|u|+|v|+||S||⋅|Σ|)-time algorithm to solve this problem, where ||S|| denotes the total length of the elements of S.
Interestingly, our results can be directly applied to solve the reachability and shortest path problems in complete de Bruijn graphs in the presence of forbidden edges or of forbidden paths.
Our algorithms are motivated by data privacy, and in particular, by the data sanitization process. In the context of strings, sanitization consists in hiding forbidden substrings from a given string by introducing the least amount of spurious information. We consider the following problem. Given a string w of length n over Σ, an integer k, and a set S_k of forbidden substrings, each of length k, over Σ, construct a shortest string y over Σ such that no s ∈ S_k occurs in y and the sequence of all other length-k fragments occurring in w is a subsequence of the sequence of the length-k fragments occurring in y. Our third result is an 𝒪(nk|S_k|⋅|Σ|)-time algorithm to solve this problem.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.9/LIPIcs.CPM.2021.9.pdf
string algorithms
forbidden strings
de Bruijn graphs
data sanitization
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
10:1
10:19
10.4230/LIPIcs.CPM.2021.10
article
Gapped Indexing for Consecutive Occurrences
Bille, Philip
1
https://orcid.org/0000-0002-1120-5154
Gørtz, Inge Li
1
https://orcid.org/0000-0002-8322-4952
Pedersen, Max Rishøj
1
https://orcid.org/0000-0002-8850-6422
Steiner, Teresa Anna
1
https://orcid.org/0000-0003-1078-4075
Technical University of Denmark, DTU Compute, Lyngby, Denmark
The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns P₁ and P₂ and a gap range [α, β] we can quickly find the consecutive occurrences of P₁ and P₂ with distance in [α, β], i.e., pairs of subsequent occurrences with distance within the range. We present data structures that use Õ(n) space and query time Õ(|P₁|+|P₂|+n^{2/3}) for existence and counting and Õ(|P₁|+|P₂|+n^{2/3}occ^{1/3}) for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using Õ(n) space must use Ω̃(|P₁| + |P₂| + √n) query time. To obtain our results we develop new techniques and ideas of independent interest including a new suffix tree decomposition and hardness of a variant of the set intersection problem.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.10/LIPIcs.CPM.2021.10.pdf
String indexing
two patterns
consecutive occurrences
conditional lower bound
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
11:1
11:15
10.4230/LIPIcs.CPM.2021.11
article
Disorders and Permutations
Bulteau, Laurent
1
Giraudo, Samuele
1
https://orcid.org/0000-0003-3878-371X
Vialette, Stéphane
1
https://orcid.org/0000-0003-2308-6970
LIGM, Univ Gustave Eiffel, CNRS, F-77454 Marne-la-Vallée, France
The additive x-disorder of a permutation is the sum of the absolute differences of all pairs of consecutive elements. We show that the additive x-disorder of a permutation of S(n), n ≥ 2, ranges from n-1 to ⌊n²/2⌋ - 1, and we give a complete characterization of permutations having extreme such values. Moreover, for any positive integers n and d such that n ≥ 2 and n-1 ≤ d ≤ ⌊n²/2⌋ - 1, we propose a linear-time algorithm to compute a permutation π ∈ S(n) with additive x-disorder d.
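The definition and the stated range can be checked by exhaustive enumeration for small n. The sketch below is a sanity check only, not the linear-time construction from the paper; the helper names are hypothetical.

```python
from itertools import permutations


def additive_disorder(perm) -> int:
    """Sum of absolute differences of all pairs of consecutive elements."""
    return sum(abs(a - b) for a, b in zip(perm, perm[1:]))


def disorder_values(n: int) -> set[int]:
    """All additive disorder values attained over S(n), by brute force."""
    return {additive_disorder(p) for p in permutations(range(1, n + 1))}
```

For n = 4, `disorder_values(4)` is `{3, 4, 5, 6, 7}`, matching the bounds n-1 = 3 and ⌊n²/2⌋ - 1 = 7, with every value in between attained.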
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.11/LIPIcs.CPM.2021.11.pdf
Permutation
Algorithm
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
12:1
12:20
10.4230/LIPIcs.CPM.2021.12
article
Computing Covers of 2D-Strings
Charalampopoulos, Panagiotis
1
https://orcid.org/0000-0002-6024-1557
Radoszewski, Jakub
2
https://orcid.org/0000-0002-0067-6401
Rytter, Wojciech
2
https://orcid.org/0000-0002-9162-6724
Waleń, Tomasz
2
https://orcid.org/0000-0002-7369-3309
Zuba, Wiktor
2
https://orcid.org/0000-0002-1988-3507
The Interdisciplinary Center Herzliya, Israel
University of Warsaw, Poland
We consider two notions of covers of a two-dimensional string T. A (rectangular) subarray P of T is a 2D-cover of T if each position of T is in an occurrence of P in T. A one-dimensional string S is a 1D-cover of T if its vertical and horizontal occurrences in T cover all positions of T. We show how to compute the smallest-area 2D-cover of an m × n array T in the optimal 𝒪(N) time, where N = mn, all aperiodic 2D-covers of T in 𝒪(N log N) time, and all 2D-covers of T in N^{4/3}⋅ log^{𝒪(1)}N time. Further, we show how to compute all 1D-covers in the optimal 𝒪(N) time. Along the way, we show that Klee’s measure of a set of rectangles, each of width and height at least √n, on an n × n grid can be maintained in √n⋅ log^{𝒪(1)}n time per insertion or deletion of a rectangle, a result which could be of independent interest.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.12/LIPIcs.CPM.2021.12.pdf
2D-string
cover
dynamic Klee’s measure problem
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
13:1
13:16
10.4230/LIPIcs.CPM.2021.13
article
A Fast and Small Subsampled R-Index
Cobas, Dustin
1
2
https://orcid.org/0000-0001-6081-694X
Gagie, Travis
1
3
https://orcid.org/0000-0003-3689-327X
Navarro, Gonzalo
1
2
https://orcid.org/0000-0002-2286-741X
CeBiB - Center for Biotechnology and Bioengineering, Santiago, Chile
Dept. of Computer Science, University of Chile, Santiago, Chile
Dalhousie University, Halifax, Canada
The r-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, 𝒪(r) where r is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the sr-index, a variant that limits a large fraction of the space to 𝒪(min(r,n/s)) for a text of length n and a given parameter s, at the expense of multiplying by s the time per occurrence reported. The sr-index is obtained by carefully subsampling the text positions indexed by the r-index, in a way that we prove is still able to support pattern matching with guaranteed performance. Our experiments demonstrate that the sr-index sharply outperforms virtually every other compressed index on repetitive texts, both in time and space, even matching the performance of the r-index while using 1.5-3.0 times less space. Only some Lempel-Ziv-based indexes achieve better compression than the sr-index, using about half the space, but they are an order of magnitude slower.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.13/LIPIcs.CPM.2021.13.pdf
Pattern matching
r-index
compressed text indexing
repetitive text collections
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
14:1
14:15
10.4230/LIPIcs.CPM.2021.14
article
The Longest Run Subsequence Problem: Further Complexity Results
Dondi, Riccardo
1
https://orcid.org/0000-0002-6124-2965
Sikora, Florian
2
https://orcid.org/0000-0003-2670-6258
Università degli Studi di Bergamo, Bergamo, Italy
Université Paris-Dauphine, PSL University, CNRS, LAMSADE, 75016 Paris, France
Longest Run Subsequence is a problem introduced recently in the context of the scaffolding phase of genome assembly (Schrinner et al., WABI 2020). The problem asks for a maximum length subsequence of a given string that contains at most one run for each symbol (a run is a maximal substring of consecutive identical symbols). The problem has been shown to be NP-hard and to be fixed-parameter tractable when the parameter is the size of the alphabet on which the input string is defined. In this paper we further investigate the complexity of the problem and we show that it is fixed-parameter tractable when it is parameterized by the number of runs in a solution, a smaller parameter. Moreover, we investigate the kernelization complexity of Longest Run Subsequence and we prove that it does not admit a polynomial kernel when parameterized by the size of the alphabet or by the number of runs. Finally, we consider the restriction of Longest Run Subsequence when each symbol has at most two occurrences in the input string and we show that it is APX-hard.
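The problem definition can be made concrete with a small exact solver. The sketch below memoizes over the position in the string, the set of symbols already used, and the symbol of the current run, so its running time is exponential only in the alphabet size. It is a simple illustration in the spirit of alphabet-size parameterization, not the algorithm from this paper or from Schrinner et al.; the function name is hypothetical.

```python
from functools import lru_cache


def longest_run_subsequence(s: str) -> int:
    """Length of a longest subsequence of s with at most one run per symbol."""
    index = {c: b for b, c in enumerate(sorted(set(s)))}
    n = len(s)

    @lru_cache(maxsize=None)
    def best(i: int, used: int, last: int) -> int:
        # used: bitmask of symbols whose run has been started
        # last: index of the symbol of the current run, or -1 if none
        if i == n:
            return 0
        result = best(i + 1, used, last)  # skip s[i]
        b = index[s[i]]
        if b == last:
            # Extend the current run.
            result = max(result, 1 + best(i + 1, used, last))
        elif not (used >> b) & 1:
            # Start the single allowed run of a symbol not used so far.
            result = max(result, 1 + best(i + 1, used | (1 << b), b))
        return result

    return best(0, 0, -1)
```

For example, on `"abab"` the answer is 3 (e.g. `aab` or `abb`), since taking all four characters would create two runs of some symbol.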
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.14/LIPIcs.CPM.2021.14.pdf
Parameterized complexity
Kernelization
Approximation Hardness
Longest Subsequence
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
15:1
15:17
10.4230/LIPIcs.CPM.2021.15
article
Data Structures for Categorical Path Counting Queries
He, Meng
1
Kazi, Serikzhan
1
Faculty of Computer Science, Dalhousie University, Halifax, Canada
Consider an ordinal tree T on n nodes, each of which is assigned a category from an alphabet [σ] = {1,2,…,σ}. We preprocess the tree T in order to support categorical path counting queries, which ask for the number of distinct categories occurring on the path in T between two query nodes x and y. For this problem, we propose a linear-space data structure with query time O(√n lg((lg σ)/(lg w))), where w = Ω(lg n) is the word size in the word-RAM. As shown in our proof, from the assumption that matrix multiplication cannot be solved in time faster than cubic (with only combinatorial methods), our result is optimal, save for polylogarithmic speed-ups. For a trade-off parameter 1 ≤ t ≤ n, we propose an O(n+ n²/t²)-word, O(t lg ((lg σ)/(lg w))) query time data structure. We also consider c-approximate categorical path counting queries, which must return an approximation to the number of distinct categories occurring on the query path, by counting each such category at least once and at most c times. We describe a linear-space data structure that supports 2-approximate categorical path counting queries in O((lg n)/(lg lg n)) time.
Next, we generalize the categorical path counting queries to weighted trees. Here, a query specifies two nodes x,y and an orthogonal range Q. The answer to the thus-formed categorical path range counting query is the number of distinct categories occurring on the path from x to y, if only the nodes with weights falling inside Q are considered. We propose an O(n lg lg n +(n/t)⁴)-word data structure with O(t lg lg n) query time, or an O(n+(n/t)⁴)-word data structure with O(t lg^ε n) query time. For an appropriate choice of the trade-off parameter t, this implies a linear-space data structure with O(n^{3/4} lg^ε n) query time. We then extend the approach to the trees weighted with vectors from [n]^{d}, where d is a constant integer greater than or equal to 2. We present a data structure with O(n lg^{d-1+ε} n + (n/t)^{2d+2}) words of space and O(t (lg^{d-1} n)/((lg lg n)^{d-2})) query time. For an O(n⋅polylog n)-space solution, one thus has O(n^{(2d+1)/(2d+2)}⋅polylog n) query time.
The inherent difficulty revealed by the lower bound we proved motivated us to consider data structures based on sketching. In unweighted trees, we propose a sketching data structure to solve the approximate categorical path counting problem which asks for a (1±ε)-approximation (i.e. within 1±ε of the true answer) of the number of distinct categories on the given path, with probability 1-δ, where 0 < ε,δ < 1 are constants. The data structure occupies O(n+n/t lg n) words of space, for the query time of O(t lg n). For trees weighted with d-dimensional weight vectors (d ≥ 1), we propose a data structure with O((n + n/t lg n) lg^d n) words of space and O(t lg^{d+1} n) query time.
All these problems generalize the corresponding categorical range counting problems in Euclidean space ℝ^{d+1}, for respective d, by replacing one of the dimensions with a tree topology.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.15/LIPIcs.CPM.2021.15.pdf
data structures
weighted trees
path queries
categorical queries
coloured queries
categorical path counting
categorical path range counting
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
16:1
16:16
10.4230/LIPIcs.CPM.2021.16
article
Compressed Weighted de Bruijn Graphs
Italiano, Giuseppe F.
1
2
https://orcid.org/0000-0002-9492-9894
Prezza, Nicola
3
https://orcid.org/0000-0003-3553-4953
Sinaimeri, Blerina
1
2
https://orcid.org/0000-0002-9797-7592
Venturini, Rossano
4
https://orcid.org/0000-0002-9830-3936
Luiss University, Rome, Italy
Erable, INRIA Grenoble Rhône-Alpes, France
DAIS, Ca' Foscari University of Venice, Italy
Dipartimento di Informatica, Università di Pisa, Pisa, Italy
We propose a new compressed representation for weighted de Bruijn graphs, which is based on the idea of delta-encoding the variations of k-mer abundances on a spanning branching of the graph. Our new data structure is likely to be of practical value: to give an idea, when combined with the compressed BOSS de Bruijn graph representation, it encodes the weighted de Bruijn graph of a 16x-covered DNA read-set (60M distinct k-mers, k = 28) within 4.15 bits per distinct k-mer and can answer abundance queries in about 60 microseconds on a standard machine. In contrast, state of the art tools declare a space usage of at least 30 bits per distinct k-mer for the same task, which is confirmed by our experiments. As a by-product of our new data structure, we exhibit efficient compressed data structures for answering partial sums on edge-weighted trees, which might be of independent interest.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.16/LIPIcs.CPM.2021.16.pdf
weighted de Bruijn graphs
k-mer annotation
compressed data structures
partial sums
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
17:1
17:11
10.4230/LIPIcs.CPM.2021.17
article
Optimal Construction of Hierarchical Overlap Graphs
Khan, Shahbaz
1
https://orcid.org/0000-0001-9352-0088
University of Helsinki, Finland
Genome assembly is a fundamental problem in Bioinformatics, where for a given set of overlapping substrings of a genome, the aim is to reconstruct the source genome. The classical approaches to solving this problem use assembly graphs, such as de Bruijn graphs or overlap graphs, which maintain partial information about such overlaps. For genome assembly algorithms, these graphs present a trade-off between overlap information stored and scalability. Thus, the Hierarchical Overlap Graph (HOG) was proposed to overcome the limitations of both these approaches.
For a given set P of n strings, the first algorithm to compute HOG was given by Cazaux and Rivals [IPL20] requiring O(||P||+n²) time using superlinear space, where ||P|| is the cumulative sum of the lengths of strings in P. This was improved by Park et al. [SPIRE20] to O(||P||log n) time and O(||P||) space using segment trees, and further to O(||P||(log n)/(log log n)) for the word RAM model. Both these results described an open problem to compute HOG in optimal O(||P||) time and space. In this paper, we achieve the desired optimal bounds by presenting a simple algorithm that does not use any complex data structures. At its core, our solution improves the classical result [IPL92] for a special case of the All Pairs Suffix Prefix (APSP) problem from O(||P||+n²) time to optimal O(||P||) time, which may be of independent interest.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.17/LIPIcs.CPM.2021.17.pdf
Hierarchical Overlap Graphs
String algorithms
Genome assembly
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
18:1
18:19
10.4230/LIPIcs.CPM.2021.18
article
A Compact Index for Cartesian Tree Matching
Kim, Sung-Hwan
1
Cho, Hwan-Gue
1
Pusan National University, South Korea
Cartesian tree matching is a recently introduced string matching problem in which two strings match if their corresponding Cartesian trees are the same. It is considered appropriate for finding patterns according to their shapes, especially in numerical time series data. While many related problems have been addressed, developing a compact index has received relatively little attention. In this paper, we present a 3n+o(n)-bit index that can count the number of occurrences of a Cartesian tree pattern in 𝒪(m) time, where n and m are the text and pattern lengths. To the best of our knowledge, this work is the first 𝒪(n)-bit compact data structure for indexing for this problem.
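The matching relation can be illustrated with a naive counter based on the parent-distance representation known from earlier work on Cartesian tree matching: under the usual convention that ties are broken in favour of the leftmost element, two sequences have the same Cartesian tree exactly when their parent-distance arrays coincide. The sketch below recomputes the representation for every text window in O(nm) total time; it is not the compact index of this paper, and the function names are illustrative.

```python
def parent_distance(seq) -> list[int]:
    """PD[i] = i - j for the nearest j < i with seq[j] <= seq[i], or 0 if
    no such j exists. Computed with a monotone stack in linear time."""
    stack, pd = [], []
    for i, x in enumerate(seq):
        while stack and seq[stack[-1]] > x:
            stack.pop()
        pd.append(i - stack[-1] if stack else 0)
        stack.append(i)
    return pd


def count_cartesian_matches(text, pattern) -> int:
    """Naive count of text windows whose Cartesian tree equals the pattern's."""
    m = len(pattern)
    target = parent_distance(pattern)
    return sum(
        1
        for i in range(len(text) - m + 1)
        if parent_distance(text[i:i + m]) == target
    )
```

For instance, the pattern `[1, 3, 2]` has parent-distance array `[0, 1, 2]` and matches the windows `[1, 4, 2]` and `[2, 9, 3]` of the text `[5, 1, 4, 2, 9, 3]`.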
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.18/LIPIcs.CPM.2021.18.pdf
String Matching
Suffix Array
FM-index
Compact Index
Cartesian Tree Matching
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
19:1
19:18
10.4230/LIPIcs.CPM.2021.19
article
String Sanitization Under Edit Distance: Improved and Generalized
Mieno, Takuya
1
2
https://orcid.org/0000-0003-2922-9434
Pissis, Solon P.
3
4
https://orcid.org/0000-0002-1445-1932
Stougie, Leen
3
4
Sweering, Michelle
3
Kyushu University, Fukuoka, Japan
Japan Society for the Promotion of Science, Tokyo, Japan
CWI, Amsterdam, The Netherlands
Vrije Universiteit, Amsterdam, The Netherlands
Let W be a string of length n over an alphabet Σ, k be a positive integer, and 𝒮 be a set of length-k substrings of W. The ETFS problem (Edit distance, Total order, Frequency, Sanitization) asks us to construct a string X_ED such that: (i) no string of 𝒮 occurs in X_ED; (ii) the order of all other length-k substrings over Σ (and thus the frequency) is the same in W and in X_ED; and (iii) X_ED has minimal edit distance to W. When W represents an individual’s data and 𝒮 represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019].
ETFS can be solved in 𝒪(n²k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in 𝒪(n^{2-δ}) time, for any δ > 0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows:
- An 𝒪(n²log²k)-time algorithm to solve ETFS.
- An 𝒪(n²log²n)-time algorithm to solve AETFS (Arbitrary lengths, Edit distance, Total order, Frequency, Sanitization), a generalization of ETFS in which the elements of 𝒮 can have arbitrary lengths. Our algorithms are thus optimal up to subpolynomial factors, unless SETH fails.
In order to arrive at these results, we develop new techniques for computing a variant of the standard dynamic programming (DP) table for edit distance. In particular, we simulate the DP table computation using a directed acyclic graph in which every node is assigned to a smaller DP table. We then focus on redundancy in these DP tables and exploit a tabulation technique according to dyadic intervals to obtain an optimal alignment in 𝒪̃(n²) total time. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.
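For reference, the standard dynamic programming table for edit distance that the abstract takes as its starting point can be sketched as follows. This is the textbook quadratic-time table under unit costs, not the variant or the dyadic-interval tabulation developed in the paper; the function name `edit_distance_table` is hypothetical.

```python
from typing import List


def edit_distance_table(a: str, b: str) -> List[List[int]]:
    """Textbook DP table: D[i][j] is the edit distance between a[:i] and b[:j]."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # delete all i characters of a[:i]
    for j in range(m + 1):
        D[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = D[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, substitution)
    return D


if __name__ == "__main__":
    print(edit_distance_table("kitten", "sitting")[-1][-1])
```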
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.19/LIPIcs.CPM.2021.19.pdf
string algorithms
data sanitization
edit distance
dynamic programming
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
20:1
20:14
10.4230/LIPIcs.CPM.2021.20
article
An Invertible Transform for Efficient String Matching in Labeled Digraphs
Nellore, Abhinav
1
https://orcid.org/0000-0001-8145-1484
Nguyen, Austin
1
https://orcid.org/0000-0001-7940-4830
Thompson, Reid F.
1
2
https://orcid.org/0000-0003-3661-5296
Oregon Health & Science University, Portland, Oregon 97239, USA
VA Portland Healthcare System, Portland, Oregon 97239, USA
Let G = (V, E) be a digraph where each vertex is unlabeled, each edge is labeled by a character in some alphabet Ω, and any two edges with both the same head and the same tail have different labels. The powerset construction gives a transform of G into a weakly connected digraph G' = (V', E') that enables solving the decision problem of whether there is a walk in G matching an arbitrarily long query string q in time linear in |q| and independent of |E| and |V|. We show G is uniquely determined by G' when for every v_𝓁 ∈ V, there is some distinct string s_𝓁 on Ω such that v_𝓁 is the origin of a closed walk in G matching s_𝓁, and no other walk in G matches s_𝓁 unless it starts and ends at v_𝓁. We then exploit this invertibility condition to strategically alter any G so its transform G' enables retrieval of all t terminal vertices of walks in the unaltered G matching q in O(|q| + t log |V|) time. We conclude by proposing two defining properties of a class of transforms that includes the Burrows-Wheeler transform and the transform presented here.
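The decision problem described above can be illustrated with a generic subset (powerset) construction, sketched below in Python. This is the textbook construction from automata theory under the assumption that a walk may start at any vertex; it is not claimed to reproduce the paper's exact definition of G', and the names `powerset_transform` and `has_walk` as well as the edge encoding `(tail, label, head)` are hypothetical. The precomputation may produce exponentially many subset states in the worst case.

```python
from collections import defaultdict, deque
from typing import Dict, FrozenSet, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, str, Hashable]  # (tail, label, head)
State = FrozenSet[Hashable]


def powerset_transform(vertices: Iterable[Hashable], edges: Iterable[Edge]):
    """Precompute deterministic transitions between reachable vertex subsets."""
    step = defaultdict(set)  # (tail, label) -> set of heads
    labels = set()
    for tail, label, head in edges:
        step[(tail, label)].add(head)
        labels.add(label)
    start: State = frozenset(vertices)  # a walk may begin at any vertex
    trans: Dict[Tuple[State, str], State] = {}
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for c in labels:
            nxt = frozenset(h for v in state for h in step.get((v, c), ()))
            if not nxt:
                continue  # no walk continues with label c from this subset
            trans[(state, c)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return start, trans


def has_walk(start: State, trans, q: str) -> bool:
    """Decide whether some walk matches q, using one lookup per character."""
    state = start
    for c in q:
        state = trans.get((state, c))
        if state is None:
            return False
    return True


if __name__ == "__main__":
    s, t = powerset_transform({0, 1, 2}, [(0, "a", 1), (1, "b", 2), (2, "a", 0)])
    print(has_walk(s, t, "aba"), has_walk(s, t, "bb"))
```

Once the transitions are precomputed, each query costs one dictionary lookup per character of q, which is the sense in which query time is independent of |E| and |V|.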
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.20/LIPIcs.CPM.2021.20.pdf
pattern matching
string matching
Burrows-Wheeler transform
labeled graphs
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
21:1
21:21
10.4230/LIPIcs.CPM.2021.21
article
R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space
Nishimoto, Takaaki
1
Tabei, Yasuo
1
RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string has been an important research topic because there are a wide variety of applications in various areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-efficient in that their space usage is proportional to the length of the input string. Recently, the run-length encoded Burrows-Wheeler transform (RLBWT) has attracted increased attention in string processing, and various algorithms for the RLBWT have been developed. Developing enumeration algorithms for characteristic substrings with the RLBWT, however, remains a challenge. In this paper, we present r-enum (RLBWT-based enumeration), the first enumeration algorithm for characteristic substrings based on the RLBWT. R-enum runs in O(n log log (n/r)) time and with O(r log n) bits of working space for string length n and number r of runs in the RLBWT. Here, r is expected to be significantly smaller than n for highly repetitive strings (i.e., strings with many repetitions). Experiments using a benchmark dataset of highly repetitive strings show that r-enum is more space-efficient than previous methods. In addition, we demonstrate the applicability of r-enum to a huge string by performing experiments on a 300-gigabyte string of 100 human genomes.
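The quantity r used above can be made concrete with a small sketch that computes the Burrows-Wheeler transform by sorting rotations and then run-length encodes it. This is a naive construction for illustration only, assuming a sentinel character "$" that does not occur in the input and is smaller than every other character; it is unrelated to how r-enum itself works, and the function names are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: last column of the sorted rotations of text + sentinel."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s: str) -> List[Tuple[str, int]]:
    """Collapse maximal runs of equal characters into (character, length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


if __name__ == "__main__":
    transformed = bwt("banana")
    runs = run_length_encode(transformed)
    print(transformed, runs, len(runs))
```

For "banana" the transform is "annb$aa", which has r = 5 runs for n = 7 characters; on highly repetitive inputs the gap between r and n is typically much larger.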
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.21/LIPIcs.CPM.2021.21.pdf
Enumeration algorithm
Burrows-Wheeler transform
Maximal repeats
Minimal unique substrings
Minimal absent words
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
22:1
22:9
10.4230/LIPIcs.CPM.2021.22
article
A Linear Time Algorithm for Constructing Hierarchical Overlap Graphs
Park, Sangsoo
1
https://orcid.org/0000-0002-6593-4336
Park, Sung Gwan
1
https://orcid.org/0000-0002-3255-9752
Cazaux, Bastien
2
https://orcid.org/0000-0002-1761-4354
Park, Kunsoo
3
https://orcid.org/0000-0001-5225-0907
Rivals, Eric
2
https://orcid.org/0000-0003-3791-3973
Samsung Electronics, Seoul, Korea
LIRMM, Université Montpellier, CNRS, Montpellier, France
Seoul National University, Seoul, Korea
The hierarchical overlap graph (HOG) is a graph that encodes overlaps from a given set P of n strings, as the overlap graph does. The best known algorithm constructs the HOG in O(||P|| log n) time and O(||P||) space, where ||P|| is the sum of the lengths of the strings in P. In this paper we present a new algorithm to construct the HOG in O(||P||) time and space. Hence, the construction time and space of the HOG are better than those of the overlap graph, which are O(||P|| + n²).
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.22/LIPIcs.CPM.2021.22.pdf
overlap graph
hierarchical overlap graph
shortest superstring problem
border array
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
23:1
23:13
10.4230/LIPIcs.CPM.2021.23
article
Efficient Algorithms for Counting Gapped Palindromes
Popa, Andrei
1
Popa, Alexandru
1
https://orcid.org/0000-0003-3364-1210
Department of Computer Science, University of Bucharest, Romania
A gapped palindrome is a string uvu^{R}, where u^{R} represents the reverse of string u. In this paper we present three efficient algorithms for counting the occurrences of gapped palindromes in a given string S of length N. First, we present a solution in O(N) time for counting all gapped palindromes without additional constraints. Then, in the case where the length of v is constrained to be in an interval [g, G], we show an algorithm with running time O(N log N). Finally, we show an algorithm in O(N log² N) time for a more general case where we count gapped palindromes uvu^{R}, where u^{R} starts at position i with g(i) ≤ |v| ≤ G(i), for all positions i.
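To illustrate the definition with a gap-length constraint [g, G], here is a brute-force counter in Python. It is only a reference sketch and not one of the efficient algorithms of the paper. It assumes that u is non-empty and that each occurrence is identified by the triple (start position, |u|, |v|); the paper's exact counting convention may differ, and the function name is hypothetical.

```python
def count_gapped_palindromes(s: str, g: int, G: int) -> int:
    """Count triples (i, |u|, |v|) such that s[i:] starts with u v reverse(u),
    with |u| >= 1 and g <= |v| <= G. Brute force, for small inputs only."""
    n = len(s)
    count = 0
    for i in range(n):
        for lu in range(1, (n - i) // 2 + 1):
            u_rev = s[i:i + lu][::-1]
            for lv in range(g, G + 1):
                j = i + lu + lv  # start of the right arm u^R
                if j + lu > n:
                    break  # longer gaps cannot fit either
                if s[j:j + lu] == u_rev:
                    count += 1
    return count


if __name__ == "__main__":
    print(count_gapped_palindromes("abcba", 1, 1))
```

For "abcba" with g = G = 1, the counted occurrences are "ab" + "c" + "ba" starting at position 0 and "b" + "c" + "b" starting at position 1.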
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.23/LIPIcs.CPM.2021.23.pdf
pattern matching
gapped palindromes
suffix tree
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
24:1
24:21
10.4230/LIPIcs.CPM.2021.24
article
AWLCO: All-Window Length Co-Occurrence
Sobel, Joshua
1
Bertram, Noah
2
Ding, Chen
1
Nargesian, Fatemeh
1
Gildea, Daniel
1
Department of Computer Science, University of Rochester, NY, USA
Department of Mathematics, University of Rochester, NY, USA
Analyzing patterns in a sequence of events has applications in text analysis, computer programming, and genomics research. In this paper, we consider the all-window-length analysis model, which analyzes a sequence of events with respect to windows of all lengths. We study the exact co-occurrence counting problem for the all-window-length analysis model. Our first algorithm is an offline algorithm that counts all-window-length co-occurrences by performing multiple passes over a sequence and computing single-window-length co-occurrences. This algorithm has time complexity O(n) for each window length, and thus a total time complexity of O(n²), and space complexity O(|I|) for a sequence of size n and an itemset of size |I|. We propose AWLCO, an online algorithm that computes all-window-length co-occurrences in a single pass with time complexity O(n) and space complexity O(√(n|I|)), assuming perfect hashing. Following this, we generalize our use case to patterns, for which we propose an algorithm that computes all-window-length co-occurrences with time complexity O(n|I|), assuming perfect hashing, with an additional pre-processing step and space complexity O(√(n|I|)+|I|), plus the overhead of the Aho-Corasick algorithm [Aho and Corasick, 1975].
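The offline multi-pass baseline described above can be sketched as follows. The sketch assumes that the co-occurrence count for window length w is the number of length-w windows that contain every item of the itemset at least once; this reading of the abstract, like the function names, is an assumption. Each pass is a sliding-window scan in O(n) time with O(|I|) counters, and repeating it for every window length gives the O(n²) total. The online AWLCO algorithm itself is not shown.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Sequence


def cooccurrence_count(seq: Sequence[Hashable], itemset: Iterable[Hashable], w: int) -> int:
    """Number of length-w windows of seq containing every item of itemset."""
    need = set(itemset)
    counts: Counter = Counter()  # occurrences of needed items in the current window
    present = 0                  # how many needed items currently occur at least once
    total = 0
    for i, x in enumerate(seq):
        if x in need:
            counts[x] += 1
            if counts[x] == 1:
                present += 1
        if i >= w:  # element leaving the window
            y = seq[i - w]
            if y in need:
                counts[y] -= 1
                if counts[y] == 0:
                    present -= 1
        if i >= w - 1 and present == len(need):
            total += 1
    return total


def all_window_length_counts(seq: Sequence[Hashable], itemset: Iterable[Hashable]) -> Dict[int, int]:
    """Offline baseline: one O(n) pass per window length."""
    items = set(itemset)
    return {w: cooccurrence_count(seq, items, w) for w in range(1, len(seq) + 1)}


if __name__ == "__main__":
    print(all_window_length_counts(list("abcab"), {"a", "b"}))
```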
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.24/LIPIcs.CPM.2021.24.pdf
Itemsets
Data Sequences
Co-occurrence
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2021-06-30
191
25:1
25:23
10.4230/LIPIcs.CPM.2021.25
article
Optimal Completion and Comparison of Incomplete Phylogenetic Trees Under Robinson-Foulds Distance
Yao, Keegan
1
Bansal, Mukul S.
1
Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA
The comparison of phylogenetic trees is a fundamental task in phylogenetics and evolutionary biology. In many cases, these comparisons involve trees inferred on the same set of leaves, and many distance measures exist to facilitate such comparisons. However, several applications in phylogenetics require the comparison of trees that have non-identical leaf sets. The traditional approach for handling such comparisons is to first restrict the two trees being compared to just their common leaf set. An alternative, conceptually superior approach that has shown promise is to first complete the trees by adding missing leaves so that the completed trees have identical leaf sets. This alternative approach requires the computation of optimal completions of the two trees that minimize the distance between them. However, no polynomial-time algorithms currently exist for this optimal completion problem under any standard phylogenetic distance measure.
In this work, we provide the first polynomial-time algorithms for the above problem under the widely used Robinson-Foulds (RF) distance measure. This hitherto unsolved problem is referred to as the RF(+) problem. We (i) show that a recently proposed linear-time algorithm for a restricted version of the RF(+) problem is a 2-approximation for the RF(+) problem, and (ii) provide an exact O(nk²)-time algorithm for the RF(+) problem, where n is the total number of distinct leaf labels in the two trees being compared and k, bounded above by n, depends on the topologies and leaf set overlap of the two trees. Our results hold for both rooted and unrooted binary trees.
We implemented our exact algorithm and applied it to several biological datasets. Our results show that completion-based RF distance can lead to very different inferences regarding phylogenetic similarity compared to traditional RF distance. An open-source implementation of our algorithms is freely available from https://compbio.engr.uconn.edu/software/RF_plus.
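As background for the comparison discussed above, the traditional Robinson-Foulds distance between two rooted trees on the same leaf set can be sketched as the size of the symmetric difference of their cluster sets. The code below covers only this classical setting and does not perform any tree completion, so it is not the RF(+) algorithm of the paper. Trees are encoded as nested tuples with leaf labels as strings, which is an assumption made for the example, as are the function names.

```python
from typing import FrozenSet, Set


def clusters(tree) -> Set[FrozenSet[str]]:
    """Set of leaf sets (clusters) induced by the internal nodes of a rooted tree."""
    out: Set[FrozenSet[str]] = set()

    def visit(node) -> FrozenSet[str]:
        if isinstance(node, tuple):  # internal node: union of the children's leaf sets
            leaves = frozenset().union(*(visit(child) for child in node))
            out.add(leaves)
            return leaves
        return frozenset([node])  # leaf

    visit(tree)
    return out


def rf_distance(t1, t2) -> int:
    """Classical RF distance: clusters present in exactly one of the two trees."""
    return len(clusters(t1) ^ clusters(t2))


if __name__ == "__main__":
    t1 = (("a", "b"), ("c", "d"))
    t2 = (("a", "c"), ("b", "d"))
    print(rf_distance(t1, t2))
```

The traditional approach mentioned in the abstract would first restrict both trees to their common leaves and then apply a computation of this kind; the completion-based approach instead adds the missing leaves before comparing.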
https://drops.dagstuhl.de/storage/00lipics/lipics-vol191-cpm2021/LIPIcs.CPM.2021.25/LIPIcs.CPM.2021.25.pdf
Phylogenetic tree comparison
Robinson-Foulds Distance
Optimal tree completion
Algorithms
Dynamic programming