eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
0
0
10.4230/LIPIcs.CPM.2017
article
LIPIcs, Volume 78, CPM'17, Complete Volume
Kärkkäinen, Juha
Radoszewski, Jakub
Rytter, Wojciech
LIPIcs, Volume 78, CPM'17, Complete Volume
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017/LIPIcs.CPM.2017.pdf
Data Structures, Data Storage Representations, Coding and Information Theory, Theory of Computation, Discrete Mathematics, Information Systems,
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
0:i
0:xvi
10.4230/LIPIcs.CPM.2017.0
article
Front Matter, Table of Contents, Preface, Conference Organization, External Reviewers
Kärkkäinen, Juha
Radoszewski, Jakub
Rytter, Wojciech
Front Matter, Table of Contents, Preface, Conference Organization, External Reviewers
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.0/LIPIcs.CPM.2017.0.pdf
Front Matter
Table of Contents
Preface
Conference Organization
External Reviewers
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
1:1
1:1
10.4230/LIPIcs.CPM.2017.1
article
Wheeler Graphs: Variations on a Theme by Burrows and Wheeler
Manzini, Giovanni
The famous Burrows-Wheeler Transform was originally defined for single strings but variations have been developed for sets of strings, labelled trees, de Bruijn graphs, alignments, etc. In this talk we propose a unifying view that includes many of these variations and that we hope will simplify the search for more.
Somewhat surprisingly we get our unifying view by considering the Nondeterministic Finite Automata related to different pattern-matching problems. We show that the state graphs associated with these automata have common properties that we summarize with the concept of a Wheeler graph. Using the notion of a Wheeler graph, we show that it is possible to process strings efficiently even if the automaton is nondeterministic. In addition, we show that Wheeler graphs can be compactly represented and traversed using up to three arrays with additional data structures supporting efficient rank and select operations. It turns out that these arrays coincide with, or are substantially equivalent to, the output of many Burrows-Wheeler Transform variants described in the literature.
This is joint work with Travis Gagie and Jouni Sirén.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.1/LIPIcs.CPM.2017.1.pdf
compressed data structures
pattern matching
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
2:1
2:1
10.4230/LIPIcs.CPM.2017.2
article
Recompression of SLPs
Jez, Artur
In this talk I will survey the recompression technique in the case of SLPs. The technique is based on applying simple compression operations (replacement of pairs of two different letters by a new letter and replacement of a maximal repetition of a letter by a new symbol) to strings represented by SLPs. To this end we modify the SLPs, so that performing such compression operations on SLPs is possible. For instance, when we want to replace ab in the string and the SLP has a production X -> aY and the string generated by Y is bw, then we alter the rule of Y so that it generates w and replace Y with bY in all rules. In this way the rule becomes X -> abY, and so ab can be replaced; similar operations are defined for the right sides of the nonterminals. As a result, we are interested mostly in the SLP representation rather than the string itself and its combinatorial properties. What we need to control, though, is the size of the SLP. With appropriate choices of substrings to be compressed, it can be shown that it stays linear.
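For concreteness, the two compression operations mentioned above can be sketched on a plain (uncompressed) string; this is only an illustrative baseline, since the point of the technique is performing these operations on the SLP itself without decompressing. The function names and the tuple encoding of fresh symbols are ours:

```python
# Minimal sketch of recompression's two basic operations, applied
# directly to a plain string rather than to an SLP (illustrative only).

def block_compression(s):
    """Replace every maximal run of a single letter (length >= 2)
    by a fresh symbol, e.g. 'aaab' -> ('run', 'a', 3), 'b'."""
    out, fresh = [], {}
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        if j - i >= 2:
            # reuse the same fresh symbol for identical runs
            out.append(fresh.setdefault((s[i], j - i), ('run', s[i], j - i)))
        else:
            out.append(s[i])
        i = j
    return out

def pair_compression(s, a, b):
    """Replace every occurrence of the pair ab (a != b) by a fresh symbol."""
    out, i = [], 0
    while i < len(s):
        if i + 1 < len(s) and s[i] == a and s[i + 1] == b:
            out.append(('pair', a, b))
            i += 2
        else:
            out.append(s[i])
            i += 1
    return out
```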
The proposed method turned out to be surprisingly efficient and applicable in various scenarios: for instance it can be used to test the equality of SLPs in time O(n log N), where n is the size of the SLP and N the length of the generated string; on the other hand it can be used to approximate the smallest SLP for a given string, with the approximation ratio O(log(n/g)) where n is the length of the string and g the size of the smallest SLP for this string, matching the best known bounds.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.2/LIPIcs.CPM.2017.2.pdf
Straight Line Programs
smallest grammar problem
compression
processing compressed data
recompression
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
3:1
3:1
10.4230/LIPIcs.CPM.2017.3
article
Shortest Superstring
Mucha, Marcin
In the Shortest Superstring problem (SS) one has to find a shortest string s containing given strings s_1,...,s_n as substrings. The problem is NP-hard, so a natural question is that of its approximability.
One natural approach to approximately solving SS is the following GREEDY heuristic: repeatedly merge the two strings with the largest overlap until only a single string is left. This heuristic is conjectured to be a 2-approximation, but even 30 years after the conjecture was posed, we are still far from proving it. The situation is better for non-greedy approximation algorithms, where several approaches yielding a 2.5-approximation (and better) are known.
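A minimal brute-force sketch of the GREEDY heuristic described above, with quadratic overlap computation (illustrative only, not a tuned implementation from the literature):

```python
# GREEDY superstring heuristic: repeatedly merge the two strings
# with the largest overlap until one string remains.

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(strings):
    # Drop strings contained in others (they contribute nothing),
    # then deduplicate while keeping order.
    ss = [s for s in strings
          if not any(s != t and s in t for t in strings)]
    ss = list(dict.fromkeys(ss))
    while len(ss) > 1:
        # Pick the ordered pair (i, j) with the largest overlap.
        k, i, j = max(((overlap(a, b), i, j)
                       for i, a in enumerate(ss)
                       for j, b in enumerate(ss) if i != j),
                      key=lambda x: x[0])
        merged = ss[i] + ss[j][k:]
        ss = [s for idx, s in enumerate(ss) if idx not in (i, j)] + [merged]
    return ss[0] if ss else ""
```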
In this talk, we will survey the main results in the area, focusing on the fundamental ideas and intuitions.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.3/LIPIcs.CPM.2017.3.pdf
shortest superstring
approximation algorithms
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
4:1
4:13
10.4230/LIPIcs.CPM.2017.4
article
Document Listing on Repetitive Collections with Guaranteed Performance
Navarro, Gonzalo
We consider document listing on string collections, that is, finding in which strings a given pattern appears. In particular, we focus on repetitive collections: a collection of size N over alphabet [1,a] is composed of D copies of a string of size n, and s single-character edits are applied on the copies. We introduce the first document listing index with size O~(n + s), precisely O((n lg a + s lg^2 N) lg D) bits, and with useful worst-case time guarantees: Given a pattern of length m, the index reports the ndoc strings where it appears in time O(m^2 + m lg N (lg D + lg^e N) ndoc), for any constant e > 0.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.4/LIPIcs.CPM.2017.4.pdf
repetitive string collections
document listing
grammar compression
range minimum queries
succinct data structures
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
5:1
5:15
10.4230/LIPIcs.CPM.2017.5
article
Path Queries on Functions
Gagie, Travis
He, Meng
Navarro, Gonzalo
Let f : [1..n] -> [1..n] be a function, and l : [1..n] -> [1..s] indicate a label assigned to each element of the domain. We design several compact data structures that answer various queries on the labels of paths in f. For example, we can find the minimum label in f^k(i) for a given i and any k >= 0 in a given range [k1..k2], using n lg n + O(n) bits, or the minimum label in f^(-k)(i) for a given i and k > 0, using 2n lg n + O(n) bits, both in time O(lg n / lg lg n). By using n lg s + o(n lg s) further bits, we can also count, within the same time, the number of elements within a range of labels, and report each such element in O(1 + lg s / lg lg n) additional time. Several other possible queries are considered, such as top-t queries and t-majorities.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.5/LIPIcs.CPM.2017.5.pdf
succinct data structures
integer functions
range queries
trees and permutations
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
6:1
6:11
10.4230/LIPIcs.CPM.2017.6
article
Deterministic Indexing for Packed Strings
Bille, Philip
Gørtz, Inge Li
Skjoldjensen, Frederik Rye
Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In the deterministic variant the goal is to solve the string indexing problem without any randomization (at preprocessing time or query time). In the packed variant the strings are stored with several characters in a single word, giving us the opportunity to read multiple characters simultaneously. Our main result is a new string index in the deterministic and packed setting. Given a packed string S of length n over an alphabet of size s, we show how to preprocess S in O(n) (deterministic) time and space O(n) such that, given a packed pattern string of length m, we can support queries in (deterministic) time O(m/a + log m + log log s), where a = w/log s is the number of characters packed in a word of size w = log n. Our query time is always at least as good as the previous best known bounds, and whenever several characters are packed in a word, i.e., log s << w, the query times are faster.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.6/LIPIcs.CPM.2017.6.pdf
suffix tree
suffix array
deterministic algorithm
word packing
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
7:1
7:13
10.4230/LIPIcs.CPM.2017.7
article
Representing the Suffix Tree with the CDAWG
Belazzougui, Djamal
Cunial, Fabio
Given a string T, it is known that its suffix tree can be represented using the compact directed acyclic word graph (CDAWG) with e_T arcs, taking overall O(e_T + e_REV(T)) words of space, where REV(T) is the reverse of T, and supporting some key operations in time between O(1) and O(log(log(n))) in the worst case. This representation is especially appealing for highly repetitive strings, like collections of similar genomes or of version-controlled documents, in which e_T grows sublinearly in the length of T in practice. In this paper we augment this representation, supporting a number of additional queries in worst-case time between O(1) and O(log(n)) in the RAM model, without increasing space complexity asymptotically. Our technique, based on a heavy path decomposition of the suffix tree, also enables a representation of the suffix array, of the inverse suffix array, and of T itself, that takes O(e_T) words of space, and that supports random access in O(log(n)) time. Furthermore, we establish a connection between the reversed CDAWG of T and a context-free grammar that produces T and only T, which might have independent interest.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.7/LIPIcs.CPM.2017.7.pdf
CDAWG
suffix tree
heavy path decomposition
maximal repeat
context-free grammar
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
8:1
8:13
10.4230/LIPIcs.CPM.2017.8
article
Position Heaps for Parameterized Strings
Diptarama, Diptarama
Katsura, Takashi
Otomo, Yuhei
Narisawa, Kazuyuki
Shinohara, Ayumi
We propose a new indexing structure for parameterized strings, called the parameterized position heap. The parameterized position heap is applicable to the parameterized pattern matching problem, where the pattern matches a substring of the text if there exists a bijective mapping from the symbols of the pattern to the symbols of the substring. We propose an online construction algorithm for the parameterized position heap of a text and show that our algorithm runs in linear time with respect to the text size. We also show that by using the parameterized position heap, we can find all occurrences of a pattern in the text in time linear in the product of the pattern size and the alphabet size.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.8/LIPIcs.CPM.2017.8.pdf
string matching
indexing structure
parameterized pattern matching
position heap
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
9:1
9:14
10.4230/LIPIcs.CPM.2017.9
article
On-Line Pattern Matching on Similar Texts
Grossi, Roberto
Iliopoulos, Costas S.
Liu, Chang
Pisanti, Nadia
Pissis, Solon P.
Retha, Ahmad
Rosone, Giovanna
Vayani, Fatima
Versari, Luca
Pattern matching on a set of similar texts has received much attention, especially recently, mainly due to its application in cataloguing human genetic variation. In particular, many different algorithms have been proposed for the off-line version of this problem; that is, constructing a compressed index for a set of similar texts in order to answer pattern matching queries efficiently. However, the more fundamental on-line version of this problem is a rather undeveloped topic. Solutions to the on-line version can be beneficial for a number of reasons; for instance, efficient on-line solutions can be used in combination with partial indexes as practical trade-offs. We make here an attempt to close this gap by proposing two efficient algorithms for this problem. Notably, one of the algorithms requires time linear in the size of the texts' representation, for short patterns. Furthermore, experimental results confirm our theoretical findings in practical terms.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.9/LIPIcs.CPM.2017.9.pdf
string algorithms
pattern matching
degenerate strings
elastic-degenerate strings
on-line algorithms
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
10:1
10:14
10.4230/LIPIcs.CPM.2017.10
article
A Family of Approximation Algorithms for the Maximum Duo-Preservation String Mapping Problem
Dudek, Bartlomiej
Gawrychowski, Pawel
Ostropolski-Nalewaja, Piotr
In the Maximum Duo-Preservation String Mapping problem we are given two strings and wish to map the letters of the former to the letters of the latter so as to maximise the number of duos. A duo is a pair of consecutive letters that is mapped to a pair of consecutive letters in the same order. This is complementary to the well-studied Minimum Common String Partition problem, where the goal is to partition the former string into blocks that can be permuted and concatenated to obtain the latter string.
Maximum Duo-Preservation String Mapping is APX-hard. After a series of improvements, Brubach [WABI 2016] showed a polynomial-time 3.25-approximation algorithm. Our main contribution is that, for any eps>0, there exists a polynomial-time (2+eps)-approximation algorithm. Similarly to a previous solution by Boria et al. [CPM 2016], our algorithm uses the local search technique. However, this is used only after a certain preliminary greedy procedure, which gives us more structure and makes a more general local search possible. We complement this with a specialised version of the algorithm that achieves 2.67-approximation in quadratic time.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.10/LIPIcs.CPM.2017.10.pdf
approximation scheme
minimum common string partition
local search
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
11:1
11:17
10.4230/LIPIcs.CPM.2017.11
article
Revisiting the Parameterized Complexity of Maximum-Duo Preservation String Mapping
Komusiewicz, Christian
de Oliveira Oliveira, Mateus
Zehavi, Meirav
In the Maximum-Duo Preservation String Mapping (Max-Duo PSM) problem, the input consists of two related strings A and B of length n and a nonnegative integer k. The objective is to determine whether there exists a mapping m from the set of positions of A to the set of positions of B that maps only to positions with the same character and preserves at least k duos, which are pairs of adjacent positions. We develop a randomized algorithm that solves Max-Duo PSM in time 4^k * n^{O(1)}, and a deterministic algorithm that solves this problem in time 6.855^k * n^{O(1)}. The previous best known (deterministic) algorithm for this problem has running time (8e)^{2k+o(k)} * n^{O(1)} [Beretta et al., Theor. Comput. Sci. 2016]. We also show that Max-Duo PSM admits a problem kernel of size O(k^3), improving upon the previous best known problem kernel of size O(k^6).
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.11/LIPIcs.CPM.2017.11.pdf
comparative genomics
parameterized complexity
kernelization
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
12:1
12:14
10.4230/LIPIcs.CPM.2017.12
article
Clique-Based Lower Bounds for Parsing Tree-Adjoining Grammars
Bringmann, Karl
Wellnitz, Philip
Tree-adjoining grammars are a generalization of context-free grammars that are well suited to model human languages and are thus popular in computational linguistics. In the tree-adjoining grammar recognition problem, given a grammar G and a string s of length n, the task is to decide whether s can be obtained from G. Rajasekaran and Yooseph's parser (JCSS'98) solves this problem in time O(n^{2w}), where w < 2.373 is the matrix multiplication exponent. The best algorithms avoiding fast matrix multiplication take time O(n^6). The first evidence for hardness was given by Satta (J. Comp. Linguist.'94): For a more general parsing problem, any algorithm that avoids fast matrix multiplication and is significantly faster than O(|G|·n^6) in the case of |G| = Theta(n^12) would imply a breakthrough for Boolean matrix multiplication. Following an approach by Abboud et al. (FOCS'15) for context-free grammar recognition, in this paper we resolve many of the disadvantages of the previous lower bound. We show that, even on constant-size grammars, any improvement on Rajasekaran and Yooseph's parser would imply a breakthrough for the k-Clique problem. This establishes tree-adjoining grammar parsing as a practically relevant problem with the unusual running time of n^{2w}, up to lower order factors.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.12/LIPIcs.CPM.2017.12.pdf
conditional lower bounds
k-Clique
parsing
tree-adjoining grammars
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
13:1
13:11
10.4230/LIPIcs.CPM.2017.13
article
Communication and Streaming Complexity of Approximate Pattern Matching
Starikovskaya, Tatiana
We consider the approximate pattern matching problem. Given a text T of length 2n and a pattern P of length n, the task is to decide for each prefix T[1, j] of T if it ends with a string that is at the edit distance at most k from P. If this is the case, we must output the edit distance and the corresponding edit operations. We first show the communication complexity of the problem for the case when Alice and Bob both share the pattern and Alice holds the first half of the text and Bob the second half, and for the case when Alice holds the first half of the text, Bob the second half of the text, and Charlie the pattern. We then develop the first sublinear-space streaming algorithm for the problem. The algorithm is randomised with error probability at most 1/poly(n).
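As background, the per-prefix quantity defined above can be computed offline by the classic O(n^2)-time dynamic program for approximate matching, with row 0 initialised to zero so a match may start anywhere in the text; this baseline is what the communication and streaming bounds improve on. The sketch below is ours, not from the paper:

```python
# D[i][j] = minimum edit distance between P[1..i] and any suffix of
# T[1..j]. Row i is computed from row i-1; the bottom row gives, for
# each prefix T[1..j], the distance asked about above.

def approx_match_distances(pattern, text):
    m = len(pattern)
    prev = [0] * (len(text) + 1)          # D[0][j] = 0: match may start anywhere
    for i in range(1, m + 1):
        cur = [i] + [0] * len(text)       # D[i][0] = i
        for j in range(1, len(text) + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete from pattern
                         cur[j - 1] + 1,      # insert into pattern
                         prev[j - 1] + cost)  # substitute / match
        prev = cur
    return prev[1:]   # distance for each prefix T[1..j], j = 1..n
```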
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.13/LIPIcs.CPM.2017.13.pdf
approximate pattern matching
edit distance
randomised algorithms
streaming algorithms
communication complexity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
14:1
14:13
10.4230/LIPIcs.CPM.2017.14
article
The Longest Filled Common Subsequence Problem
Castelli, Mauro
Dondi, Riccardo
Mauri, Giancarlo
Zoppis, Italo
Inspired by a recent approach for genome reconstruction from incomplete data, we consider a variant of the longest common subsequence problem for the comparison of two sequences, one of which is incomplete, i.e. it has some missing elements. The new combinatorial problem, called Longest Filled Common Subsequence, given two sequences A and B, and a multiset M of symbols missing in B, asks for a sequence B* obtained by inserting the symbols of M into B so that B* induces a common subsequence with A of maximum length. First, we investigate the computational and approximation complexity of the problem and we show that it is NP-hard and APX-hard when A contains at most two occurrences of each symbol. Then, we give a 3/5 approximation algorithm for the problem. Finally, we present a fixed-parameter algorithm, when the problem is parameterized by the number of symbols inserted in B that "match" symbols of A.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.14/LIPIcs.CPM.2017.14.pdf
longest common subsequence
approximation algorithms
computational complexity
fixed-parameter algorithms
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
15:1
15:11
10.4230/LIPIcs.CPM.2017.15
article
Lempel-Ziv Compression in a Sliding Window
Bille, Philip
Cording, Patrick Hagge
Fischer, Johannes
Gørtz, Inge Li
We present new algorithms for the sliding window Lempel-Ziv (LZ77) problem and the approximate rightmost LZ77 parsing problem.
Our main result is a new and surprisingly simple algorithm that computes the sliding window LZ77 parse in O(w) space and either O(n) expected time or O(n log log w + z log log s) deterministic time. Here, w is the window size, n is the size of the input string, z is the number of phrases in the parse, and s is the size of the alphabet. This matches the space and time bounds of previous results while removing constant size restrictions on the alphabet size.
To achieve our result, we combine a simple modification and augmentation of the suffix tree with periodicity properties of sliding windows. We also apply this new technique to obtain an algorithm for the approximate rightmost LZ77 problem that uses O(n(log z + log log n)) time and O(n) space and produces a (1+e)-approximation of the rightmost parsing (any constant e>0). While this does not improve the best known time-space trade-offs for exact rightmost parsing, our algorithm is significantly simpler and exposes a direct connection between sliding window parsing and the approximate rightmost matching problem.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.15/LIPIcs.CPM.2017.15.pdf
Lempel-Ziv parsing
sliding window
rightmost matching
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
16:1
16:17
10.4230/LIPIcs.CPM.2017.16
article
Time-Space Trade-Offs for Lempel-Ziv Compressed Indexing
Bille, Philip
Ettienne, Mikko Berggren
Gørtz, Inge Li
Vildhøj, Hjalte Wedel
Given a string S, the compressed indexing problem is to preprocess S into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of S while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. Let n and z denote the sizes of the input string and the compressed LZ77 string, respectively. We obtain the following time-space trade-offs. Given a pattern string P of length m, we can solve the problem in
(i) O(m + occ lglg n) time using O(z lg(n/z) lglg z) space, or
(ii) O(m(1 + lg^e z / lg(n/z)) + occ(lglg n + lg^e z)) time using O(z lg(n/z)) space, for any 0 < e < 1.
In particular, (i) improves the leading term in the query time of the previous best solution from O(m lg m) to O(m) at the cost of increasing the space by a factor lglg z. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of O(m(1+lg^e z / lg(n/z))). However, for any polynomial compression ratio, i.e., z = O(n^{1-d}), for constant d > 0, this becomes O(m). Our index also supports extraction of any substring of length l in O(l + lg(n/z)) time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.16/LIPIcs.CPM.2017.16.pdf
compressed indexing
pattern matching
LZ77
prefix search
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
17:1
17:10
10.4230/LIPIcs.CPM.2017.17
article
From LZ77 to the Run-Length Encoded Burrows-Wheeler Transform, and Back
Policriti, Alberto
Prezza, Nicola
The Lempel-Ziv factorization (LZ77) and the Run-Length encoded Burrows-Wheeler Transform (RLBWT) are two important tools in text compression and indexing, whose sizes z and r are closely related to the degree of self-repetitiveness of the text. In this paper we consider the problem of converting the two representations into each other within a working space proportional to the input and the output. Let n be the text length. We show that RLBWT can be converted to LZ77 in O(n log r) time and O(r) words of working space. Conversely, we provide an algorithm to convert LZ77 to RLBWT in O(n(log r + log z)) time and O(r+z) words of working space. Note that r and z can be constant if the text is highly repetitive, and our algorithms can operate with (up to) exponentially less space than naive solutions based on full decompression.
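For reference, the two quantities r and z can be computed naively from the uncompressed text as follows, using quadratic time and full decompression (exactly what the paper's algorithms avoid). The LZ77 variant shown is the simple greedy non-self-referential one; the code is illustrative, not from the paper:

```python
# Naive reference computations of r (RLBWT runs) and z (LZ77 phrases).

def rlbwt(text):
    """Run-length encoded BWT via sorted rotations, assuming the end
    marker '$' does not occur in text. r = len(result)."""
    s = text + "$"
    last = [rot[-1] for rot in sorted(s[i:] + s[:i] for i in range(len(s)))]
    runs = []
    for c in last:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return runs

def lz77(text):
    """Greedy non-self-referential LZ77: each phrase is the longest
    prefix of the remaining suffix occurring in the processed prefix,
    or a single fresh letter. z = len(result)."""
    i, phrases = 0, []
    while i < len(text):
        l = 0
        while i + l < len(text) and text[i:i + l + 1] in text[:i]:
            l += 1
        phrases.append(text[i:i + max(l, 1)])
        i += max(l, 1)
    return phrases
```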
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.17/LIPIcs.CPM.2017.17.pdf
Lempel-Ziv
Burrows-Wheeler transform
compressed computation
repetitive text collections
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
18:1
18:15
10.4230/LIPIcs.CPM.2017.18
article
Longest Common Extensions with Recompression
I, Tomohiro
Given two positions i and j in a string T of length N, a longest common extension (LCE) query asks for the length of the longest common prefix between suffixes beginning at i and j. A compressed LCE data structure stores T in a compressed form while supporting fast LCE queries. In this article we show that the recompression technique is a powerful tool for compressed LCE data structures. We present a new compressed LCE data structure of size O(z lg (N/z)) that supports LCE queries in O(lg N) time, where z is the size of Lempel-Ziv 77 factorization without self-reference of T. Given T as an uncompressed form, we show how to build our data structure in O(N) time and space. Given T as a grammar compressed form, i.e., a straight-line program of size n generating T, we show how to build our data structure in O(n lg (N/n)) time and O(n + z lg (N/z)) space. Our algorithms are deterministic and always return correct answers.
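The query itself, stated naively on the uncompressed text (O(answer) time per query; the paper's contribution is supporting it on a compressed representation of T):

```python
# Longest common extension: length of the longest common prefix of the
# suffixes of T starting at positions i and j (0-based).

def lce(T, i, j):
    l = 0
    while i + l < len(T) and j + l < len(T) and T[i + l] == T[j + l]:
        l += 1
    return l
```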
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.18/LIPIcs.CPM.2017.18.pdf
Longest Common Extension (LCE) queries
compressed data structure
grammar compressed strings
Straight-Line Program (SLP)
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
19:1
19:9
10.4230/LIPIcs.CPM.2017.19
article
Fast and Simple Jumbled Indexing for Binary Run-Length Encoded Strings
Cunha, Luís
Dantas, Simone
Gagie, Travis
Wittler, Roland
Kowada, Luis
Stoye, Jens
Important papers have appeared recently on the problem of indexing binary strings for jumbled pattern matching, and further lowering the time bounds in terms of the input size would now be a breakthrough with broad implications. We can still make progress on the problem, however, by considering other natural parameters. Badkobeh et al. (IPL, 2013) and Amir et al. (TCS, 2016) gave algorithms that index a binary string in O(n + r^2 log r) time, where n is the length and r is the number of runs, and Giaquinta and Grabowski (IPL, 2013) gave one that runs in O(n + r^2) time. In this paper we propose a new and very simple algorithm that also runs in O(n + r^2) time and can be extended either so that the index returns the position of a match (if there is one), or so that the algorithm uses only O(n) bits of space instead of O(n) words.
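As background, a standard binary jumbled index stores, for each window length, the minimum and maximum number of 1s over all windows of that length; by the well-known interval property of binary strings, a query (length, weight) matches iff the weight lies between those bounds. A naive O(n^2)-time construction, for illustration (the papers above achieve O(n + r^2)):

```python
# Binary jumbled index: idx[l] = (min, max) number of 1s over all
# windows of length l, built from a prefix-sum array.

def build_index(s):
    ones = [0]
    for c in s:
        ones.append(ones[-1] + (c == '1'))
    idx = {}
    for l in range(1, len(s) + 1):
        counts = [ones[i + l] - ones[i] for i in range(len(s) - l + 1)]
        idx[l] = (min(counts), max(counts))
    return idx

def query(idx, l, k):
    """Does some substring of length l contain exactly k ones?"""
    if l not in idx:
        return False
    lo, hi = idx[l]
    return lo <= k <= hi
```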
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.19/LIPIcs.CPM.2017.19.pdf
string algorithms
indexing
jumbled pattern matching
run-length encoding
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
20:1
20:12
10.4230/LIPIcs.CPM.2017.20
article
Faster STR-IC-LCS Computation via RLE
Kuboi, Keita
Fujishige, Yuta
Inenaga, Shunsuke
Bannai, Hideo
Takeda, Masayuki
The constrained LCS problem asks one to find a longest common subsequence of two input strings A and B with some constraints. The STR-IC-LCS problem is a variant of the constrained LCS problem, where the solution must include a given constraint string C as a substring. Given two strings A and B of respective lengths M and N, and a constraint string C of length at most min{M, N}, the best known algorithm for the STR-IC-LCS problem, proposed by Deorowicz (Inf. Process. Lett., 11:423-426, 2012), runs in O(MN) time. In this work, we present an O(mN + nM)-time solution to the STR-IC-LCS problem, where m and n denote the sizes of the run-length encodings of A and B, respectively. Since m <= M and n <= N always hold, our algorithm is always as fast as Deorowicz's algorithm, and is faster when input strings are compressible via RLE.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.20/LIPIcs.CPM.2017.20.pdf
longest common subsequence
STR-IC-LCS
run-length encoding
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
21:1
21:12
10.4230/LIPIcs.CPM.2017.21
article
Gapped Pattern Statistics
Duchon, Philippe
Nicaud, Cyril
Pivoteau, Carine
We give a probabilistic analysis of parameters related to alpha-gapped repeats and palindromes in random words, under both uniform and memoryless distributions (where letters have different probabilities, but are drawn independently). More precisely, we study the expected number of maximal alpha-gapped patterns, as well as the expected length of the longest alpha-gapped pattern in a random word.
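For concreteness: an alpha-gapped repeat is commonly defined as a factor uvu with |uv| <= alpha·|u|; this definition is recalled here as an assumption, since the abstract does not spell it out. A brute-force enumerator over a word (ours, illustrative only; it lists all such repeats, not only maximal ones):

```python
# Enumerate alpha-gapped repeats u v u as triples (i, j, ul):
# left copy u starts at i, right copy at j, |u| = ul, and the
# arm-plus-gap length j - i satisfies j - i <= alpha * ul.

def alpha_gapped_repeats(s, alpha):
    reps = set()
    n = len(s)
    for i in range(n):
        for ul in range(1, (n - i) // 2 + 1):
            for j in range(i + ul, n - ul + 1):
                if j - i <= alpha * ul and s[j:j + ul] == s[i:i + ul]:
                    reps.add((i, j, ul))
    return reps
```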
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.21/LIPIcs.CPM.2017.21.pdf
combinatorics on words
alpha-gapped repeats
random words
memoryless sources
analytic combinatorics
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
22:1
22:18
10.4230/LIPIcs.CPM.2017.22
article
Computing All Distinct Squares in Linear Time for Integer Alphabets
Bannai, Hideo
Inenaga, Shunsuke
Köppl, Dominik
Given a string on an integer alphabet, we present an algorithm that computes the set of all distinct squares belonging to this string in time linear in the string length. As an application, we show how to compute the tree topology of the minimal augmented suffix tree in linear time. Aside from that, we elaborate an algorithm computing the longest previous table in a succinct representation using compressed working space.
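A brute-force reference for the set in question (cubic time; it only illustrates the object being counted, while the paper computes it in linear time on integer alphabets):

```python
# Collect every distinct square, i.e. every string of the form ww
# that occurs as a substring of s.

def distinct_squares(s):
    found = set()
    n = len(s)
    for l in range(1, n // 2 + 1):          # half-length of the square
        for i in range(n - 2 * l + 1):
            if s[i:i + l] == s[i + l:i + 2 * l]:
                found.add(s[i:i + 2 * l])
    return found
```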
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.22/LIPIcs.CPM.2017.22.pdf
tandem repeats
distinct squares
counting algorithms
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
23:1
23:12
10.4230/LIPIcs.CPM.2017.23
article
Palindromic Length in Linear Time
Borozdin, Kirill
Kosolobov, Dmitry
Rubinchik, Mikhail
Shur, Arseny M.
The palindromic length of a string is the minimum number of palindromes whose concatenation equals the string. The problem of computing the palindromic length has drawn some attention, and several O(n log n)-time online algorithms were recently designed for it. In this paper we present the first linear-time online algorithm for this problem.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.23/LIPIcs.CPM.2017.23.pdf
palindrome
palindromic length
palindromic factorization
online
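The quantity being computed can be illustrated with a simple quadratic dynamic program (a didactic sketch, far from the paper's linear-time online algorithm): pl[i] is the minimum over all j < i with s[j:i] a palindrome of pl[j] + 1.

```python
def palindromic_length(s):
    """Minimum number of palindromes whose concatenation equals s,
    via an O(n^2) dynamic program over palindromic suffixes."""
    n = len(s)
    # is_pal[j][i] == True iff s[j:i] is a palindrome
    is_pal = [[False] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for j in range(n - length + 1):
            i = j + length
            if length == 1:
                is_pal[j][i] = True
            elif s[j] == s[i - 1]:
                is_pal[j][i] = length == 2 or is_pal[j + 1][i - 1]
    pl = [0] + [n] * n                   # pl[i]: answer for prefix s[:i]
    for i in range(1, n + 1):
        for j in range(i):
            if is_pal[j][i]:
                pl[i] = min(pl[i], pl[j] + 1)
    return pl[n]
```

For example, "abaab" has palindromic length 2 ("a" + "baab").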
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
24:1
24:11
10.4230/LIPIcs.CPM.2017.24
article
Tight Bounds on the Maximum Number of Shortest Unique Substrings
Mieno, Takuya
Inenaga, Shunsuke
Bannai, Hideo
Takeda, Masayuki
A substring Q of a string S is called a shortest unique substring (SUS) for an interval [s,t] in S if Q occurs exactly once in S, this occurrence of Q contains the interval [s,t], and every substring of S that contains [s,t] and is shorter than Q occurs at least twice in S. The SUS problem is, given a string S, to preprocess S so that for any subsequent query interval [s,t], all the SUSs for [s,t] can be reported quickly. When s = t, we call the SUSs for [s,t] point SUSs, and when s <= t, we call them interval SUSs. There exists an optimal O(n)-time preprocessing scheme that answers queries in optimal O(k) time for both point and interval SUSs, where n is the length of S and k is the number of outputs for a given query. In this paper, we reveal structural, combinatorial properties underlying the SUS problem: namely, we show that the number of intervals in S that correspond to point SUSs over all query positions in S is less than 1.5n, and that this bound is tight. We also consider the maximum number of intervals in S that correspond to interval SUSs over all query intervals in S.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.24/LIPIcs.CPM.2017.24.pdf
shortest unique substrings
maximal unique substrings
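The definition of a point SUS can be illustrated with a brute-force sketch (an illustrative helper, not from the paper) that returns, for a position p, all intervals of shortest substrings covering p that occur exactly once in s:

```python
def point_sus(s, p):
    """All shortest unique substrings of s covering position p (0-based),
    returned as half-open intervals (i, j), found by brute force."""
    def occ(w):  # number of (possibly overlapping) occurrences of w in s
        return sum(s.startswith(w, k) for k in range(len(s) - len(w) + 1))
    n = len(s)
    cands = [(i, j) for i in range(p + 1) for j in range(p + 1, n + 1)
             if occ(s[i:j]) == 1]            # unique substrings covering p
    shortest = min(j - i for i, j in cands)  # s itself is always unique
    return [(i, j) for i, j in cands if j - i == shortest]
```

Note the overlap-aware occurrence counter: Python's built-in `str.count` counts non-overlapping occurrences only, which would wrongly declare "aa" unique in "aaa".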
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
25:1
25:15
10.4230/LIPIcs.CPM.2017.25
article
Can We Recover the Cover?
Amir, Amihood
Levy, Avivit
Lewenstein, Moshe
Lubin, Ronit
Porat, Benny
Data analysis typically involves error recovery and detection of regularities as two different key tasks. In this paper we show that there are data types for which these two tasks can be powerfully combined. A common notion of regularity in strings is that of a cover. Data describing measures of a natural coverable phenomenon may be corrupted by errors caused by the measurement process, or by the inexact features of the phenomenon itself. For this reason, different variants of approximate covers have been introduced, some of which are NP-hard to compute. In this paper we assume that the Hamming distance metric measures the amount of corruption, and study the problem of recovering the correct cover from data corrupted by mismatch errors, formally defined as the cover recovery problem (CRP). We show that for the Hamming distance metric, coverability is a powerful property that, under suitable conditions, allows detecting the original cover and correcting the data.
We also study a relaxation of another problem, the approximate cover problem (ACP). Since the ACP was proved to be NP-hard [Amir, Levy, Lubin, Porat, CPM 2017], we study a relaxation, which we call the candidate relaxation of the ACP, and show that it has polynomial time complexity. As a result, the ACP also has polynomial time complexity in many practical situations. As an important application of this relaxation, we also obtain a polynomial-time algorithm for the cover recovery problem (CRP).
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.25/LIPIcs.CPM.2017.25.pdf
periodicity
quasi-periodicity
cover
approximate cover
data recovery
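The notion of a cover underlying this paper can be illustrated with a small sketch (an illustrative helper, not the recovery algorithm itself): a string c covers t exactly when consecutive occurrences of c in t overlap or touch, leaving no position uncovered.

```python
def is_cover(c, t):
    """True iff every position of t lies inside some exact occurrence
    of c, i.e. c is a cover (quasi-period) of t."""
    m, covered = len(c), 0          # covered = number of covered positions
    for k in range(len(t) - m + 1):
        if t.startswith(c, k):
            if k > covered:         # a gap the remaining occurrences
                return False        # can never fill
            covered = k + m
    return covered == len(t)
```

For example, "aba" covers "ababa" (occurrences at 0 and 2 overlap), while "ab" does not (the final "a" is left uncovered).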
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
26:1
26:14
10.4230/LIPIcs.CPM.2017.26
article
Approximate Cover of Strings
Amir, Amihood
Levy, Avivit
Lubin, Ronit
Porat, Ely
Regularities in strings arise in various areas of science, including coding and automata theory, formal language theory, combinatorics, molecular biology and many others. A common notion to describe regularity in a string T is a cover, which is a string C for which every letter of T lies within some occurrence of C. The alignment of the cover repetitions in the given text is called a tiling. In many applications finding exact repetitions is not sufficient, due to the presence of errors. In this paper, we use a new approach for handling errors in coverable phenomena and define the approximate cover problem (ACP), in which we are given a text that is a sequence of some cover repetitions with possible mismatch errors, and we seek a string that covers the text with the minimum number of errors. We first show that the ACP is NP-hard, by studying the cover-size relaxation of the ACP, in which the requested size of the approximate cover is also given with the input string. We show this relaxation is already NP-hard. We also study two further relaxations of the ACP, which we call the partial-tiling relaxation of the ACP and the full-tiling relaxation of the ACP, in which a tiling of the requested cover is also given with the input string. A given full tiling retains all the occurrences of the cover before the errors, while in a partial tiling there can be additional occurrences of the cover that are not marked by the tiling. We show that the partial-tiling relaxation has a polynomial time complexity and give experimental evidence that the full-tiling relaxation also has polynomial time complexity. The study of these relaxations, besides shedding further light on the complexity of the ACP, also involves a deep understanding of the properties of covers, yielding some key lemmas and observations that may be helpful for a future study of regularities in the presence of errors.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.26/LIPIcs.CPM.2017.26.pdf
periodicity
quasi-periodicity
cover
approximate cover
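The full-tiling relaxation has a natural column-wise reading, sketched below under the simplifying assumption that the given occurrences do not overlap (names are illustrative, not from the paper): since Hamming mismatches decompose independently over cover positions, the error-minimizing cover takes, at each of its m positions, the most frequent letter among the text letters aligned there by the tiling.

```python
from collections import Counter

def best_cover_for_tiling(t, tiling, m):
    """Given the starting positions (tiling) of an unknown length-m cover
    in text t, return the cover minimizing the total number of mismatches,
    together with that error count: per column, pick the majority letter."""
    cover, errors = [], 0
    for off in range(m):
        col = Counter(t[k + off] for k in tiling)
        letter, freq = col.most_common(1)[0]
        cover.append(letter)
        errors += sum(col.values()) - freq   # minority letters = mismatches
    return "".join(cover), errors
```

For example, if the text is "abaabaabc" with tiling positions 0, 3, 6 and m = 3, the majority vote yields cover "aba" with a single mismatch (the final "c").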
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
27:1
27:17
10.4230/LIPIcs.CPM.2017.27
article
Beyond Adjacency Maximization: Scaffold Filling for New String Distances
Bulteau, Laurent
Fertin, Guillaume
Komusiewicz, Christian
In Genomic Scaffold Filling, one aims at polishing in silico a draft genome, called a scaffold. The scaffold is given in the form of an ordered set of gene sequences, called contigs. This is done by confronting the scaffold with an already complete reference genome from a close species. More precisely, given a scaffold S, a reference genome G and a score function f() between two genomes, the aim is to complete S by adding the missing genes from G so that the obtained complete genome S* optimizes f(S*, G). In this paper, we extend a model of Jiang et al. [CPM 2016] (i) by allowing the insertion of strings instead of single characters (i.e., some groups of genes may be forced to be inserted together) and (ii) by considering two alternative score functions: the first generalizes the notion of common adjacencies by maximizing the number of common k-mers between S* and G (k-Mer Scaffold Filling), while the second aims at minimizing the number of breakpoints between S* and G (Min-Breakpoint Scaffold Filling). We study these problems from the parameterized complexity point of view, providing fixed-parameter (FPT) algorithms for both problems. In particular, we show that k-Mer Scaffold Filling is FPT with respect to the parameter l, the number of additional k-mers realized by the completion of S; this answers an open question of Jiang et al. [CPM 2016]. We also show that Min-Breakpoint Scaffold Filling is FPT with respect to a parameter combining the number of missing genes, the number of gene repetitions and the target distance.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.27/LIPIcs.CPM.2017.27.pdf
computational biology
strings
FPT algorithms
kernelization
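The score maximized by k-Mer Scaffold Filling can be illustrated with a short sketch (an illustrative helper, not from the paper) that counts k-mers common to two strings, with multiplicity; for k = 2 this is essentially the classical count of common adjacencies:

```python
from collections import Counter

def common_kmers(s, t, k):
    """Number of k-mers shared by s and t, counted with multiplicity:
    for each k-mer, the smaller of its two occurrence counts."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(min(cs[w], ct[w]) for w in cs)
```

For instance, "abcab" and "abab" share the 2-mer "ab" twice, and nothing else.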
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
28:1
28:18
10.4230/LIPIcs.CPM.2017.28
article
On the Weighted Quartet Consensus Problem
Lafond, Manuel
Scornavacca, Celine
In phylogenetics, the consensus problem consists in summarizing a set of phylogenetic trees that all classify the same set of species into a single tree. Several definitions of consensus exist in the literature; in this paper we focus on the Weighted Quartet Consensus problem, whose complexity status was previously unknown. Here we prove that the Weighted Quartet Consensus problem is NP-hard and we give a 1/2-factor approximation for it. Along the way, we propose a derandomization of a previously known randomized 1/3-factor approximation. We also investigate the fixed-parameter tractability of this problem.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.28/LIPIcs.CPM.2017.28.pdf
phylogenetic tree
consensus tree
quartets
complexity
fixed-parameter tractability
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
29:1
29:12
10.4230/LIPIcs.CPM.2017.29
article
Optimal Omnitig Listing for Safe and Complete Contig Assembly
Cairo, Massimo
Medvedev, Paul
Obscura Acosta, Nidia
Rizzi, Romeo
Tomescu, Alexandru I.
Genome assembly is the problem of reconstructing a genome sequence from a set of reads from a sequencing experiment. Typical formulations of the assembly problem admit in practice many genomic reconstructions, and actual genome assemblers usually output contigs, namely substrings that are promised to occur in the genome. To bridge theory and practice, Tomescu and Medvedev [RECOMB 2016] reformulated contig assembly as finding all substrings common to all genomic reconstructions. They also gave a characterization of those walks (omnitigs) that are common to all closed edge-covering walks of a (directed) graph, a typical notion of genomic reconstruction, and proposed an algorithm for listing all maximal omnitigs by launching an exhaustive visit from every edge.
In this paper, we prove new insights about the structure of omnitigs and solve several open questions about them. We combine these to achieve an O(nm)-time algorithm for outputting all the maximal omnitigs of a graph (with n nodes and m edges). This is also optimal, as we show families of graphs whose total omnitig length is Omega(nm). We implement this algorithm and show that it is 9-12 times faster in practice than the one of Tomescu and Medvedev [RECOMB 2016].
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.29/LIPIcs.CPM.2017.29.pdf
genome assembly
graph algorithm
edge-covering walk
strong bridge
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
30:1
30:14
10.4230/LIPIcs.CPM.2017.30
article
Dynamic Elias-Fano Representation
Pibiri, Giulio Ermanno
Venturini, Rossano
We show that it is possible to store a dynamic ordered set S of n integers drawn from a bounded universe of size u in space close to the information-theoretic lower bound and preserve, at the same time, the asymptotic time optimality of the operations. Our results leverage the Elias-Fano representation of monotone integer sequences, which can be shown to be less than half a bit per element away from the information-theoretic minimum.
In particular, considering a RAM model with memory word size Theta(log u) bits, when integers are drawn from a polynomial universe of size u = n^gamma for any gamma = Theta(1), we add o(n) bits to the static Elias-Fano representation in order to:
1. support static predecessor/successor queries in O(min{1+log(u/n), loglog n}) time;
2. make S grow in an append-only fashion by spending O(1) time per inserted element;
3. describe a dynamic data structure supporting random access in O(log n / loglog n) worst-case time, insertions/deletions in O(log n / loglog n) amortized time, and predecessor/successor queries in O(min{1+log(u/n), loglog n}) worst-case time. These time bounds are optimal.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.30/LIPIcs.CPM.2017.30.pdf
succinct data structures
integer sets
predecessor problem
Elias-Fano
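The static Elias-Fano representation on which the paper builds can be sketched as follows (a didactic list-based sketch, not a succinct implementation): each value is split into l = floor(log2(u/n)) low bits, stored verbatim, and a high part, stored as unary gaps in a bit vector of n + u/2^l bits; access then reduces to a select on the high bit vector.

```python
import math

def ef_encode(seq, u):
    """Elias-Fano encoding of a monotone sequence seq over [0, u):
    low bits stored verbatim, high parts unary-encoded in a bit vector."""
    n = len(seq)
    l = max(0, int(math.log2(u / n)))       # low-bit width
    low = [x & ((1 << l) - 1) for x in seq]
    high = [0] * (n + (u >> l))
    for i, x in enumerate(seq):
        high[(x >> l) + i] = 1              # i-th one at (high part) + i
    return l, low, high

def ef_access(l, low, high, i):
    """Return the i-th element: select the i-th one in the high bit
    vector (linear scan here; succinct structures do it in O(1))."""
    ones = -1
    for pos, bit in enumerate(high):
        ones += bit
        if ones == i:
            return ((pos - i) << l) | low[i]
```

With u = 32 and n = 8 elements, this uses l = 2 low bits per element plus a 16-bit high vector, close to the information-theoretic minimum for the sequence.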
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2017-06-30
78
31:1
31:14
10.4230/LIPIcs.CPM.2017.31
article
Synergistic Solutions on MultiSets
Barbay, Jérémy
Ochoa, Carlos
Satti, Srinivasa Rao
Karp et al. (1988) described Deferred Data Structures for Multisets as "lazy" data structures that partially sort data in order to support online rank and select queries, performing the minimum amount of work in the worst case over instances of size n with a fixed number of queries q. Barbay et al. (2016) refined this approach to take advantage of the gaps between the positions hit by the queries (i.e., the structure in the queries). We develop new techniques that further refine this approach and take advantage, all at once, of the structure (i.e., the multiplicities of the elements), of some notions of local order (i.e., the number and sizes of runs) and global order (i.e., the number and positions of existing pivots) in the input, and of the structure and order in the sequence of queries. Our main result is a synergistic deferred data structure that outperforms all solutions in the comparison model taking advantage of only a subset of these features. As intermediate results, we describe two new synergistic sorting algorithms, which take advantage of some notions of structure and order (local and global) in the input, improving upon previous results that take advantage only of the structure (Munro and Spira 1979) or of the local order (Takaoka 1997); and a new multiselection algorithm that takes advantage not only of the order and structure in the input, but also of the structure in the queries.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol078-cpm2017/LIPIcs.CPM.2017.31/LIPIcs.CPM.2017.31.pdf
deferred data structure
multivariate analysis
quick sort
select
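The "lazy" idea of Karp et al. can be sketched in a few lines (an illustrative toy, without any of the paper's synergistic refinements): each select(r) query runs a quickselect that records its pivots, so later queries only recurse inside the unsorted segment between the two recorded pivots surrounding their rank.

```python
import bisect, random

class DeferredMultiset:
    """Deferred select structure: the array is sorted only as much as the
    queries require, and pivot positions are memoized across queries."""
    def __init__(self, data):
        self.a = list(data)
        self.pivots = [-1, len(data)]       # sentinel "sorted" ranks

    def select(self, r):                    # r-th smallest, 0-based
        idx = bisect.bisect_right(self.pivots, r)
        lo, hi = self.pivots[idx - 1] + 1, self.pivots[idx] - 1
        while lo < hi:
            p = self._partition(lo, hi)     # pivot lands at its final rank
            bisect.insort(self.pivots, p)   # remember it for later queries
            if p == r:
                break
            elif p < r:
                lo = p + 1
            else:
                hi = p - 1
        return self.a[r]

    def _partition(self, lo, hi):           # Lomuto with random pivot
        k = random.randint(lo, hi)
        self.a[k], self.a[hi] = self.a[hi], self.a[k]
        x, i = self.a[hi], lo
        for j in range(lo, hi):
            if self.a[j] < x:
                self.a[i], self.a[j] = self.a[j], self.a[i]
                i += 1
        self.a[i], self.a[hi] = self.a[hi], self.a[i]
        return i
```

A query whose rank falls between two previously recorded pivots touches only that segment, which is how gaps between query positions translate into savings.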