eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
0
0
10.4230/LIPIcs.CPM.2019
article
LIPIcs, Volume 128, CPM'19, Complete Volume
Pisanti, Nadia
1
https://orcid.org/0000-0003-3915-7665
Pissis, Solon P.
2
https://orcid.org/0000-0002-1445-1932
University of Pisa, Italy
CWI Amsterdam, the Netherlands
LIPIcs, Volume 128, CPM'19, Complete Volume
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019/LIPIcs.CPM.2019.pdf
Mathematics of computing, Discrete mathematics, Applied computing, Computational biology, Information theory, Information systems
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
0:i
0:xviii
10.4230/LIPIcs.CPM.2019.0
article
Front Matter, Table of Contents, Preface, Conference Organization
Pisanti, Nadia
1
https://orcid.org/0000-0003-3915-7665
Pissis, Solon P.
2
https://orcid.org/0000-0002-1445-1932
University of Pisa, Italy
CWI Amsterdam, the Netherlands
Front Matter, Table of Contents, Preface, Conference Organization
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.0/LIPIcs.CPM.2019.0.pdf
Front Matter
Table of Contents
Preface
Conference Organization
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
1:1
1:1
10.4230/LIPIcs.CPM.2019.1
article
How to Exploit Periodicity (Invited Talk)
Gawrychowski, Paweł
1
Institute of Computer Science, University of Wrocław, Poland
Periodicity is a fundamental combinatorial property of strings. We say that p is a period of a string s[1..n] when s[i]=s[i+p] for every i such that both s[i] and s[i+p] are defined. While this notion is interesting on its own, it can often be used as a tool for designing efficient algorithms. At a high level, such algorithms often operate differently depending on whether a given string does or does not have a small period, where small usually means smaller than half of its length (or, say, a quarter). In other words, we design one algorithm that is efficient if the given string is repetitive and another that is efficient if the given string is non-repetitive, in each case carefully exploiting either the periodicity or the fact that the input looks sufficiently “random”, and then choose the appropriate algorithm depending on the input. Of course, in some cases one needs to proceed in a more complex manner, for example by classifying not the whole string but its substrings, processing each of them differently depending on its structure.
I will survey results, mostly connected to different versions of pattern matching, that are based on this paradigm. This will include the recent generalization of periodicity that can be applied in approximate pattern matching, and some examples of how the notion of periodicity can be applied to design a better data structure.
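Not part of the talk itself, but a minimal Python sketch of the period definition above, making the repetitive/non-repetitive dichotomy concrete:

```python
def is_period(s: str, p: int) -> bool:
    """p is a period of s when s[i] == s[i+p] wherever both sides are defined."""
    return all(s[i] == s[i + p] for i in range(len(s) - p))

def smallest_period(s: str) -> int:
    """Smallest p >= 1 that is a period of s (p = len(s) always qualifies)."""
    return next(p for p in range(1, len(s) + 1) if is_period(s, p))

def is_repetitive(s: str) -> bool:
    """'Repetitive' in the sense above: smallest period at most half the length."""
    return smallest_period(s) * 2 <= len(s)
```

For example, "abaabaaba" has smallest period 3 and so counts as repetitive, while "abcde" has no period shorter than its own length.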
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.1/LIPIcs.CPM.2019.1.pdf
periodicity
pattern matching
Hamming distance
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
2:1
2:14
10.4230/LIPIcs.CPM.2019.2
article
Some Variations on Lyndon Words (Invited Talk)
Dolce, Francesco
1
Restivo, Antonio
2
Reutenauer, Christophe
3
IRIF, Université Paris Diderot, France
Dipartimento di Matematica e Informatica, Università degli Studi di Palermo, Italy
LaCIM, Université du Québec À Montréal, Canada
In this paper we compare two finite words u and v by the lexicographical order of the infinite words u^omega and v^omega. Informally, we say that we compare u and v by the infinite order. We show several properties of Lyndon words expressed using this infinite order. The innovative aspect of this approach is that it allows one to take into account non-trivial conditions also on the prefixes of a word, instead of only on its suffixes. In particular, we derive a result of Ufnarovskij [V. Ufnarovskij, Combinatorial and asymptotic methods in algebra, 1995] that characterizes a Lyndon word as a word which is greater, with respect to the infinite order, than all of its prefixes. Motivated by this result, we introduce the prefix standard permutation of a Lyndon word and the corresponding (left) Cartesian tree. We prove that the left Cartesian tree is equal to the left Lyndon tree, defined by the left standard factorization of Viennot [G. Viennot, Algèbres de Lie libres et monoïdes libres, 1978]. This result is dual to a theorem of Hohlweg and Reutenauer [C. Hohlweg and C. Reutenauer, Lyndon words, permutations and trees, 2003].
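A small Python sketch (not from the paper) of the infinite order and of the prefix characterization described above; the bound len(u)+len(v) on how many symbols must be compared is a standard Fine-and-Wilf-style fact assumed here:

```python
def infinite_cmp(u: str, v: str) -> int:
    """Compare u^omega and v^omega lexicographically (-1, 0, or 1).
    If the two infinite powers agree on a prefix of length len(u) + len(v),
    they are equal (a Fine-and-Wilf argument), so that many symbols suffice."""
    n = len(u) + len(v)
    pu = (u * (n // len(u) + 1))[:n]
    pv = (v * (n // len(v) + 1))[:n]
    return (pu > pv) - (pu < pv)

def is_lyndon(w: str) -> bool:
    """Characterization discussed above: w is Lyndon iff it is greater,
    in the infinite order, than all of its proper prefixes."""
    return all(infinite_cmp(w[:i], w) < 0 for i in range( 1, len(w)))
```

For instance, "aabab" passes the prefix test, while "aba" fails it at the prefix "ab", since (ab)^omega exceeds (aba)^omega.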
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.2/LIPIcs.CPM.2019.2.pdf
Lyndon words
Infinite words
Left Lyndon trees
Left Cartesian trees
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
3:1
3:1
10.4230/LIPIcs.CPM.2019.3
article
Stringology Combats Microbiological Threats (Invited Talk)
Ziv-Ukelson, Michal
1
Ben Gurion University of the Negev, Israel
A major concern worldwide is the acquisition of antibiotic resistance by pathogenic bacteria. Genomic elements carrying resistance and virulence function can be acquired through horizontal gene transfer, yielding a broad spread of evolutionary successful elements, both within and in between species, with devastating effect. Recent advances in pyrosequencing techniques, combined with global efforts to study microbial adaptation to a wide range of ecological niches (and in particular to life in host tissues that we perceive as pathogenesis), yield huge and rapidly-growing databases of microbial genomes.
This big new data statistically empowers genomic-context based approaches to functional analysis: the idea is that groups of genes that are clustered locally together across many genomes usually express protein products that interact in the same biological pathway, and thus the function of a new, uncharacterized gene can be deciphered based on the previously characterized genes that are co-localized with it in the same gene cluster. Identifying and interpreting microbial gene context in huge genomic data requires efficient string-based data mining algorithms. Additionally, new computational challenges are raised by the need to study the grammar and evolutionary spreading patterns of microbial gene context.
In this talk, we will review some classical combinatorial pattern matching and data mining problems previously inspired by this application domain. We will re-examine the biological assumptions behind the previously proposed models in light of some new biological observations. We will consider the computational challenges arising in accommodating the new biological observations, and in exploiting them to scale up the algorithmic solutions to the huge new data. Our goal is to inspire interesting new problems that harness Stringology for the study of microbial adaptation and for the fight against microbiological threats.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.3/LIPIcs.CPM.2019.3.pdf
comparative genomics
syntenic blocks
gene clusters
reconciliation of gene and species trees
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
4:1
4:12
10.4230/LIPIcs.CPM.2019.4
article
Optimal Rank and Select Queries on Dictionary-Compressed Text
Prezza, Nicola
1
https://orcid.org/0000-0003-3553-4953
Department of Computer Science, University of Pisa, Italy
We study the problem of supporting queries on a string S of length n within a space bounded by the size gamma of a string attractor for S. In the paper introducing string attractors it was shown that random access on S can be supported in optimal O(log(n/gamma)/log log n) time within O(gamma polylog n) space. In this paper, we extend this result to rank and select queries and provide lower bounds matching our upper bounds on alphabets of polylogarithmic size. Our solutions are given in the form of a space-time trade-off that is more general than the one previously known for grammars and that improves existing bounds on LZ77-compressed text by a log log n time-factor in select queries. We also provide matching lower and upper bounds for partial sum and predecessor queries within attractor-bounded space, and extend our lower bounds to encompass navigation of dictionary-compressed tree representations.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.4/LIPIcs.CPM.2019.4.pdf
Rank
Select
Dictionary compression
String Attractors
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
5:1
5:13
10.4230/LIPIcs.CPM.2019.5
article
A 2-Approximation Algorithm for the Complementary Maximal Strip Recovery Problem
Jiang, Haitao
1
Guo, Jiong
1
Zhu, Daming
1
Zhu, Binhai
2
Department of Computer Science and Technology, Shandong University, China
Gianforte School of Computing, Montana State University, Bozeman, MT 59717, USA
The Maximal Strip Recovery problem (MSR) and its complement (CMSR) are well-studied NP-hard problems in computational genomics. The input to these dual problems consists of two signed permutations. The goal is to delete some gene markers from both permutations such that, in the remaining permutations, each gene marker has at least one common neighbor. Equivalently, the resulting permutations can be partitioned into common strips of length at least two. MSR then maximizes the number of remaining genes, while the objective of CMSR is to delete the minimum number of gene markers. In this paper, we present a new approximation algorithm for the Complementary Maximal Strip Recovery (CMSR) problem. Our approximation factor is 2, improving the previously best 7/3-approximation algorithm. Although the improvement in the factor is not huge, the analysis is greatly simplified by a compensating method, commonly referred to as the non-oblivious local search technique. In such a method, a substitution may not always increase the value of the current solution (it may sometimes even decrease it), but it always improves the value of another function that is seemingly unrelated to the objective function.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.5/LIPIcs.CPM.2019.5.pdf
Maximal strip recovery
complementary maximal strip recovery
computational genomics
approximation algorithm
local search
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
6:1
6:12
10.4230/LIPIcs.CPM.2019.6
article
Sufficient Conditions for Efficient Indexing Under Different Matchings
Amir, Amihood
1
Kondratovsky, Eitan
1
Department of Computer Science, Bar-Ilan University, Israel
The most important task arising from the massive accumulation of digital data in the world is efficient access to this data; hence the importance of indexing. In the last decade, many different types of matching relations were defined, each requiring an efficient indexing scheme. In a groundbreaking paper [Cole and Hariharan, SIAM J. Comput., 33(1):26–42, 2003], Cole and Hariharan formulated sufficient conditions for building an efficient index for quasi-suffix collections, i.e., collections that behave as suffixes. It was shown that known matchings, including parameterized, 2-D array and order-preserving matchings, fit their indexing settings. In this paper, we formulate more basic sufficient conditions based on the order relation derived from the matching relation itself; our conditions are more general than the previously known ones.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.6/LIPIcs.CPM.2019.6.pdf
off-the-shelf indexing algorithms
general matching relations
weaker sufficient conditions for indexing
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
7:1
7:18
10.4230/LIPIcs.CPM.2019.7
article
Space-Efficient Computation of the LCP Array from the Burrows-Wheeler Transform
Prezza, Nicola
1
https://orcid.org/0000-0003-3553-4953
Rosone, Giovanna
1
https://orcid.org/0000-0001-5075-1214
Department of Computer Science, University of Pisa, Italy
We show that the Longest Common Prefix Array of a text collection of total size n on alphabet [1,sigma] can be computed from the Burrows-Wheeler transformed collection in O(n log sigma) time using o(n log sigma) bits of working space on top of the input and output. Our result improves (on small alphabets) and generalizes (to string collections) the previous solution from Beller et al., which required O(n) bits of extra working space. We also show how to merge the BWTs of two collections of total size n within the same time and space bounds. The procedure at the core of our algorithms can be used to enumerate suffix tree intervals in succinct space from the BWT, which is of independent interest. An engineered implementation of our first algorithm on DNA alphabet induces the LCP of a large (16 GiB) collection of short (100 bases) reads at a rate of 2.92 megabases per second using in total 1.5 Bytes per base in RAM. Our second algorithm merges the BWTs of two short-reads collections of 8 GiB each at a rate of 1.7 megabases per second and uses 0.625 Bytes per base in RAM. An extension of this algorithm that computes also the LCP array of the merged collection processes the data at a rate of 1.48 megabases per second and uses 1.625 Bytes per base in RAM.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.7/LIPIcs.CPM.2019.7.pdf
Burrows-Wheeler Transform
LCP array
DNA reads
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
8:1
8:16
10.4230/LIPIcs.CPM.2019.8
article
Safe and Complete Algorithms for Dynamic Programming Problems, with an Application to RNA Folding
Kiirala, Niko
1
Salmela, Leena
1
https://orcid.org/0000-0002-0756-543X
Tomescu, Alexandru I.
1
https://orcid.org/0000-0002-5747-8350
Department of Computer Science and Helsinki Institute for Information Technology HIIT, University of Helsinki, Finland
Many bioinformatics problems admit a large number of solutions, with no way of distinguishing the correct one among them. One approach to coping with this issue is to look at the partial solutions common to all solutions. Such partial solutions have been called safe, and an algorithm outputting all safe solutions has been called safe and complete. In this paper we develop a general technique that automatically provides a safe and complete algorithm for problems solvable by dynamic programming. We illustrate it by applying it to the bioinformatics problem of RNA folding, assuming the simplistic folding model maximizing the number of paired bases. Our safe and complete algorithm has time complexity O(n^3 M(n)) and space complexity O(n^3), where n is the length of the RNA sequence and M(n) in Omega(n) is the time complexity of arithmetic operations on O(n)-bit integers. We also implement this algorithm and show that, despite an exponential number of optimal solutions, our algorithm is efficient in practice.
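As a concrete reference point (not the paper's safe-and-complete algorithm), the simplistic model mentioned above can be solved by the classic Nussinov-style dynamic program; the allowed base-pair set below is an assumption:

```python
def max_pairs(rna: str) -> int:
    """Nussinov-style cubic DP maximizing the number of paired bases.
    dp[i][j] = max pairs in rna[i..j]; base i is either left unpaired or
    paired with some k in (i, j], splitting the interval in two."""
    pairs = {("A", "U"), ("U", "A"), ("C", "G"),
             ("G", "C"), ("G", "U"), ("U", "G")}  # assumed base-pair set
    n = len(rna)
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]  # leave position i unpaired
            for k in range(i + 1, j + 1):  # try pairing i with k
                if (rna[i], rna[k]) in pairs:
                    left = dp[i + 1][k - 1] if k > i + 1 else 0
                    right = dp[k + 1][j] if k < j else 0
                    best = max(best, 1 + left + right)
            dp[i][j] = best
    return dp[0][n - 1] if n else 0
```

The optimum is typically attained by many distinct foldings, which is exactly the ambiguity the paper's safe partial solutions address.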
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.8/LIPIcs.CPM.2019.8.pdf
RNA secondary structure
RNA folding
Safe solution
Safe and complete algorithm
Counting problem
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
9:1
9:12
10.4230/LIPIcs.CPM.2019.9
article
Conversion from RLBWT to LZ77
Nishimoto, Takaaki
1
Tabei, Yasuo
1
RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
Converting one compressed format of a string into another without explicit decompression is one of the central research topics in string processing. We discuss the problem of converting the run-length Burrows-Wheeler Transform (RLBWT) of a string into the Lempel-Ziv 77 (LZ77) phrases of the reversed string. The first result, Policriti and Prezza's conversion algorithm [Algorithmica 2018], runs in O(n log r) time and O(r) working space, where n is the length of the string, r the number of runs in the RLBWT, and z the number of LZ77 phrases. A recent result, Kempa's conversion algorithm [SODA 2019], runs in O(n / log n + r log^{9} n + z log^{9} n) time and O(n / log_{sigma} n + r log^{8} n) working space, where sigma is the alphabet size of the RLBWT. In this paper, we present a new conversion algorithm that improves Policriti and Prezza's conversion algorithm, which uses general-purpose dynamic data structures. We argue that these dynamic data structures can be replaced, and present new data structures for faster conversion. The time and working space of our conversion algorithm with the new data structures are O(n min{log log n, sqrt{(log r)/(log log r)}}) and O(r), respectively.
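For illustration only, a naive quadratic Python sketch of the (self-referential) LZ77 parsing that the conversion targets; the algorithms above of course never scan the plain string like this:

```python
def lz77_phrases(s: str):
    """Naive self-referential LZ77 parsing: each phrase is the longest prefix
    of the unparsed rest that also occurs starting at an earlier position
    (the occurrence may overlap the phrase), extended by one fresh character."""
    i, phrases = 0, []
    while i < len(s):
        l = 0
        # s.find gives the leftmost occurrence; < i means an earlier one exists
        while i + l < len(s) and s.find(s[i:i + l + 1]) < i:
            l += 1
        phrases.append(s[i:i + l + 1])  # slice clamps at the end of s
        i += l + 1
    return phrases
```

For example, "abababa" parses into the z = 3 phrases "a", "b", "ababa", the last one overlapping its own earlier occurrence.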
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.9/LIPIcs.CPM.2019.9.pdf
Burrows-Wheeler Transform
Lempel-Ziv Parsing
Lossless Data Compression
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
10:1
10:15
10.4230/LIPIcs.CPM.2019.10
article
Fully-Functional Bidirectional Burrows-Wheeler Indexes and Infinite-Order De Bruijn Graphs
Belazzougui, Djamal
1
Cunial, Fabio
2
3
CAPA, DTISI, Centre de Recherche sur l'Information Scientifique et Technique, Algiers, Algeria
Max Planck Institute for Molecular Cell Biology and Genetics (MPI-CBG), Dresden, Germany
Center for Systems Biology Dresden (CSBD), Dresden, Germany
Given a string T on an alphabet of size sigma, we describe a bidirectional Burrows-Wheeler index that takes O(|T| log sigma) bits of space, and that supports the addition and removal of one character, on the left or right side of any substring of T, in constant time. Previously known data structures that used the same space allowed constant-time addition to any substring of T, but they could support removal only from specific substrings of T. We also describe an index that supports bidirectional addition and removal in O(log log |T|) time, and that takes a number of words proportional to the number of left and right extensions of the maximal repeats of T. We use such fully-functional indexes to implement bidirectional, frequency-aware, variable-order de Bruijn graphs with no upper bound on their order, and supporting natural criteria for increasing and decreasing the order during traversal.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.10/LIPIcs.CPM.2019.10.pdf
BWT
suffix tree
CDAWG
de Bruijn graph
maximal repeat
string depth
contraction
bidirectional index
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
11:1
11:18
10.4230/LIPIcs.CPM.2019.11
article
Entropy Lower Bounds for Dictionary Compression
Gańczorz, Michał
1
Institute of Computer Science, University of Wrocław, Poland
We show that a wide class of dictionary compression methods (including LZ77, LZ78, grammar compressors, as well as parsing-based structures) require |S|H_k(S) + Omega(|S|k log sigma/log_sigma |S|) bits to encode their output. This matches known upper bounds and improves the information-theoretic lower bound of |S|H_k(S). To this end, we abstract the crucial properties of parsings created by those methods, construct a certain family of strings and analyze the parsings of those strings. We also show that for k = alpha log_sigma |S|, where 0 < alpha < 1 is a constant, the aforementioned methods produce an output of size at least 1/(1-alpha)|S|H_k(S) bits. Thus our results separate dictionary compressors from context-based ones (such as PPM) and BWT-based ones, as those include methods achieving |S|H_k(S) + O(sigma^k log sigma) bits, i.e. the redundancy depends on k and sigma but not on |S|.
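For reference, a small Python sketch (not from the paper) of the k-th order empirical entropy H_k in which the bounds above are stated:

```python
from collections import Counter
from math import log2

def h0(w) -> float:
    """Zeroth-order empirical entropy of a sequence, in bits per symbol."""
    n = len(w)
    return sum(-c / n * log2(c / n) for c in Counter(w).values())

def hk(s: str, k: int) -> float:
    """k-th order empirical entropy: |S| H_k(S) = sum over k-contexts w of
    |w_S| H_0(w_S), where w_S lists the symbols following occurrences of w."""
    if k == 0:
        return h0(s)
    ctx = {}
    for i in range(k, len(s)):
        ctx.setdefault(s[i - k:i], []).append(s[i])
    return sum(len(f) * h0(f) for f in ctx.values()) / len(s)
```

On "abababab", for instance, H_0 is one bit per symbol while H_1 drops to zero, since each single-character context determines its successor.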
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.11/LIPIcs.CPM.2019.11.pdf
compression
empirical entropy
parsing
lower bounds
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
12:1
12:12
10.4230/LIPIcs.CPM.2019.12
article
A New Class of Searchable and Provably Highly Compressible String Transformations
Giancarlo, Raffaele
1
https://orcid.org/0000-0002-6286-8871
Manzini, Giovanni
2
3
https://orcid.org/0000-0002-5047-0196
Rosone, Giovanna
4
https://orcid.org/0000-0001-5075-1214
Sciortino, Marinella
1
https://orcid.org/0000-0001-6928-0168
University of Palermo, Dipartimento di Matematica e Informatica, Italy
University of Eastern Piedmont, Alessandria, Italy
IIT-CNR, Pisa, Italy
University of Pisa, Dipartimento di Informatica, Italy
The Burrows-Wheeler Transform is a string transformation that plays a fundamental role in the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation beyond the domain of strings. However, efforts to find non-trivial alternatives to the original, now 25-year-old, Burrows-Wheeler string transformation have met limited success. In this paper we breathe new life into this area by introducing a whole new family of transformations that have all the "myriad virtues" of the BWT: they can be computed and inverted in linear time, they produce provably highly compressible strings, and they support linear-time pattern search directly on the transformed string. This new family is a special case of a more general class of transformations based on context-adaptive alphabet orderings, a concept introduced here. This more general class also includes the Alternating BWT, another invertible string transform recently introduced in connection with a generalization of Lyndon words.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.12/LIPIcs.CPM.2019.12.pdf
Data Indexing and Compression
Burrows-Wheeler Transformation
Combinatorics on Words
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
13:1
13:14
10.4230/LIPIcs.CPM.2019.13
article
Compressed Multiple Pattern Matching
Kosolobov, Dmitry
1
Sivukhin, Nikita
2
University of Helsinki, Helsinki, Finland
Ural Federal University, Ekaterinburg, Russia
Given d strings over the alphabet {0,1,...,sigma-1}, the classical Aho-Corasick data structure allows us to find all occ occurrences of the strings in any text T in O(|T| + occ) time using O(m log m) bits of space, where m is the number of edges in the trie containing the strings. Fix any constant epsilon in (0, 2). We describe a compressed solution for the problem that, provided sigma <= m^delta for a constant delta < 1, works in O(|T| 1/epsilon log(1/epsilon) + occ) time, which is O(|T| + occ) since epsilon is constant, and occupies mH_k + 1.443 m + epsilon m + O(d log m/d) bits of space, for all 0 <= k <= max{0, alpha log_sigma m - 2} simultaneously, where alpha in (0,1) is an arbitrary constant and H_k is the kth-order empirical entropy of the trie. Hence, we reduce the 3.443m term in the space bounds of the previously best succinct solutions to (1.443 + epsilon)m, thus solving an open problem posed by Belazzougui. Further, we notice that L = log binom{sigma(m+1)}{m} - O(log(sigma m)) is a worst-case space lower bound for any solution of the problem and, for d = o(m) and constant epsilon, our approach achieves L + epsilon m bits of space. This gives evidence that, for d = o(m), the space of our data structure is theoretically optimal up to the epsilon m additive term, and that it is hardly possible to eliminate the term 1.443m. In addition, we refine the space analysis of previous works by proposing a more appropriate definition of H_k. We also simplify the construction for practice by adapting the fixed block compression boosting technique, then implement our data structure, and conduct a number of experiments showing that it is comparable to the state of the art in terms of time and superior in space.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.13/LIPIcs.CPM.2019.13.pdf
multiple pattern matching
compressed space
Aho--Corasick automaton
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
14:1
14:17
10.4230/LIPIcs.CPM.2019.14
article
Hamming Distance Completeness
Labib, Karim
1
Uznański, Przemysław
2
Wolleb-Graf, Daniel
3
Google Zürich, Switzerland
Institute of Computer Science, University of Wrocław, Poland
Department of Computer Science, ETH Zürich, Switzerland
We show, given a binary integer function diamond that is piecewise polynomial, that (+,diamond) vector products are equivalent under one-to-polylog reductions to the computation of the Hamming distance. Examples include the dominance and l_{2p+1} distances for constant p. Our results imply equivalence (up to polylog factors) between the complexity of computing All Pairs Hamming Distance, All Pairs l_{2p+1} Distance and Dominance Matrix Product, and equivalence between Hamming Distance Pattern Matching, l_{2p+1} Pattern Matching and Less-Than Pattern Matching. The resulting algorithms for l_{2p+1} Pattern Matching and All Pairs l_{2p+1} Distance, for 2p+1 = 3,5,7,..., are likely to be optimal, given the lack of progress in improving upper bounds for Hamming distance in the past 30 years. While reductions between selected pairs of products were presented in the past, our work is the first to generalize them to a general class of functions, showing that a wide class of "intermediate" complexity problems are in fact equivalent.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.14/LIPIcs.CPM.2019.14.pdf
fine grained complexity
approximate pattern matching
matrix products
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
15:1
15:13
10.4230/LIPIcs.CPM.2019.15
article
Approximating Approximate Pattern Matching
Studený, Jan
1
Uznański, Przemysław
2
Department of Computer Science, ETH Zürich, Switzerland
Institute of Computer Science, University of Wrocław, Poland
Given a text T of length n and a pattern P of length m, the approximate pattern matching problem asks for the computation of a particular distance function between P and every m-substring of T. We consider a (1 +/- epsilon) multiplicative approximation variant of this problem for the l_p distance function. In this paper, we describe two (1+epsilon)-approximate algorithms with a runtime of O~(n/epsilon) for all (constant) non-negative values of p. For constant p >= 1 we show a deterministic (1+epsilon)-approximation algorithm. Previously, such a runtime was known only for the case of the l_1 distance, by Gawrychowski and Uznański [ICALP 2018], and only with a randomized algorithm. For constant 0 <= p <= 1 we show a randomized algorithm for the l_p distance, thereby providing a smooth tradeoff between the algorithms of Kopelowitz and Porat [FOCS 2015, SOSA 2018] for Hamming distance (the case p=0) and of Gawrychowski and Uznański for the l_1 distance.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.15/LIPIcs.CPM.2019.15.pdf
Approximate Pattern Matching
l_p Distance
l_1 Distance
Hamming Distance
Approximation Algorithms
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
16:1
16:14
10.4230/LIPIcs.CPM.2019.16
article
Cartesian Tree Matching and Indexing
Park, Sung Gwan
1
Amir, Amihood
2
Landau, Gad M.
3
4
Park, Kunsoo
1
Seoul National University, Korea
Bar-Ilan University, Israel
University of Haifa, Israel
New York University, USA
We introduce a new form of matching, called Cartesian tree matching, in which two strings match if they have the same Cartesian tree. Based on Cartesian tree matching, we define single pattern matching for a text of length n and a pattern of length m, and multiple pattern matching for a text of length n and k patterns of total length m. We present an O(n+m) time algorithm for single pattern matching, and an O((n+m) log k) deterministic time or O(n+m) randomized time algorithm for multiple pattern matching. We also define an index data structure called the Cartesian suffix tree, and present an O(n) randomized time algorithm to build it. Our efficient algorithms for Cartesian tree matching use a representation of the Cartesian tree, called the parent-distance representation.
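The abstract names but does not define the parent-distance representation; a Python sketch under its standard definition (an assumption here: pd[i] is the distance back to the nearest previous position holding a value <= the current one, 0 if none) might look like:

```python
def parent_distance(s):
    """Parent-distance representation (assumed definition): pd[i] = i - j for
    the largest j < i with s[j] <= s[i], or 0 if there is no such j.  Two
    strings have the same Cartesian tree iff their representations coincide."""
    pd, stack = [], []  # stack keeps indices with non-decreasing values
    for i, x in enumerate(s):
        while stack and s[stack[-1]] > x:
            stack.pop()
        pd.append(i - stack[-1] if stack else 0)
        stack.append(i)
    return pd
```

For example, [3,1,2] and [5,2,4] share the representation [0,0,1] and hence have the same Cartesian tree, even though no character matches.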
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.16/LIPIcs.CPM.2019.16.pdf
Cartesian tree matching
Pattern matching
Indexing
Parent-distance representation
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
17:1
17:14
10.4230/LIPIcs.CPM.2019.17
article
Indexing the Bijective BWT
Bannai, Hideo
1
https://orcid.org/0000-0002-6856-5185
Kärkkäinen, Juha
2
Köppl, Dominik
3
https://orcid.org/0000-0002-8721-4444
Pia̧tkowski, Marcin
4
https://orcid.org/0000-0001-5636-9497
Department of Informatics, Kyushu University, Fukuoka, Japan
Helsinki Institute of Information Technology (HIIT), Finland
Department of Informatics, Kyushu University, Japan Society for Promotion of Science (JSPS)
Nicolaus Copernicus University, Toruń, Poland
The Burrows-Wheeler transform (BWT) is a permutation whose applications are prevalent in data compression and text indexing. The bijective BWT is a bijective variant of it that has not yet been studied for text indexing applications. We fill this gap by proposing a self-index built on the bijective BWT. The self-index applies the backward search technique of the FM-index to find a pattern P with O(|P| lg |P|) backward search steps.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.17/LIPIcs.CPM.2019.17.pdf
Burrows-Wheeler Transform
Lyndon words
Text Indexing
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
18:1
18:13
10.4230/LIPIcs.CPM.2019.18
article
On Maximal Repeats in Compressed Strings
Pape-Lange, Julian
1
https://orcid.org/0000-0001-6621-8369
Technische Universität Chemnitz, Straße der Nationen 62, 09111 Chemnitz, Germany
This paper presents and proves a new non-trivial upper bound on the number of maximal repeats of compressed strings. Using Theorem 1 of Raffinot’s article "On Maximal Repeats in Strings", this upper bound can be directly translated into an upper bound on the number of nodes in the Compacted Directed Acyclic Word Graphs of compressed strings.
More formally, this paper proves that the number of maximal repeats in a string with z (self-referential) LZ77-factors and without q-th powers is at most 3q(z+1)^3-2. Also, this paper proves that for 2000 <= z <= q this upper bound is tight up to a constant factor.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.18/LIPIcs.CPM.2019.18.pdf
Maximal repeats
Combinatorics on compressed strings
LZ77
Compact suffix automata
CDAWGs
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
19:1
19:19
10.4230/LIPIcs.CPM.2019.19
article
Dichotomic Selection on Words: A Probabilistic Analysis
Akhavi, Ali
1
Clément, Julien
1
Darthenay, Dimitri
1
Lhote, Loïck
1
Vallée, Brigitte
1
GREYC (Normandie Université, Unicaen, EnsiCaen, Cnrs), 14000, Caen, France
The paper studies the behaviour of selection algorithms based on dichotomy principles. On an input formed by an ordered list L and a searched element x not in L, they return the interval of the list L to which x belongs. We focus here on the case of words, where dichotomy principles lead to a selection algorithm designed by Crochemore, Hancart and Lecroq, which appears to be "quasi-optimal". We perform a probabilistic analysis of this algorithm that exhibits its quasi-optimality on average.
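The selection task can be pictured with a generic dichotomic (binary) search; the word-based algorithm of Crochemore, Hancart and Lecroq refines this by comparing words symbol by symbol, which this plain sketch does not model:

```python
import bisect

def locate(L, x):
    """Dichotomic selection on a sorted list: return the interval (lo, hi)
    of L that x falls into, with -inf/+inf sentinels at the boundaries."""
    i = bisect.bisect_left(L, x)
    lo = L[i - 1] if i > 0 else float("-inf")
    hi = L[i] if i < len(L) else float("inf")
    return lo, hi
```

Each call performs O(log |L|) comparisons; the cost analyzed in the paper is finer, counting the symbol comparisons those word comparisons expand into.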
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.19/LIPIcs.CPM.2019.19.pdf
dichotomic selection
text algorithms
analysis of algorithms
average case analysis of algorithms
trie
suffix array
lcp-array
information theory
numeration process
sources
entropy
coincidence
analytic combinatorics
depoissonization techniques
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
20:1
20:14
10.4230/LIPIcs.CPM.2019.20
article
Finding a Small Number of Colourful Components
Bulteau, Laurent
1
https://orcid.org/0000-0003-1645-9345
Dabrowski, Konrad K.
2
https://orcid.org/0000-0001-9515-6945
Fertin, Guillaume
3
https://orcid.org/0000-0002-8251-2012
Johnson, Matthew
2
https://orcid.org/0000-0002-7295-2663
Paulusma, Daniël
2
https://orcid.org/0000-0001-5945-9287
Vialette, Stéphane
1
https://orcid.org/0000-0003-2308-6970
Université Paris-Est, LIGM (UMR 8049), CNRS, ENPC, UPEM, ESIEE Paris, France
Department of Computer Science, Durham University, Durham, UK
Université de Nantes, LS2N (UMR 6004), CNRS, Nantes, France
A partition (V_1,...,V_k) of the vertex set of a graph G with a (not necessarily proper) colouring c is colourful if no two vertices in any V_i have the same colour and every set V_i induces a connected graph. The Colourful Partition problem, introduced by Adamaszek and Popa, is to decide whether a coloured graph (G,c) has a colourful partition of size at most k. This problem is related to the Colourful Components problem, introduced by He, Liu and Zhao, which is to decide whether a graph can be modified into a graph whose connected components form a colourful partition by deleting at most p edges.
Despite the similarities in their definitions, we show that Colourful Partition and Colourful Components may have different complexities for restricted instances. We tighten known NP-hardness results for both problems by closing a number of complexity gaps. In addition, we prove new hardness and tractability results for Colourful Partition. In particular, we prove that deciding whether a coloured graph (G,c) has a colourful partition of size 2 is NP-complete for coloured planar bipartite graphs of maximum degree 3 and path-width 3, but polynomial-time solvable for coloured graphs of treewidth 2.
Rather than performing an ad hoc study, we use our classical complexity results to guide us in undertaking a thorough parameterized study of Colourful Partition. We show that this leads to suitable parameters for obtaining FPT results and moreover prove that Colourful Components and Colourful Partition may have different parameterized complexities, depending on the chosen parameter.
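The defining property of a colourful partition is easy to verify for a candidate partition; a minimal checker (hypothetical interface, with the graph as a dict of adjacency lists and nonempty parts assumed):

```python
def is_colourful_partition(adj, colouring, parts):
    """Check that a partition of the vertices is colourful: each part has
    pairwise distinct colours and induces a connected subgraph."""
    for part in parts:
        cols = [colouring[v] for v in part]
        if len(cols) != len(set(cols)):
            return False
        # connectivity of the induced subgraph via DFS
        part = set(part)
        seen, stack = set(), [next(iter(part))]
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            stack.extend(w for w in adj[v] if w in part and w not in seen)
        if seen != part:
            return False
    return True
```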
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.20/LIPIcs.CPM.2019.20.pdf
Colourful component
colourful partition
tree
treewidth
vertex cover
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
21:1
21:15
10.4230/LIPIcs.CPM.2019.21
article
Streaming Dictionary Matching with Mismatches
Gawrychowski, Paweł
1
Starikovskaya, Tatiana
2
University of Wrocław, 50-137 Wrocław, Poland
DIENS, École normale supérieure, PSL Research University, 75005 Paris, France
In the k-mismatch problem we are given a pattern of length m and a text and must find all locations where the Hamming distance between the pattern and the text is at most k. A series of recent breakthroughs have resulted in an ultra-efficient streaming algorithm for this problem that requires only O(k log m/k) space [Clifford, Kociumaka, Porat, SODA 2019]. In this work, we consider a strictly harder problem called dictionary matching with k mismatches, where we are given a dictionary of d patterns of lengths at most m and must find all their k-mismatch occurrences in the text, and show the first streaming algorithm for it. The algorithm uses O(k d log^k d polylog m) space and processes each position of the text in O(k log^k d polylog m + occ) time, where occ is the number of k-mismatch occurrences of the patterns that end at this position. The algorithm is randomised and outputs correct answers with high probability.
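The occurrences being reported can be pinned down with a naive offline checker (for contrast with the streaming algorithm, which never stores the whole text):

```python
def k_mismatch_occurrences(patterns, text, k):
    """Report (end_position, pattern_index) for every k-mismatch occurrence
    of each pattern in text, by direct Hamming-distance comparison."""
    out = []
    for idx, p in enumerate(patterns):
        m = len(p)
        for start in range(len(text) - m + 1):
            mism = sum(1 for a, b in zip(p, text[start:start + m]) if a != b)
            if mism <= k:
                out.append((start + m - 1, idx))
    return out
```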
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.21/LIPIcs.CPM.2019.21.pdf
Streaming
multiple pattern matching
Hamming distance
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
22:1
22:14
10.4230/LIPIcs.CPM.2019.22
article
Quasi-Periodicity in Streams
Gawrychowski, Paweł
1
Radoszewski, Jakub
2
Starikovskaya, Tatiana
3
University of Wrocław, 50-137 Wrocław, Poland
Institute of Informatics, University of Warsaw, 02-097 Warsaw, Poland
DIENS, École normale supérieure, PSL Research University, 75005 Paris, France
In this work, we show two streaming algorithms for computing the length of the shortest cover of a string of length n. We start by showing a two-pass algorithm that uses O(log^2 n) space and then show a one-pass streaming algorithm that uses O(sqrt{n log n}) space. Both algorithms run in near-linear time. The algorithms are randomized and compute the answer incorrectly with probability inverse-polynomial in n. We also show that there is no sublinear-space streaming algorithm for computing the length of the shortest seed of a string.
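Offline, with the whole string in memory, the shortest cover has a simple definition-chasing computation (a cover is necessarily a prefix, and covering the last position forces it to also be a suffix):

```python
def shortest_cover(s):
    """Length of the shortest cover of s by brute force: try each prefix
    length and check that its occurrences jointly cover every position."""
    n = len(s)
    for ell in range(1, n + 1):
        c = s[:ell]
        covered = [False] * n
        start = s.find(c)
        while start != -1:
            for i in range(start, start + ell):
                covered[i] = True
            start = s.find(c, start + 1)  # overlapping occurrences
        if all(covered):
            return ell
    return n
```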
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.22/LIPIcs.CPM.2019.22.pdf
Streaming algorithms
quasi-periodicity
covers
seeds
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
23:1
23:11
10.4230/LIPIcs.CPM.2019.23
article
Computing Runs on a Trie
Sugahara, Ryo
1
Nakashima, Yuto
1
Inenaga, Shunsuke
1
Bannai, Hideo
1
https://orcid.org/0000-0002-6856-5185
Takeda, Masayuki
1
Department of Informatics, Kyushu University, Japan
A maximal repetition, or run, in a string is a maximal periodic substring whose smallest period is at most half the length of the substring. In this paper, we consider runs that correspond to a path on a trie, i.e., on a rooted edge-labeled tree, where one endpoint of the path must be an ancestor of the other. For a trie with n edges, we show that the number of runs is less than n. We also show an O(n sqrt{log n} log log n) time and O(n) space algorithm for counting and finding the shallower endpoint of all runs. We further show an O(n log n) time and O(n) space algorithm for finding both endpoints of all runs. We also discuss how to improve the running time even more.
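For strings (the special case of a single-path trie), the objects being counted can be enumerated naively; a quadratic sketch of the definition:

```python
def smallest_period(w):
    """Smallest p such that w[i] == w[i+p] for all valid i."""
    return next(p for p in range(1, len(w) + 1)
                if all(w[i] == w[i + p] for i in range(len(w) - p)))

def runs(s):
    """All runs (start, end, period) of s: maximal periodic substrings
    at least twice as long as their smallest period."""
    n = len(s)
    found = set()
    for i in range(n):
        for p in range(1, (n - i) // 2 + 1):
            j = i + p
            while j < n and s[j] == s[j - p]:  # extend right-maximally
                j += 1
            if j - i < 2 * p:
                continue
            if i > 0 and s[i - 1] == s[i + p - 1]:
                continue  # extendable to the left: not maximal
            if smallest_period(s[i:j]) == p:  # keep only the smallest period
                found.add((i, j - 1, p))
    return found
```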
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.23/LIPIcs.CPM.2019.23.pdf
runs
Lyndon words
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
24:1
24:20
10.4230/LIPIcs.CPM.2019.24
article
Linking BWT and XBW via Aho-Corasick Automaton: Applications to Run-Length Encoding
Cazaux, Bastien
1
2
https://orcid.org/0000-0002-1761-4354
Rivals, Eric
2
https://orcid.org/0000-0003-3791-3973
Department of Computer Science, University of Helsinki, Finland
L.I.R.M.M., CNRS, Université Montpellier, France
The boom of genomic sequencing makes compression of sets of sequences inescapable. This underlies the need for multi-string indexing data structures that help compress the data. The most prominent example of such data structures is the Burrows-Wheeler Transform (BWT), a reversible permutation of a text that improves its compressibility. A similar data structure, the eXtended Burrows-Wheeler Transform (XBW), is able to index a tree labelled with alphabet symbols. A link between a multi-string BWT and the Aho-Corasick automaton has already been found and led to a way to build an XBW from a multi-string BWT. We exhibit a stronger link between a multi-string BWT and an XBW by using the order of the concatenation in the multi-string. This bijective link has several applications: first, it allows one to build one data structure from the other; second, it enables one to compute an ordering of the input strings that optimises a Run-Length measure (i.e., the compressibility) of the BWT or of the XBW.
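As a reminder of the basic object, the BWT of a single string can be computed from sorted rotations; a minimal sketch (real indexes use suffix sorting rather than materialising all rotations):

```python
def bwt(s, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations; the sentinel is
    assumed smaller than every character of s."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)  # last column
```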
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.24/LIPIcs.CPM.2019.24.pdf
Data Structure
Algorithm
Aho-Corasick Tree
compression
RLE
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
25:1
25:14
10.4230/LIPIcs.CPM.2019.25
article
Quasi-Linear-Time Algorithm for Longest Common Circular Factor
Alzamel, Mai
1
2
https://orcid.org/0000-0002-7590-9919
Crochemore, Maxime
1
3
https://orcid.org/0000-0003-1087-1419
Iliopoulos, Costas S.
1
https://orcid.org/0000-0003-3909-0077
Kociumaka, Tomasz
4
5
https://orcid.org/0000-0002-2477-1702
Radoszewski, Jakub
4
https://orcid.org/0000-0002-0067-6401
Rytter, Wojciech
4
https://orcid.org/0000-0002-9162-6724
Straszyński, Juliusz
4
https://orcid.org/0000-0003-2207-0053
Waleń, Tomasz
4
https://orcid.org/0000-0002-7369-3309
Zuba, Wiktor
4
https://orcid.org/0000-0002-1988-3507
Department of Informatics, King’s College London, UK
Department of Computer Science, King Saud University, Riyadh, Saudi Arabia
Laboratoire d\'Informatique Gaspard-Monge, Université Paris-Est, Marne-la-Vallée, France
Institute of Informatics, University of Warsaw, Poland
Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
We introduce the Longest Common Circular Factor (LCCF) problem in which, given strings S and T of length at most n, we are to compute the longest factor of S whose cyclic shift occurs as a factor of T. It is a new similarity measure, an extension of the classic Longest Common Factor. We show how to solve the LCCF problem in O(n log^4 n) time using O(n log^2 n) space.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.25/LIPIcs.CPM.2019.25.pdf
longest common factor
circular pattern matching
internal pattern matching
intersection of hyperrectangles
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
26:1
26:20
10.4230/LIPIcs.CPM.2019.26
article
Simulating the DNA Overlap Graph in Succinct Space
Díaz-Domínguez, Diego
1
https://orcid.org/0000-0002-9071-0254
Gagie, Travis
2
3
https://orcid.org/0000-0003-3689-327X
Navarro, Gonzalo
3
4
https://orcid.org/0000-0002-2286-741X
CeBiB - Center for Biotechnology and Bioengineering, University of Chile, Chile\and Department of Computer Science, University of Chile, Chile
School of Computer Science and Telecommunications, Diego Portales University, Chile
CeBiB - Center for Biotechnology and Bioengineering, University of Chile, Chile
Department of Computer Science, University of Chile, Chile
Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph (dBG) of some order k. Each has advantages over the other depending on the application. Still, the ideal setting would be to have an index of the reads that is easy to build and can be adapted to any type of biological analysis. In this paper we propose rBOSS, a new data structure based on the Burrows-Wheeler Transform (BWT), which gets close to that ideal. Our rBOSS simultaneously encodes all the dBGs of a set of sequencing reads up to some order k, and for any dBG node v, it can compute in O(k) time all the other nodes whose labels have an overlap of at least m characters with the label of v, with m being a parameter. If we choose the parameter k equal to the size of the reads (assuming that all have equal length), then we can simulate the overlap graph of the read set. Instead of storing the edges of this graph explicitly, rBOSS computes them on the fly as we traverse the graph. Like most BWT-based structures, rBOSS is unidirectional, meaning that we can retrieve only the suffix overlaps of the nodes. However, we exploit the properties of DNA reverse complements to simulate bi-directionality. We implemented a genome assembler on top of rBOSS to demonstrate its usefulness. The experimental results show that, using k=100, our rBOSS-based assembler can process ~500K reads of 150 characters each (a FASTQ file of 185 MB) in less than 15 minutes and using 110 MB in total. It produces contigs of mean size over 10,000, which is twice the size obtained by using a pure de Bruijn graph of fixed length k.
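The suffix-prefix overlaps that rBOSS reports in O(k) time per node can be defined by a naive all-pairs computation (a hypothetical sketch ignoring reverse complements):

```python
def suffix_prefix_overlaps(reads, m):
    """Edges of the overlap graph by brute force: (u, v, t) whenever a
    suffix of reads[u] of length t >= m equals a prefix of reads[v];
    only the longest overlap per ordered pair is kept."""
    edges = []
    for u, a in enumerate(reads):
        for v, b in enumerate(reads):
            if u == v:
                continue
            for t in range(min(len(a), len(b)), m - 1, -1):
                if a.endswith(b[:t]):
                    edges.append((u, v, t))
                    break
    return edges
```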
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.26/LIPIcs.CPM.2019.26.pdf
Overlap graph
de Bruijn graph
DNA sequencing
Succinct ordinal trees
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
27:1
27:13
10.4230/LIPIcs.CPM.2019.27
article
Faster Queries for Longest Substring Palindrome After Block Edit
Funakoshi, Mitsuru
1
Nakashima, Yuto
1
Inenaga, Shunsuke
1
Bannai, Hideo
1
https://orcid.org/0000-0002-6856-5185
Takeda, Masayuki
1
Department of Informatics, Kyushu University, Japan
Palindromes are important objects in strings which have been extensively studied from combinatorial, algorithmic, and bioinformatics points of view. Manacher [J. ACM 1975] proposed a seminal algorithm that computes the longest substring palindromes (LSPals) of a given string T in O(n) time, where n is the length of T. In this paper, we consider the problem of finding the LSPal after the string is edited. We present an algorithm that uses O(n) time and space for preprocessing, and answers the length of the LSPals in O(l + log log n) time after a substring of T is replaced by a string of arbitrary length l. This outperforms the query algorithm proposed in our previous work [CPM 2018], which uses O(l + log n) time for each query.
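For reference, the static version of the query — the longest palindromic substring — has a classic quadratic centre-expansion solution (Manacher's algorithm achieves the same in O(n)):

```python
def longest_palindrome(s):
    """Longest palindromic substring via centre expansion: try all
    2n-1 odd and even centres and expand while characters match."""
    best = ""
    for centre in range(2 * len(s) - 1):
        l, r = centre // 2, (centre + 1) // 2
        while l >= 0 and r < len(s) and s[l] == s[r]:
            l, r = l - 1, r + 1
        if r - l - 1 > len(best):
            best = s[l + 1:r]
    return best
```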
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.27/LIPIcs.CPM.2019.27.pdf
palindromes
string algorithm
periodicity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
28:1
28:15
10.4230/LIPIcs.CPM.2019.28
article
A Rearrangement Distance for Fully-Labelled Trees
Bernardini, Giulia
1
Bonizzoni, Paola
1
Della Vedova, Gianluca
1
Patterson, Murray
1
DISCo, Università degli Studi Milano - Bicocca, Italy
The problem of comparing trees representing the evolutionary histories of cancerous tumors has turned out to be crucial, since there is a variety of different methods which typically infer multiple possible trees. In a departure from the widely studied setting of classical phylogenetics, where trees are leaf-labelled, tumoral trees are fully labelled, i.e., every vertex has a label.
In this paper we provide a rearrangement distance measure between two fully-labelled trees. This notion originates from two operations: one which modifies the topology of the tree, the other which permutes the labels of the vertices, hence leaving the topology unaffected. While we show that the distance between two trees in terms of each such operation alone can be decided in polynomial time, the more general notion of distance when both operations are allowed is NP-hard to decide. Despite this result, we show that it is fixed-parameter tractable, and we give a 4-approximation algorithm when one of the trees is binary.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.28/LIPIcs.CPM.2019.28.pdf
Tree rearrangement distance
Cancer progression
Approximation algorithms
Computational complexity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
29:1
29:11
10.4230/LIPIcs.CPM.2019.29
article
On the Size of Overlapping Lempel-Ziv and Lyndon Factorizations
Urabe, Yuki
1
Nakashima, Yuto
1
Inenaga, Shunsuke
1
Bannai, Hideo
1
https://orcid.org/0000-0002-6856-5185
Takeda, Masayuki
1
Department of Informatics, Kyushu University, Japan
Lempel-Ziv (LZ) factorization and Lyndon factorization are well-known factorizations of strings. Recently, Kärkkäinen et al. studied the relation between the sizes of the two factorizations, and showed that the size of the Lyndon factorization is always smaller than twice the size of the non-overlapping LZ factorization [STACS 2017]. In this paper, we consider a similar problem for the overlapping version of the LZ factorization. Since the size of the overlapping LZ factorization is always smaller than the size of the non-overlapping LZ factorization and, in fact, can even be an O(log n) factor smaller, it is not immediately clear whether a similar bound as in previous work would hold. Nevertheless, in this paper, we prove that the size of the Lyndon factorization is always smaller than four times the size of the overlapping LZ factorization.
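One of the two factorizations being compared is computable in linear time by Duval's algorithm; a compact sketch:

```python
def lyndon_factorization(s):
    """Duval's algorithm: factor s into a non-increasing sequence of
    Lyndon words in O(n) time."""
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:  # emit copies of the found Lyndon word
            factors.append(s[i:i + j - k])
            i += j - k
    return factors
```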
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.29/LIPIcs.CPM.2019.29.pdf
Lyndon factorization
Lyndon words
Lempel-Ziv factorization
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
30:1
30:19
10.4230/LIPIcs.CPM.2019.30
article
Online Algorithms for Constructing Linear-Size Suffix Trie
Hendrian, Diptarama
1
Takagi, Takuya
2
Inenaga, Shunsuke
3
Graduate School of Information Sciences, Tohoku University, Sendai, Japan
Fujitsu Laboratories Ltd., Kawasaki, Japan
Department of Informatics, Kyushu University, Fukuoka, Japan
Suffix trees are fundamental data structures for various kinds of string processing. The suffix tree of a string T of length n has O(n) nodes and edges, and the string label of each edge is encoded by a pair of positions in T. Thus, even after the tree is built, the input text T needs to be kept stored and random access to T is still needed. The linear-size suffix tries (LSTs), proposed by Crochemore et al. [Linear-size suffix tries, TCS 638:171-178, 2016], are a "stand-alone" alternative to suffix trees. Namely, the LST of a string T of length n occupies O(n) total space, and supports pattern matching and other tasks with the same efficiency as the suffix tree, without the need to store the input text T. Crochemore et al. proposed an offline algorithm which transforms the suffix tree of T into the LST of T in O(n log sigma) time and O(n) space, where sigma is the alphabet size. In this paper, we present two types of online algorithms which "directly" construct the LST, from right to left and from left to right, without constructing the suffix tree as an intermediate structure. Both algorithms construct the LST incrementally when a new symbol is read, and do not access previously read symbols. The right-to-left construction algorithm works in O(n log sigma) time and O(n) space, and the left-to-right construction algorithm works in O(n (log sigma + log n / log log n)) time and O(n) space. The main feature of our algorithms is that the input text does not need to be stored.
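The uncompacted baseline that LSTs shrink to O(n) size is the plain suffix trie, which can take Theta(n^2) nodes; a naive construction as nested dicts:

```python
def suffix_trie(t):
    """Plain (possibly quadratic-size) suffix trie of t, one node per
    distinct substring, represented as nested dicts keyed by character."""
    root = {}
    for i in range(len(t)):
        node = root
        for c in t[i:]:  # insert suffix t[i:] character by character
            node = node.setdefault(c, {})
    return root
```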
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.30/LIPIcs.CPM.2019.30.pdf
Indexing structure
Linear-size suffix trie
Online algorithm
Pattern Matching
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
31:1
31:14
10.4230/LIPIcs.CPM.2019.31
article
Searching Long Repeats in Streams
Merkurev, Oleg
1
Shur, Arseny M.
1
Ural Federal University, Ekaterinburg, Russia
We consider two well-known related problems: Longest Repeated Substring (LRS) and Longest Repeated Reversed Substring (LRRS). Their streaming versions cannot be solved exactly; we show that only approximate solutions by Monte Carlo algorithms are possible, and prove a lower bound on consumed memory. For both problems, we present purely linear-time Monte Carlo algorithms working in O(E + n/E) space, where E is the additive approximation error. Within the same space bounds, we then present nearly real-time solutions, which require O(log n) time per symbol and O(n + n/E log n) time overall. The working space exactly matches the lower bound whenever E=O(n^{0.5}) and the size of the alphabet is Omega(n^{0.01}).
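Offline, LRS has a simple quadratic brute force over pairs of suffixes (the streaming lower bound rules out anything comparably exact in small space):

```python
def longest_repeated_substring(s):
    """Longest substring occurring at least twice in s (occurrences may
    overlap), via the longest common prefix of every pair of suffixes."""
    def lcp(i, j):
        k = 0
        while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
            k += 1
        return k
    best, pos = 0, 0
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            k = lcp(i, j)
            if k > best:
                best, pos = k, i
    return s[pos:pos + best]
```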
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.31/LIPIcs.CPM.2019.31.pdf
Longest repeated substring
longest repeated reversed substring
streaming algorithm
Karp-Rabin fingerprint
suffix tree
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-06-06
128
32:1
32:11
10.4230/LIPIcs.CPM.2019.32
article
Computing the Antiperiod(s) of a String
Alamro, Hayam
1
2
Badkobeh, Golnaz
3
Belazzougui, Djamal
4
Iliopoulos, Costas S.
1
Puglisi, Simon J.
5
Department of Informatics, King’s College London, UK
Department of Information Systems, Princess Nourah bint Abdulrahman University, Riyadh, KSA
Department of Computing, Goldsmiths, University of London, UK
Centre de Recherche sur l'Information Scientifique et Technique, Algeria
Department of Computer Science, University of Helsinki, Finland
A string S[1,n] is a power (or repetition or tandem repeat) of order k and period n/k if it can be decomposed into k consecutive identical blocks of length n/k. Powers and periods are fundamental structures in the study of strings, and algorithms to compute them efficiently have been widely studied. Recently, Fici et al. (Proc. ICALP 2016) introduced an antipower of order k to be a string composed of k distinct blocks of the same length, n/k, called the antiperiod. An arbitrary string has antiperiod t if it is a prefix of an antipower with antiperiod t. In this paper, we describe an efficient algorithm for computing the smallest antiperiod of a string S of length n in O(n) time. We also describe an algorithm to compute all the antiperiods of S that runs in O(n log n) time.
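A direct reading of the definition gives a simple quadratic computation (simplified: the trailing partial block is assumed to admit a completion distinct from the complete blocks, which typically holds for alphabets of size at least two):

```python
def smallest_antiperiod(s):
    """Smallest t such that the complete length-t blocks of s are
    pairwise distinct, i.e. s is a prefix of an antipower with
    antiperiod t (under the partial-block assumption above)."""
    n = len(s)
    for t in range(1, n + 1):
        blocks = [s[i:i + t] for i in range(0, n - t + 1, t)]
        if len(blocks) == len(set(blocks)):
            return t
    return n
```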
https://drops.dagstuhl.de/storage/00lipics/lipics-vol128-cpm2019/LIPIcs.CPM.2019.32/LIPIcs.CPM.2019.32.pdf
antiperiod
antipower
power
period
repetition
run
string