eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
1
472
10.4230/LIPIcs.CPM.2023
article
LIPIcs, Volume 259, CPM 2023, Complete Volume
Bulteau, Laurent
1
https://orcid.org/0000-0003-1645-9345
Lipták, Zsuzsanna
2
https://orcid.org/0000-0002-3233-0691
LIGM, CNRS, Université Gustave Eiffel, Marne-la-Vallée, France
University of Verona, Italy
LIPIcs, Volume 259, CPM 2023, Complete Volume
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023/LIPIcs.CPM.2023.pdf
LIPIcs, Volume 259, CPM 2023, Complete Volume
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
0:i
0:xvi
10.4230/LIPIcs.CPM.2023.0
article
Front Matter, Table of Contents, Preface, Conference Organization
Bulteau, Laurent
1
https://orcid.org/0000-0003-1645-9345
Lipták, Zsuzsanna
2
https://orcid.org/0000-0002-3233-0691
LIGM, CNRS, Université Gustave Eiffel, Marne-la-Vallée, France
University of Verona, Italy
Front Matter, Table of Contents, Preface, Conference Organization
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.0/LIPIcs.CPM.2023.0.pdf
Front Matter
Table of Contents
Preface
Conference Organization
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
1:1
1:19
10.4230/LIPIcs.CPM.2023.1
article
Trie-Compressed Adaptive Set Intersection
Arroyuelo, Diego
1
2
https://orcid.org/0000-0002-2509-8097
Castillo, Juan Pablo
1
2
Departamento de Informática, Universidad Técnica Federico Santa María, Santiago, Chile
Millennium Institute for Foundational Research on Data, Santiago, Chile
We introduce space- and time-efficient algorithms and data structures for the offline set intersection problem. We show that a sorted integer set S ⊆ [0..u) of n elements can be represented using compressed space while supporting k-way intersections in adaptive O(kδlg(u/δ)) time, δ being the alternation measure introduced by Barbay and Kenyon. Our experimental results suggest that our approaches are competitive in practice, outperforming the most efficient alternatives (Partitioned Elias-Fano indexes, Roaring Bitmaps, and Recursive Universe Partitioning (RUP)) in several scenarios, offering in general relevant space-time trade-offs.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.1/LIPIcs.CPM.2023.1.pdf
Set intersection problem
Adaptive Algorithms
Compressed and compact data structures
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
2:1
2:12
10.4230/LIPIcs.CPM.2023.2
article
Approximation Algorithms for the Longest Run Subsequence Problem
Asahiro, Yuichi
1
https://orcid.org/0000-0002-9801-3285
Eto, Hiroshi
2
https://orcid.org/0000-0003-1456-1987
Gong, Mingyang
3
Jansson, Jesper
4
https://orcid.org/0000-0001-6859-8932
Lin, Guohui
3
https://orcid.org/0000-0003-4283-3396
Miyano, Eiji
2
https://orcid.org/0000-0002-4260-7818
Ono, Hirotaka
5
https://orcid.org/0000-0003-0845-3947
Tanaka, Shunichi
2
Kyushu Sangyo University, Fukuoka, Japan
Kyushu Institute of Technology, Iizuka, Japan
University of Alberta, Edmonton, Canada
Kyoto University, Kyoto, Japan
Nagoya University, Nagoya, Japan
We study the approximability of the Longest Run Subsequence problem (LRS for short). For a string S = s_1 ⋯ s_n over an alphabet Σ, a run of a symbol σ ∈ Σ in S is a maximal substring of consecutive occurrences of σ. A run subsequence S' of S is a subsequence of S in which every symbol σ ∈ Σ occurs in at most one run. Given a string S, the goal of LRS is to find a run subsequence S^* of S whose length |S^*| is maximum over all run subsequences of S. It is known that LRS is APX-hard even if each symbol has at most two occurrences in the input string, and that LRS admits a polynomial-time k-approximation algorithm if the number of occurrences of every symbol in the input string is bounded by k. In this paper, we design a polynomial-time (k+1)/2-approximation algorithm for LRS under the k-occurrence constraint on input strings. For the case k = 2, we further improve the approximation ratio from 3/2 to 4/3.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.2/LIPIcs.CPM.2023.2.pdf
Longest run subsequence problem
bounded occurrence
approximation algorithm
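The run subsequence definition in the abstract of this record can be made concrete with a small sketch. This is only an exponential-time brute force for tiny inputs, not the approximation algorithm of the paper; the function names `is_run_string` and `longest_run_subsequence` are hypothetical and chosen here for illustration.

```python
from itertools import combinations, groupby


def is_run_string(t):
    """True if every symbol of t occurs in at most one run."""
    # groupby collapses each maximal run into one entry, so a symbol
    # that starts two different runs shows up twice in `heads`.
    heads = [symbol for symbol, _ in groupby(t)]
    return len(heads) == len(set(heads))


def longest_run_subsequence(s):
    """Brute-force LRS: try subsequences from longest to shortest."""
    n = len(s)
    for length in range(n, -1, -1):
        for positions in combinations(range(n), length):
            candidate = "".join(s[i] for i in positions)
            if is_run_string(candidate):
                return candidate
    return ""
```

For example, `"abab"` is not itself a run subsequence (both symbols occur in two runs), but dropping one symbol yields a valid run subsequence of length 3.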
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
3:1
3:11
10.4230/LIPIcs.CPM.2023.3
article
Optimal LZ-End Parsing Is Hard
Bannai, Hideo
1
https://orcid.org/0000-0002-6856-5185
Funakoshi, Mitsuru
2
3
https://orcid.org/0000-0002-2547-1509
Kurita, Kazuhiro
4
https://orcid.org/0000-0002-7638-3322
Nakashima, Yuto
2
https://orcid.org/0000-0001-6269-9353
Seto, Kazuhisa
5
https://orcid.org/0000-0001-9043-7019
Uno, Takeaki
6
M&D Data Science Center, Tokyo Medical and Dental University, Japan
Department of Informatics, Kyushu University, Fukuoka, Japan
Japan Society for the Promotion of Science, Tokyo, Japan
Nagoya University, Japan
Faculty of Information Science and Technology, Hokkaido University, Sapporo, Japan
National Institute of Informatics, Tokyo, Japan
LZ-End is a variant of the well-known Lempel-Ziv parsing family such that each phrase of the parsing has a previous occurrence, with the additional constraint that the previous occurrence must end at the end of a previous phrase. LZ-End was initially proposed as a greedy parsing, where each phrase is determined greedily from left to right, as the longest factor that satisfies the above constraint [Kreft & Navarro, 2010]. In this work, we consider an optimal LZ-End parsing, that is, one with the minimum number of phrases among all such parsings. We show that a decision version of computing the optimal LZ-End parsing is NP-complete by showing a reduction from the vertex cover problem. Moreover, we give a MAX-SAT formulation for the optimal LZ-End parsing, adapting an approach for computing various NP-hard repetitiveness measures recently presented by Bannai et al. [2022]. We also consider the approximation ratio of the size of the greedy LZ-End parsing to the size of the optimal LZ-End parsing, and give a lower bound on the ratio which asymptotically approaches 2.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.3/LIPIcs.CPM.2023.3.pdf
Data Compression
LZ-End
Repetitiveness measures
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
4:1
4:18
10.4230/LIPIcs.CPM.2023.4
article
Sliding Window String Indexing in Streams
Bille, Philip
1
https://orcid.org/0000-0002-1120-5154
Fischer, Johannes
2
https://orcid.org/0000-0002-3384-597X
Gørtz, Inge Li
1
https://orcid.org/0000-0002-8322-4952
Pedersen, Max Rishøj
1
https://orcid.org/0000-0002-8850-6422
Stordalen, Tord Joakim
1
https://orcid.org/0000-0002-1525-0104
DTU Compute, Technical University of Denmark, Lyngby, Denmark
Department of Computer Science, Technische Universität Dortmund, Germany
Given a string S over an alphabet Σ, the string indexing problem is to preprocess S to subsequently support efficient pattern matching queries, that is, given a pattern string P report all the occurrences of P in S. In this paper we study the streaming sliding window string indexing problem. Here the string S arrives as a stream, one character at a time, and the goal is to maintain an index of the last w characters, called the window, for a specified parameter w. At any point in time a pattern matching query for a pattern P may arrive, also streamed one character at a time, and all occurrences of P within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e. the window) of a stream while supporting efficient pattern matching.
Our main result is a simple O(w) space data structure that uses O(log w) time with high probability to process each character from both the input string S and any pattern string P. Reporting each occurrence of P uses additional constant time per reported occurrence. Compared to previous work in similar scenarios this result is the first to achieve an efficient worst-case time per character from the input stream with high probability. We also consider a delayed variant of the problem, where a query may be answered at any point within the next δ characters that arrive from either stream. We present an O(w + δ) space data structure for this problem that improves the above time bounds to O(log (w/δ)). In particular, for a delay of δ = ε w we obtain an O(w) space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.4/LIPIcs.CPM.2023.4.pdf
String indexing
pattern matching
sliding window
streaming
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
5:1
5:18
10.4230/LIPIcs.CPM.2023.5
article
Faster Algorithms for Computing the Hairpin Completion Distance and Minimum Ancestor
Boneh, Itai
1
Fried, Dvir
2
Miclăuş, Adrian
3
Popa, Alexandru
3
Reichman University, Herzliya, Israel
Bar-Ilan University, Ramat-Gan, Israel
Faculty of Mathematics and Computer Science, University of Bucharest, Romania
Hairpin completion is an operation on formal languages that has been inspired by hairpin formation in DNA biochemistry and has many applications especially in DNA computing. Consider s to be a string over the alphabet {A, C, G, T} such that a prefix/suffix of it matches the reversed complement of a substring of s. Then, in a hairpin completion operation the reversed complement of this prefix/suffix is added to the start/end of s forming a new string.
In this paper we study two problems related to hairpin completion. The first problem asks for the minimum number of hairpin completion operations needed to transform one string into another; this number is called the hairpin completion distance. For this problem we show an algorithm with running time O(n²), where n is the maximum length of the two strings. Our algorithm improves on the algorithm of Manea (TCS 2010), which has running time O(n² log n).
In the minimum distance common hairpin completion ancestor problem we want to find, for two input strings x and y, a string w that minimizes the sum of the hairpin completion distances to x and y. Similarly, we present an algorithm with running time O(n²) that improves on the algorithm of Manea (TCS 2010) by an O(log n) factor.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.5/LIPIcs.CPM.2023.5.pdf
dynamic programming
incremental trees
exact algorithm
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
6:1
6:23
10.4230/LIPIcs.CPM.2023.6
article
On Distances Between Words with Parameters
Bourhis, Pierre
1
https://orcid.org/0000-0001-5699-0320
Boussidan, Aaron
2
Gambette, Philippe
2
https://orcid.org/0000-0001-7062-0262
Univ. Lille, CNRS, Inria, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
LIGM, Université Gustave Eiffel, CNRS, Marne-la-Vallée, France
The edit distance between parameterized words is a generalization of the classical edit distance where it is allowed to map particular letters of the first word, called parameters, to parameters of the second word before computing the distance. This problem has been introduced in particular for detection of code duplication, and the notion of words with parameters has also been used with different semantics in other fields. The complexity of several variants of edit distances between parameterized words has been studied; however, the complexity of the most natural one, the Levenshtein distance, remained open.
In this paper, we solve this open question and close the exhaustive analysis of all cases of parameterized word matching and function matching, showing that these problems are NP-complete. To this end, we also provide a comparison of the different problems, exhibiting several equivalences between them. We also provide and implement a MaxSAT encoding of the problem, as well as a simple FPT algorithm in the alphabet size, and study their efficiency on real data in the context of theater play structure comparison.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.6/LIPIcs.CPM.2023.6.pdf
String matching
edit distance
Levenshtein
parameterized matching
parameterized words
parameter words
instantiable words
NP-completeness
MAX-SAT
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
7:1
7:19
10.4230/LIPIcs.CPM.2023.7
article
Parameterized Algorithms for String Matching to DAGs: Funnels and Beyond
Cáceres, Manuel
1
https://orcid.org/0000-0003-0235-6951
Department of Computer Science, University of Helsinki, Finland
The problem of String Matching to Labeled Graphs (SMLG) asks to find all the paths in a labeled graph G = (V, E) whose spellings match that of an input string S ∈ Σ^m. SMLG can be solved in quadratic O(m|E|) time [Amir et al., JALG 2000], which was proven to be optimal by a recent lower bound conditioned on SETH [Equi et al., ICALP 2019]. The lower bound states that no strongly subquadratic time algorithm exists, even if restricted to directed acyclic graphs (DAGs).
In this work we present the first parameterized algorithms for SMLG on DAGs. Our parameters capture the topological structure of G. All our results are derived from a generalization of the Knuth-Morris-Pratt algorithm [Park and Kim, CPM 1995] optimized to work in time proportional to the number of prefix-incomparable matches.
To obtain the parameterization in the topological structure of G, we first study a special class of DAGs called funnels [Millani et al., JCO 2020] and generalize them to k-funnels and the class ST_k. We present several novel characterizations and algorithmic contributions on both funnels and their generalizations.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.7/LIPIcs.CPM.2023.7.pdf
string matching
parameterized algorithms
FPT inside P
string algorithms
graph algorithms
directed acyclic graphs
labeled graphs
funnels
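The quadratic O(m|E|) bound mentioned in the abstract of this record can be illustrated with a simple position-by-position dynamic program. This is a minimal sketch of a straightforward baseline, not the algorithm of Amir et al. nor the parameterized algorithms of the paper. It assumes each node carries a single-character label and reports only the nodes where a matching path ends; the name `match_end_nodes` and the input encoding are hypothetical.

```python
def match_end_nodes(labels, edges, s):
    """Nodes at which some path spelling s ends.

    labels: dict mapping node -> single-character label (assumed encoding)
    edges:  iterable of directed (u, v) pairs
    s:      the pattern string
    """
    if not s:
        return set()
    adjacency = {v: [] for v in labels}
    for u, v in edges:
        adjacency[u].append(v)
    # frontier = nodes where a path spelling the current prefix of s ends
    frontier = {v for v, c in labels.items() if c == s[0]}
    for ch in s[1:]:
        # Each round scans every edge at most once, giving O(m|E|) overall.
        frontier = {w for v in frontier for w in adjacency[v] if labels[w] == ch}
        if not frontier:
            break
    return frontier
```

A non-empty result means the string occurs as the spelling of at least one path; recovering the paths themselves would require storing predecessor information per position.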
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
8:1
8:18
10.4230/LIPIcs.CPM.2023.8
article
Optimal Near-Linear Space Heaviest Induced Ancestors
Charalampopoulos, Panagiotis
1
https://orcid.org/0000-0002-6024-1557
Dudek, Bartłomiej
2
https://orcid.org/0000-0003-2652-995X
Gawrychowski, Paweł
2
https://orcid.org/0000-0002-6993-5440
Pokorski, Karol
2
https://orcid.org/0000-0002-2140-8641
Birkbeck, University of London, UK
Institute of Computer Science, University of Wrocław, Poland
We revisit the Heaviest Induced Ancestors (HIA) problem that was introduced by Gagie, Gawrychowski, and Nekrich [CCCG 2013] and has a number of applications in string algorithms. Let T₁ and T₂ be two rooted trees whose nodes have weights that are increasing in all root-to-leaf paths, and labels on the leaves, such that no two leaves of a tree have the same label. A pair of nodes (u, v) ∈ T₁ × T₂ is induced if and only if there is a label shared by leaf-descendants of u and v. In an HIA query, given nodes x ∈ T₁ and y ∈ T₂, the goal is to find an induced pair of nodes (u, v) of the maximum total weight such that u is an ancestor of x and v is an ancestor of y.
Let n be the upper bound on the sizes of the two trees. It is known that no data structure of size 𝒪̃(n) can answer HIA queries in o(log n / log log n) time [Charalampopoulos, Gawrychowski, Pokorski; ICALP 2020]. This (unconditional) lower bound is a polyloglog n factor away from the query time of the fastest 𝒪̃(n)-size data structure known to date for the HIA problem [Abedin, Hooshmand, Ganguly, Thankachan; Algorithmica 2022]. In this work, we resolve the query-time complexity of the HIA problem for the near-linear space regime by presenting a data structure that can be built in 𝒪̃(n) time and answers HIA queries in 𝒪(log n/log log n) time. As a direct corollary, we obtain an 𝒪̃(n)-size data structure that maintains the LCS of a static string and a dynamic string, both of length at most n, in time optimal for this space regime.
The main ingredients of our approach are fractional cascading and the utilization of an 𝒪(log n/ log log n)-depth tree decomposition. The latter allows us to break through the Ω(log n) barrier faced by previous works, due to the depth of the considered heavy-path decompositions.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.8/LIPIcs.CPM.2023.8.pdf
data structures
string algorithms
fractional cascading
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
9:1
9:20
10.4230/LIPIcs.CPM.2023.9
article
From Bit-Parallelism to Quantum String Matching for Labelled Graphs
Equi, Massimo
1
https://orcid.org/0000-0001-8609-0040
Meijer-van de Griend, Arianne
1
https://orcid.org/0000-0001-5946-0958
Mäkinen, Veli
1
https://orcid.org/0000-0003-4454-1493
Department of Computer Science, University of Helsinki, Finland
Many problems that can be solved in quadratic time have bit-parallel speed-ups with factor w, where w is the computer word size. A classic example is computing the edit distance of two strings of length n, which can be solved in O(n²/w) time. In a reasonable classical model of computation, one can assume w = Θ(log n), and obtaining significantly better speed-ups is unlikely in the light of conditional lower bounds obtained for such problems.
In this paper, we study the connection of bit-parallelism to quantum computation, aiming to see if a bit-parallel algorithm could be converted to a quantum algorithm with better than logarithmic speed-up. We focus on string matching in labeled graphs, the problem of finding an exact occurrence of a string as the label of a path in a graph. This problem admits a quadratic conditional lower bound under a very restricted class of graphs (Equi et al. ICALP 2019), stating that no algorithm in the classical model of computation can solve the problem in time O(|P||E|^(1-ε)) or O(|P|^(1-ε)|E|). We show that a simple bit-parallel algorithm on such a restricted family of graphs (level DAGs) can indeed be converted into a realistic quantum algorithm that attains subquadratic time complexity O(|E|√|P|).
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.9/LIPIcs.CPM.2023.9.pdf
Bit-parallelism
quantum computation
string matching
level DAGs
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
10:1
10:18
10.4230/LIPIcs.CPM.2023.10
article
On the Impact of Morphisms on BWT-Runs
Fici, Gabriele
1
https://orcid.org/0000-0002-3536-327X
Romana, Giuseppe
1
https://orcid.org/0000-0002-3489-0684
Sciortino, Marinella
1
https://orcid.org/0000-0001-6928-0168
Urbina, Cristian
2
3
https://orcid.org/0000-0001-8979-9055
Department of Mathematics and Informatics, University of Palermo, Italy
Department of Computer Science, University of Chile, Santiago, Chile
Centre for Biotechnology and Bioengineering (CeBiB), Santiago, Chile
Morphisms are widely studied combinatorial objects that can be used for generating infinite families of words. In the context of Information theory, injective morphisms are called (variable length) codes. In Data compression, morphisms, combined with parsing techniques, have recently been used to define new mechanisms to generate repetitive words. Here, we show that the repetitiveness induced by applying a morphism to a word can be captured by a compression scheme based on the Burrows-Wheeler Transform (BWT). In fact, we prove that, unlike other compression-based repetitiveness measures, the measure r_bwt (which counts the number of equal-letter runs produced by applying BWT to a word) strongly depends on the applied morphism. In more detail, we characterize the binary morphisms that preserve the value of r_bwt(w), when applied to any binary word w containing both letters. They are precisely the Sturmian morphisms, which are well-known objects in Combinatorics on words. Moreover, we prove that it is always possible to find a binary morphism that, when applied to any binary word containing both letters, increases the number of BWT equal-letter runs by a given (even) number. In addition, we derive a method for constructing arbitrarily large families of binary words on which BWT produces a given (even) number of new equal-letter runs. Such results are obtained by using a new class of morphisms that we call Thue-Morse-like. Finally, we show that there exist binary morphisms μ for which it is possible to find words w such that the difference r_bwt(μ(w))-r_bwt(w) is arbitrarily large.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.10/LIPIcs.CPM.2023.10.pdf
Morphism
Burrows-Wheeler transform
Sturmian word
Sturmian morphism
Thue-Morse morphism
Repetitiveness measure
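The measure r_bwt from the abstract of this record can be explored on small words with a naive sketch. It assumes the BWT is taken over the sorted cyclic rotations of the word (no end-of-string marker), and it uses the Fibonacci morphism a→ab, b→a as an example of a Sturmian morphism and a→ab, b→ba as the Thue-Morse morphism. The helper names are hypothetical.

```python
from itertools import groupby


def bwt(w):
    """BWT via sorted cyclic rotations (quadratic space, small inputs only)."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(rotation[-1] for rotation in rotations)


def r_bwt(w):
    """Number of maximal equal-letter runs in bwt(w)."""
    return sum(1 for _ in groupby(bwt(w)))


def apply_morphism(morphism, w):
    """Replace every letter of w by its image under the morphism."""
    return "".join(morphism[c] for c in w)


FIBONACCI = {"a": "ab", "b": "a"}    # a Sturmian morphism
THUE_MORSE = {"a": "ab", "b": "ba"}  # the Thue-Morse morphism
```

On the word `"ab"`, the Fibonacci morphism gives `"aba"` with the same number of BWT runs (2), while the Thue-Morse morphism gives `"abba"` with 4 runs, consistent with the behaviour the abstract describes.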
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
11:1
11:20
10.4230/LIPIcs.CPM.2023.11
article
Comparing Elastic-Degenerate Strings: Algorithms, Lower Bounds, and Applications
Gabory, Esteban
1
https://orcid.org/0000-0002-9897-1512
Mwaniki, Moses Njagi
2
https://orcid.org/0000-0002-4858-2375
Pisanti, Nadia
2
https://orcid.org/0000-0003-3915-7665
Pissis, Solon P.
1
3
https://orcid.org/0000-0002-1445-1932
Radoszewski, Jakub
4
https://orcid.org/0000-0002-0067-6401
Sweering, Michelle
1
https://orcid.org/0000-0003-1200-6015
Zuba, Wiktor
1
https://orcid.org/0000-0002-1988-3507
CWI, Amsterdam, The Netherlands
University of Pisa, Italy
Vrije Universiteit, Amsterdam, The Netherlands
Institute of Informatics, University of Warsaw, Poland
An elastic-degenerate (ED) string T is a sequence of n sets T[1],…,T[n] containing m strings in total whose cumulative length is N. We call n, m, and N the length, the cardinality and the size of T, respectively. The language of T is defined as ℒ(T) = {S_1 ⋯ S_n : S_i ∈ T[i] for all i ∈ [1,n]}. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem. For two ED strings T₁ and T₂ of lengths n₁ and n₂, cardinalities m₁ and m₂, and sizes N₁ and N₂, respectively, we show the following:
- There is no 𝒪((N₁N₂)^{1-ε})-time algorithm, thus no 𝒪((N₁m₂+N₂m₁)^{1-ε})-time algorithm and no 𝒪((N₁n₂+N₂n₁)^{1-ε})-time algorithm, for any constant ε > 0, for EDSI even when T₁ and T₂ are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false.
- There is no combinatorial 𝒪((N₁+N₂)^{1.2-ε}f(n₁,n₂))-time algorithm, for any constant ε > 0 and any function f, for EDSI even when T₁ and T₂ are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false.
- An 𝒪(N₁log N₁log n₁+N₂log N₂log n₂)-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when T₁ and T₂ are given in a compact representation, we show that the problem is NP-complete.
- An 𝒪(N₁m₂+N₂m₁)-time algorithm for EDSI.
- An Õ(N₁^{ω-1}n₂+N₂^{ω-1}n₁)-time algorithm for EDSI, where ω is the exponent of matrix multiplication; the Õ notation suppresses factors that are polylogarithmic in the input size.
We also show that the techniques we develop have applications outside of ED string comparison.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.11/LIPIcs.CPM.2023.11.pdf
elastic-degenerate string
sequence comparison
languages intersection
pangenome
acronym identification
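The language ℒ(T) defined in the abstract of this record, and the EDSI question itself, can be written down directly for toy inputs. The sketch below enumerates both languages explicitly, so it takes exponential time in general and is in no way one of the algorithms listed in the abstract; it assumes an ED string is encoded as a Python list of sets of strings, and the function names are hypothetical.

```python
from itertools import product


def language(ed_string):
    """All strings S_1...S_n with S_i taken from the i-th set."""
    return {"".join(choice) for choice in product(*ed_string)}


def ed_strings_intersect(t1, t2):
    """Naive EDSI check: do the two languages share at least one string?"""
    return bool(language(t1) & language(t2))
```

For instance, `[{"a", "ab"}, {"c", "bc"}]` generates `{"ac", "abc", "abbc"}`, which intersects the language of `[{"a"}, {"b"}, {"c"}]` even though the two ED strings have different lengths.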
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
12:1
12:22
10.4230/LIPIcs.CPM.2023.12
article
Compressed Indexing for Consecutive Occurrences
Gawrychowski, Paweł
1
Gourdel, Garance
2
Starikovskaya, Tatiana
3
Steiner, Teresa Anna
4
Institute of Computer Science, University of Wrocław, Poland
DI/ENS, PSL Research University, IRISA Inria Rennes, France
DI/ENS, PSL Research University, Paris, France
DTU Compute, Technical University of Denmark, Lyngby, Denmark
The fundamental question considered in algorithms on strings is that of indexing, that is, preprocessing a given string for specific queries. By now we have a number of efficient solutions for this problem when the queries ask for an exact occurrence of a given pattern P. However, practical applications motivate the necessity of considering more complex queries, for example concerning near occurrences of two patterns. Recently, Bille et al. [CPM 2021] introduced a variant of such queries, called gapped consecutive occurrences, in which a query consists of two patterns P₁ and P₂ and a range [a,b], and one must find all consecutive occurrences (q₁,q₂) of P₁ and P₂ such that q₂-q₁ ∈ [a,b]. By their results, we cannot hope for a very efficient indexing structure for such queries, even if a = 0 is fixed (although at the same time they provided a non-trivial upper bound). Motivated by this, we focus on a text given as a straight-line program (SLP) and design an index taking space polynomial in the size of the grammar that answers such queries in time optimal up to polylog factors.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.12/LIPIcs.CPM.2023.12.pdf
Compressed indexing
two patterns
consecutive occurrences
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
13:1
13:19
10.4230/LIPIcs.CPM.2023.13
article
Order-Preserving Squares in Strings
Gawrychowski, Paweł
1
https://orcid.org/0000-0002-6993-5440
Ghazawi, Samah
2
Landau, Gad M.
2
3
Institute of Computer Science, University of Wrocław, Poland
Department of Computer Science, University of Haifa, Israel
Department of Computer Science and Engineering, NYU Tandon School of Engineering, New York University, Brooklyn, NY, USA
An order-preserving square in a string is a fragment of the form uv where u ≠ v and u is order-isomorphic to v. We show that a string w of length n over an alphabet of size σ contains 𝒪(σn) order-preserving squares that are distinct as words. This improves the upper bound of 𝒪(σ²n) by Kociumaka, Radoszewski, Rytter, and Waleń [TCS 2016]. Further, for every σ and n we exhibit a string with Ω(σn) order-preserving squares that are distinct as words, thus establishing that our upper bound is asymptotically tight. Finally, we design an 𝒪(σn) time algorithm that outputs all order-preserving squares that occur in a given string and are distinct as words. By our lower bound, this is optimal in the worst case.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.13/LIPIcs.CPM.2023.13.pdf
repetitions
distinct squares
order-isomorphism
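The definition in the abstract of this record can be checked by brute force on short strings. The sketch below assumes that u and v are order-isomorphic when they have equal length and u[i] ≤ u[j] holds exactly when v[i] ≤ v[j] for all positions i, j; it follows the abstract in requiring u ≠ v. It is a naive enumeration, not the 𝒪(σn)-time algorithm of the paper, and the function names are hypothetical.

```python
def order_isomorphic(u, v):
    """Equal length and identical relative order at every pair of positions."""
    if len(u) != len(v):
        return False
    n = len(u)
    return all(
        (u[i] <= u[j]) == (v[i] <= v[j])
        for i in range(n)
        for j in range(n)
    )


def order_preserving_squares(w):
    """Distinct (as words) fragments uv of w with u != v, u order-isomorphic to v."""
    found = set()
    n = len(w)
    for start in range(n):
        for half in range(1, (n - start) // 2 + 1):
            u = w[start:start + half]
            v = w[start + half:start + 2 * half]
            if u != v and order_isomorphic(u, v):
                found.add(u + v)
    return found
```

On `"abbc"` this reports `"ab"`, `"bc"` and `"abbc"`; the fragment `"bb"` is skipped because its two halves are equal.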
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
14:1
14:18
10.4230/LIPIcs.CPM.2023.14
article
MUL-Tree Pruning for Consistency and Compatibility
Hampson, Christopher
1
https://orcid.org/0000-0002-6111-9465
Harvey, Daniel J.
2
Iliopoulos, Costas S.
1
https://orcid.org/0000-0003-3909-0077
Jansson, Jesper
2
https://orcid.org/0000-0001-6859-8932
Lim, Zara
1
https://orcid.org/0000-0001-6528-6060
Sung, Wing-Kin
3
4
5
https://orcid.org/0000-0001-7806-7086
Department of Informatics, King’s College London, UK
Graduate School of Informatics, Kyoto University, Japan
Department of Chemical Pathology, The Chinese University of Hong Kong, China
Hong Kong Genome Institute, Hong Kong Science Park, Shatin, China
Laboratory of Computational Genomics, Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, China
A multi-labelled tree (or MUL-tree) is a rooted tree leaf-labelled by a set of labels, where each label may appear more than once in the tree. We consider the MUL-tree Set Pruning for Consistency problem (MULSETPC), which takes as input a set of MUL-trees and asks whether there exists a perfect pruning of each MUL-tree that results in a consistent set of single-labelled trees. MULSETPC was proven to be NP-complete by Gascon et al. when the MUL-trees are binary, each leaf label is used at most three times, and the number of MUL-trees is unbounded. Determining the computational complexity of the problem when the number of MUL-trees is constant was left as an open problem.
Here, we resolve this question by proving a much stronger result, namely that MULSETPC is NP-complete even when there are only two MUL-trees, every leaf label is used at most twice, and every MUL-tree is either binary or has constant height. Furthermore, we introduce an extension of MULSETPC that we call MULSETPComp, which replaces the notion of consistency with compatibility, and prove that MULSETPComp is NP-complete even when there are only two MUL-trees, every leaf label is used at most thrice, and every MUL-tree has constant height. Finally, we present a polynomial-time algorithm for instances of MULSETPC with a constant number of binary MUL-trees, in the special case where every leaf label occurs exactly once in at least one MUL-tree.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.14/LIPIcs.CPM.2023.14.pdf
multi-labelled tree
phylogenetic tree
consistent
compatible
pruning
algorithm
NP-complete
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
15:1
15:15
10.4230/LIPIcs.CPM.2023.15
article
Linear-Time Computation of Cyclic Roots and Cyclic Covers of a String
Iliopoulos, Costas S.
1
https://orcid.org/0000-0003-3909-0077
Kociumaka, Tomasz
2
https://orcid.org/0000-0002-2477-1702
Radoszewski, Jakub
3
https://orcid.org/0000-0002-0067-6401
Rytter, Wojciech
3
https://orcid.org/0000-0002-9162-6724
Waleń, Tomasz
3
https://orcid.org/0000-0002-7369-3309
Zuba, Wiktor
4
https://orcid.org/0000-0002-1988-3507
Department of Informatics, King’s College London, London, UK
Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
Institute of Informatics, University of Warsaw, Poland
CWI, Amsterdam, The Netherlands
Cyclic versions of covers and roots of a string are considered in this paper. A prefix V of a string S is a cyclic root of S if S is a concatenation of cyclic rotations of V. A prefix V of S is a cyclic cover of S if the occurrences of the cyclic rotations of V cover all positions of S. We present 𝒪(n)-time algorithms computing all cyclic roots (using number-theoretic tools) and all cyclic covers (using tools related to seeds) of a length-n string over an integer alphabet. Our results improve upon 𝒪(n log log n) and 𝒪(n log n) time complexities of recent algorithms of Grossi et al. (WALCOM 2023) for the respective problems and provide novel approaches to the problems. As a by-product, we obtain an optimal data structure for Internal Circular Pattern Matching queries that generalize Internal Pattern Matching and Cyclic Equivalence queries of Kociumaka et al. (SODA 2015).
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.15/LIPIcs.CPM.2023.15.pdf
cyclic cover
cyclic root
circular pattern matching
internal pattern matching
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
16:1
16:16
10.4230/LIPIcs.CPM.2023.16
article
Faster Prefix-Sorting Algorithms for Deterministic Finite Automata
Kim, Sung-Hwan
1
https://orcid.org/0000-0002-1117-5020
Olivares, Francisco
2
3
https://orcid.org/0000-0001-7881-9794
Prezza, Nicola
1
https://orcid.org/0000-0003-3553-4953
DAIS, Ca' Foscari University of Venice, Italy
CeBiB - Centre for Biotechnology and Bioengineering, Santiago, Chile
Department of Computer Science, University of Chile, Santiago, Chile
Sorting is a fundamental algorithmic pre-processing technique which often allows data to be represented more compactly and, at the same time, speeds up search queries on it. In this paper, we focus on the well-studied problem of sorting and indexing string sets. Since the introduction of suffix trees in 1973, dozens of suffix sorting algorithms have been described in the literature. In 2017, these techniques were extended to sets of strings described by means of finite automata: the theory of Wheeler graphs [Gagie et al., TCS'17] introduced automata whose states can be totally-sorted according to the co-lexicographic (co-lex in the following) order of the prefixes of words accepted by the automaton. More recently, in [Cotumaccio, Prezza, SODA'21] it was shown how to extend these ideas to arbitrary automata by means of partial co-lex orders. This work showed that a co-lex order of minimum width (thus optimizing search query times) on deterministic finite automata (DFAs) can be computed in O(m² + n^{5/2}) time, m being the number of transitions and n the number of states of the input DFA.
In this paper, we exhibit new combinatorial properties of the minimum-width co-lex order of DFAs and exploit them to design faster prefix sorting algorithms. In particular, we describe two algorithms sorting arbitrary DFAs in O(mn) and O(n² log n) time, respectively, and an algorithm sorting acyclic DFAs in O(m log n) time. Within these running times, all algorithms also compute a smallest chain partition of the partial order (required to index the DFA). We present experimental results showing that an optimized implementation of the O(n² log n)-time algorithm exhibits a nearly-linear behaviour on large deterministic pan-genomic graphs and is thus also of practical interest.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.16/LIPIcs.CPM.2023.16.pdf
String Matching
Deterministic Finite Automata
Graph Indexing
Co-lexicographical Sorting
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
17:1
17:21
10.4230/LIPIcs.CPM.2023.17
article
Encoding Hard String Problems with Answer Set Programming
Köppl, Dominik
1
https://orcid.org/0000-0002-8721-4444
Department of Computer Science, Universität Münster, Germany
Despite the simple, one-dimensional nature of strings, several computationally hard problems on strings are known. Tackling hard problems beyond sizes of toy instances with straightforward solutions is infeasible. To solve these problems on datasets of even small sizes, effort has to be put into the conception of algorithms leveraging profound characteristics of the input. Finding these characteristics can be eased by rapidly creating and evaluating prototypes of new concepts in how to tackle hard problems. Such a rapid-prototyping method for hard problems is answer set programming (ASP). In this light, we study the application of ASP to five NP-hard optimization problems in the field of strings. We provide MAX-SAT and ASP encodings, and empirically reason about the merits and flaws when working with ASP solvers.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.17/LIPIcs.CPM.2023.17.pdf
optimization problems
answer set programming
MAX-SAT encoding
NP-hard string problems
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
18:1
18:18
10.4230/LIPIcs.CPM.2023.18
article
On the Complexity of Parameterized Local Search for the Maximum Parsimony Problem
Komusiewicz, Christian
1
https://orcid.org/0000-0003-0829-7032
Linz, Simone
2
https://orcid.org/0000-0003-0862-9594
Morawietz, Nils
3
https://orcid.org/0000-0002-7283-4982
Schestag, Jannik
3
https://orcid.org/0000-0001-7767-2970
Institute of Computer Science, Friedrich Schiller Universität Jena, Germany
School of Computer Science, University of Auckland, New Zealand
Fachbereich Mathematik und Informatik, Philipps-Universität Marburg, Germany
Maximum Parsimony is the problem of computing a most parsimonious phylogenetic tree for a taxa set X from character data for X. A common strategy to attack this notoriously hard problem is to perform a local search over the phylogenetic tree space. Here, one is given a phylogenetic tree T and wants to find a more parsimonious tree in the neighborhood of T. We study the complexity of this problem when the neighborhood contains all trees within distance k for several classic distance functions. For the nearest neighbor interchange (NNI), subtree prune and regraft (SPR), tree bisection and reconnection (TBR), and edge contraction and refinement (ECR) distances, we show that, under the exponential time hypothesis, there are no algorithms with running time |I|^o(k) where |I| is the total input size. Hence, brute-force algorithms with running time |X|^𝒪(k) ⋅ |I| are essentially optimal.
In contrast to the above distances, we observe that for the sECR-distance, where the contracted edges are constrained to form a subtree, a better solution within distance k can be found in k^𝒪(k) ⋅ |I|^𝒪(1) time.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.18/LIPIcs.CPM.2023.18.pdf
phylogenetic trees
parameterized complexity
tree distances
NNI
TBR
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
19:1
19:10
10.4230/LIPIcs.CPM.2023.19
article
String Factorization via Prefix Free Families
Kraus, Matan
1
Lewenstein, Moshe
1
Popa, Alexandru
2
Porat, Ely
1
Sadia, Yonathan
1
Bar-Ilan University, Ramat-Gan, Israel
Faculty of Mathematics and Computer Science, University of Bucharest, Romania
A factorization of a string S is a partition of S into substrings u_1,… ,u_k such that S = u_1 u_2 ⋯ u_k. Such a partition is called equality-free if no two factors are equal: u_i ≠ u_j, ∀ i,j with i ≠ j. The maximum equality-free factorization problem is to find, for a given string S, the largest integer k for which S admits an equality-free factorization with k factors.
Equality-free factorizations have lately received attention because of their applications in DNA self-assembly. The best approximation algorithm known for the problem is the natural greedy algorithm, which iteratively chooses, from left to right, the shortest factor that has not appeared before. This algorithm has a √n approximation ratio (SOFSEM 2020) and it is an open problem whether there is a better solution.
Our main result is to show that the natural greedy algorithm is a Θ(n^{1/4}) approximation algorithm for the maximum equality-free factorization problem. Thus, we disprove one of the conjectures of Mincu and Popa (SOFSEM 2020) according to which the greedy algorithm is a Θ(√n) approximation.
The most challenging part of the proof is to show that the greedy algorithm is an O(n^{1/4}) approximation. We obtain this algorithm via prefix free factor families, i.e. a set of non-overlapping factors of the string which are pairwise non-prefixes of each other. In the paper we show the relation between prefix free factor families and the maximum equality-free factorization. Moreover, as a byproduct we present another approximation algorithm that achieves an approximation ratio of O(n^{1/4}) that we believe is of independent interest and may lead to improved algorithms. We then show that the natural greedy algorithm has an approximation ratio that is Ω(n^{1/4}) via a clever analysis which shows that the greedy algorithm is Θ(n^{1/4}) for the maximum equality-free factorization problem.
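The natural greedy algorithm described above can be sketched in a few lines. This is an illustrative sketch only; the function name `greedy_equality_free` is hypothetical, and the handling of a leftover suffix (appending it to the last factor when every prefix of the remainder has already been used) is an assumption, since the abstract does not specify it.

```python
def greedy_equality_free(s: str) -> list[str]:
    """Scan left to right, always cutting the shortest factor not used so far."""
    used: set[str] = set()
    factors: list[str] = []
    i, n = 0, len(s)
    while i < n:
        j = i + 1
        # Extend the candidate factor until it differs from all earlier factors.
        while j <= n and s[i:j] in used:
            j += 1
        if j > n:
            # Assumption: every prefix of the remaining suffix is already used,
            # so the suffix is merged into the last factor.
            factors[-1] += s[i:]
            break
        factors.append(s[i:j])
        used.add(s[i:j])
        i = j
    return factors
```

On "aaaa" the sketch first cuts "a", then "aa", and finally merges the leftover "a" into the last factor, yielding two factors. The merged factor stays distinct from earlier ones because each factor is the shortest unused prefix at its position, so no earlier factor can properly extend a later one.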
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.19/LIPIcs.CPM.2023.19.pdf
string factorization
NP-hard problem
approximation algorithm
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
20:1
20:12
10.4230/LIPIcs.CPM.2023.20
article
Improving the Sensitivity of MinHash Through Hash-Value Analysis
Kucherov, Gregory
1
https://orcid.org/0000-0001-5899-5424
Skiena, Steven
2
https://orcid.org/0000-0003-0397-7514
LIGM, CNRS/Université Gustave Eiffel, Marne-la-Vallée, France
Dept. of Computer Science, Stony Brook University, Stony Brook, NY, USA
MinHash sketching is an important algorithm for efficient document retrieval and bioinformatics. We show that the value of the matching MinHash codes conveys additional information about the Jaccard similarity of two sets S and T over and above the fact that the MinHash codes agree. This observation holds the potential to increase the sensitivity of MinHash-based retrieval systems. We analyze the expected Jaccard similarity of two sets as a function of observing a matching MinHash value a under a reasonable prior distribution on intersection set sizes, and present a practical approach to using MinHash values to improve the sensitivity of traditional Jaccard similarity estimation, based on the Kolmogorov-Smirnov statistical test for sample distributions. Experiments over a wide range of hash function counts and set similarities show a small but consistent improvement over chance at predicting over/under-estimation, yielding an average accuracy of 61% over the range of experiments.
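For readers unfamiliar with the baseline, the following sketch shows traditional MinHash estimation of Jaccard similarity and exposes the matching hash values that the abstract refers to. It does not implement the Kolmogorov-Smirnov-based refinement of the paper. The helper names `seeded_hash`, `minhash_sketch`, `jaccard_estimate` and `matching_values` are hypothetical, and the use of BLAKE2b as a seeded hash family is an assumption made for reproducibility.

```python
import hashlib


def seeded_hash(seed: int, item) -> int:
    """64-bit hash of item under the hash function numbered `seed`."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(items, k: int) -> list[int]:
    """For each of k hash functions, keep the minimum hash over the set."""
    return [min(seeded_hash(seed, x) for x in items) for seed in range(k)]


def jaccard_estimate(sketch_a: list[int], sketch_b: list[int]) -> float:
    """Fraction of hash functions whose minima agree."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)


def matching_values(sketch_a: list[int], sketch_b: list[int]) -> list[float]:
    """Matching minima scaled to [0, 1): the extra signal discussed above."""
    return [a / 2**64 for a, b in zip(sketch_a, sketch_b) if a == b]
```

The traditional estimator uses only the count of agreeing positions; `matching_values` returns the agreeing minima themselves so that their distribution can be inspected.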
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.20/LIPIcs.CPM.2023.20.pdf
MinHash sketching
sequence similarity
hashing
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
21:1
21:20
10.4230/LIPIcs.CPM.2023.21
article
Suffix-Prefix Queries on a Dictionary
Loukides, Grigorios
1
https://orcid.org/0000-0003-0888-5061
Pissis, Solon P.
2
3
https://orcid.org/0000-0002-1445-1932
Thankachan, Sharma V.
4
https://orcid.org/0000-0002-6852-1035
Zuba, Wiktor
2
https://orcid.org/0000-0002-1988-3507
Department of Informatics, King’s College London, UK
CWI, Amsterdam, The Netherlands
Vrije Universiteit, Amsterdam, The Netherlands
North Carolina State University, Raleigh, NC, USA
In the all-pairs suffix-prefix (APSP) problem, we are given a dictionary R of k strings, S_1,…,S_k, of total length n, and we are asked to find the length SPL_{i,j} of the longest string that is both a suffix of S_i and a prefix of S_j, for all i,j ∈ [1,k]. APSP is a classic problem in string algorithms with many applications in bioinformatics. When all strings of the dictionary are over an integer alphabet of size σ ≤ n^𝒪(1), APSP can be solved in the optimal 𝒪(n+k²) time with the use of the generalized suffix tree of the dictionary [Gusfield et al., Inf. Process. Lett. 1992].
In many bioinformatics applications, such as in sequence assembly, the size k of dictionary R is very large. In particular, k² usually dominates n, and thus the k² factor is the bottleneck both in the time and in the space complexity of such applications. We thus initiate a holistic study on several data structure variants of APSP. In particular, we consider the following types of queries:
- One-to-One(i,j): output SPL_{i,j}.
- One-to-All(i): output SPL_{i,j} for every j ∈ [1,k].
- Report(i,𝓁): output all distinct j ∈ [1,k] such that SPL_{i,j} ≥ 𝓁, where 𝓁 ≥ 0 is an integer.
- Count(i,𝓁): output the number of distinct j ∈ [1,k] such that SPL_{i,j} ≥ 𝓁, where 𝓁 ≥ 0 is an integer.
- Top(i,K): output K distinct j ∈ [1,k] with the highest values of SPL_{i,j}, breaking ties arbitrarily.
We assume the standard word RAM model of computation with word size w = Ω(log n) and an integer alphabet of size σ ≤ n^𝒪(1). We show the following upper bounds:
Query | Space (words) | Query time | Note
One-to-One(i,j) | 𝒪(n) | 𝒪(log log k) | Theorem 11
One-to-All(i) | 𝒪(n) | 𝒪(k) | Theorem 14
Report(i,𝓁) | 𝒪(n) | 𝒪(log n/log log n+output) | Theorem 19(i)
Count(i,𝓁) | 𝒪(n) | 𝒪(log n/log log n) | Theorem 19(ii)
Top(i,K) | 𝒪(n) | 𝒪(log² n/log log n+K) | Theorem 22
We also present efficient algorithms for constructing these data structures.
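The semantics of the five query types can be illustrated with a naive reference implementation. This is only a sketch for small dictionaries: it recomputes SPL values by brute force and uses none of the data structures behind the bounds in the table. The function names are hypothetical, and indices are 0-based here, whereas the abstract uses 1-based indices.

```python
def spl(si: str, sj: str) -> int:
    """Length of the longest string that is a suffix of si and a prefix of sj."""
    for length in range(min(len(si), len(sj)), 0, -1):
        if si.endswith(sj[:length]):
            return length
    return 0


def one_to_one(R: list[str], i: int, j: int) -> int:
    return spl(R[i], R[j])


def one_to_all(R: list[str], i: int) -> list[int]:
    return [spl(R[i], sj) for sj in R]


def report(R: list[str], i: int, ell: int) -> list[int]:
    return [j for j, v in enumerate(one_to_all(R, i)) if v >= ell]


def count(R: list[str], i: int, ell: int) -> int:
    return len(report(R, i, ell))


def top(R: list[str], i: int, K: int) -> list[int]:
    values = one_to_all(R, i)
    # Stable sort: ties are broken by smaller index (one arbitrary choice).
    return sorted(range(len(R)), key=lambda j: -values[j])[:K]
```

With R = ["abcab", "cabx", "abab"], the suffix "cab" of the first string is a prefix of the second, so One-to-One(0, 1) returns 3. Following the definition literally, SPL_{i,i} equals the length of S_i.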
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.21/LIPIcs.CPM.2023.21.pdf
all-pairs suffix-prefix
suffix-prefix queries
internal pattern matching
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
22:1
22:15
10.4230/LIPIcs.CPM.2023.22
article
Merging Sorted Lists of Similar Strings
Myers, Gene
1
2
https://orcid.org/0000-0002-6580-7839
Okinawa Institute of Science and Technology, Japan
MPI for Molecular Cell Biology and Genetics, Dresden, Germany
Merging T sorted, non-redundant lists containing M elements into a single sorted, non-redundant result of size N ≥ M/T is a classic problem, typically solved in practice in O(M log T) time with a priority-queue data structure, the most basic of which is the simple heap. We revisit this problem in the situation where the list elements are strings and the lists contain many identical or nearly identical elements. By keeping simple auxiliary information with each heap node, we devise an O(M log T+S) worst-case method that performs no more character comparisons than the sum of the lengths of all the strings S, and another O(M log (T/ē)+S) method that becomes progressively more efficient as a function of the fraction of equal elements ē = M/N between input lists, reaching linear time when the lists are all identical. The methods perform favorably in practice versus an alternate formulation based on a trie.
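The classic heap-based baseline mentioned at the start of the abstract can be sketched as follows. This is the plain O(M log T) merge using full string comparisons, not the improved methods of the paper, which keep auxiliary information with each heap node. The function name `merge_unique` is hypothetical.

```python
import heapq


def merge_unique(lists: list[list[str]]) -> list[str]:
    """Merge sorted, duplicate-free lists into one sorted, duplicate-free list."""
    # Each heap entry is (current string, list index, position in that list).
    heap = [(lst[0], t, 0) for t, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out: list[str] = []
    while heap:
        value, t, pos = heapq.heappop(heap)
        if not out or out[-1] != value:
            out.append(value)  # skip copies of the element emitted last
        if pos + 1 < len(lists[t]):
            heapq.heappush(heap, (lists[t][pos + 1], t, pos + 1))
    return out
```

Every heap operation here compares whole strings from their first character, which is the cost that the abstract's methods avoid when the lists share many identical or nearly identical elements.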
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.22/LIPIcs.CPM.2023.22.pdf
heap
trie
longest common prefix
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
23:1
23:15
10.4230/LIPIcs.CPM.2023.23
article
PalFM-Index: FM-Index for Palindrome Pattern Matching
Nagashita, Shinya
1
I, Tomohiro
1
https://orcid.org/0000-0001-9106-6192
Kyushu Institute of Technology, Fukuoka, Japan
Palindrome pattern matching (pal-matching) is a kind of generalized pattern matching in which two strings x and y of the same length are considered to match (pal-match) if they have the same palindromic structures, i.e., for any possible 1 ≤ i < j ≤ |x| = |y|, x[i..j] is a palindrome if and only if y[i..j] is a palindrome. The pal-matching problem is the problem of searching a text for the occurrences of the substrings that pal-match with a pattern. Given a text T of length n over an alphabet of size σ, an index for pal-matching is to support, given a pattern P of length m, the counting queries that compute the number occ of occurrences of P and the locating queries that compute the occurrences of P. The authors in [I et al., Theor. Comput. Sci., 2013] proposed an O(n lg n)-bit data structure to support the counting queries in O(m lg σ) time and the locating queries in O(m lg σ + occ) time. In this paper, we propose an FM-index type index for the pal-matching problem, which we call the PalFM-index, that occupies 2n lg min(σ, lg n) + 2n + o(n) bits of space and supports the counting queries in O(m) time. The PalFM-index can support the locating queries in O(m + Δ occ) time by adding (n/Δ) lg n + n + o(n) bits of space, where Δ is a parameter chosen from {1, 2, … , n} in the preprocessing phase.
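The pal-match relation defined above can be tested directly from the definition. The following is a brute-force sketch for small inputs, not the PalFM-index; the function names `pal_match` and `pal_occurrences` are hypothetical.

```python
def is_palindrome(s: str) -> bool:
    return s == s[::-1]


def pal_match(x: str, y: str) -> bool:
    """True if x and y have palindromes at exactly the same intervals."""
    if len(x) != len(y):
        return False
    n = len(x)
    # Intervals with i < j in the definition are substrings of length >= 2.
    return all(
        is_palindrome(x[i:j]) == is_palindrome(y[i:j])
        for i in range(n)
        for j in range(i + 2, n + 1)
    )


def pal_occurrences(text: str, pattern: str) -> list[int]:
    """0-based start positions of substrings of text that pal-match pattern."""
    m = len(pattern)
    return [
        i for i in range(len(text) - m + 1)
        if pal_match(text[i:i + m], pattern)
    ]
```

For example, "abab" and "cdcd" pal-match although they share no characters, because both have palindromes exactly at the two length-3 intervals.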
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.23/LIPIcs.CPM.2023.23.pdf
Palindrome matching
Generalized string pattern matching
Indexing
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
24:1
24:17
10.4230/LIPIcs.CPM.2023.24
article
Computing MEMs on Repetitive Text Collections
Navarro, Gonzalo
1
2
Center for Biotechnology and Bioengineering (CeBiB), Santiago, Chile
Department of Computer Science, University of Chile, Santiago, Chile
We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern P[1..m] on a large repetitive text collection T[1..n], which is represented as a (hopefully much smaller) run-length context-free grammar of size g_{rl}. We show that the problem can be solved in time O(m² log^ε n), for any constant ε > 0, on a data structure of size O(g_{rl}). Further, on a locally consistent grammar of size O(δ log (n/δ)), the time decreases to O(m log m(log m + log^ε n)). The value δ is a function of the substring complexity of T and Ω(δ log (n/δ)) is a tight lower bound on the compressibility of repetitive texts T, so our structure has optimal size in terms of n and δ.
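The abstract does not spell out δ. In the repetitiveness literature it is commonly defined as the maximum over k ≥ 1 of d_k(T)/k, where d_k(T) is the number of distinct length-k substrings of T. Assuming that definition, which is not taken from this abstract, a brute-force sketch looks as follows; the function name `substring_complexity_delta` is hypothetical.

```python
def substring_complexity_delta(t: str) -> float:
    """delta = max over k of (number of distinct length-k substrings) / k.

    Assumption: this is the usual definition of the measure; the abstract
    only says that delta is a function of the substring complexity of T.
    """
    n = len(t)
    best = 0.0
    for k in range(1, n + 1):
        distinct = {t[i:i + k] for i in range(n - k + 1)}
        best = max(best, len(distinct) / k)
    return best
```

A highly repetitive string such as "aaaa" has only one distinct substring of each length, so the sketch returns 1.0, while "abab" returns 2.0 because of its two distinct characters.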
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.24/LIPIcs.CPM.2023.24.pdf
grammar-based indices
maximal exact matches
locally consistent grammars
substring complexity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
25:1
25:17
10.4230/LIPIcs.CPM.2023.25
article
L-Systems for Measuring Repetitiveness
Navarro, Gonzalo
1
2
https://orcid.org/0000-0002-2286-741X
Urbina, Cristian
1
2
https://orcid.org/0000-0001-8979-9055
Department of Computer Science, University of Chile, Santiago, Chile
Centre for Biotechnology and Bioengineering (CeBiB), Santiago, Chile
In order to use them for compression, we extend L-systems (without ε-rules) with two parameters d and n, and also a coding τ, which determines unambiguously a string w = τ(φ^d(s))[1:n], where φ is the morphism of the system, and s is its axiom. The length of the shortest description of an L-system generating w is known as 𝓁, and it is arguably a relevant measure of repetitiveness that builds on the self-similarities that arise in the sequence.
In this paper, we deepen the study of the measure 𝓁 and its relation with a better-established measure called δ, which builds on substring complexity. Our results show that 𝓁 and δ are largely orthogonal, in the sense that one can be much larger than the other, depending on the case. This suggests that both mechanisms capture different kinds of regularities related to repetitiveness.
We then show that the recently introduced NU-systems, which combine the capabilities of L-systems with bidirectional macro schemes, can be asymptotically strictly smaller than both mechanisms for the same fixed string family, which makes the size ν of the smallest NU-system the unique smallest reachable repetitiveness measure to date. We conclude that in order to achieve better compression, we should combine morphism substitution with copy-paste mechanisms.
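The string w = τ(φ^d(s))[1:n] defined in the first paragraph of this abstract can be generated directly from its components. The following is a minimal sketch that assumes φ and τ are given as Python dictionaries over single-character symbols and that τ is a letter-to-letter coding; the function name `l_system_string` is hypothetical.

```python
def l_system_string(phi: dict[str, str], tau: dict[str, str],
                    axiom: str, d: int, n: int) -> str:
    """Apply the morphism phi d times to the axiom, code with tau, keep n symbols."""
    w = axiom
    for _ in range(d):
        w = "".join(phi[c] for c in w)  # one application of the morphism
    return "".join(tau[c] for c in w)[:n]


# Example: the Fibonacci morphism a -> ab, b -> a with axiom "a".
fib_phi = {"a": "ab", "b": "a"}
binary_coding = {"a": "0", "b": "1"}
prefix = l_system_string(fib_phi, binary_coding, "a", d=5, n=10)
```

Here the description (φ, τ, s, d, n) has constant size while the length of φ^d(s) grows with d, which is the kind of self-similarity that the measure 𝓁 is meant to capture. The sketch materializes the whole of φ^d(s) before truncating, so it is suitable only for small d.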
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.25/LIPIcs.CPM.2023.25.pdf
L-systems
String morphisms
Repetitiveness measures
Text compression
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2023-06-21
259
26:1
26:14
10.4230/LIPIcs.CPM.2023.26
article
MONI Can Find k-MEMs
Tatarnikov, Igor
1
https://orcid.org/0000-0001-5728-7493
Shahrabi Farahani, Ardavan
1
Kashgouli, Sana
1
Gagie, Travis
1
https://orcid.org/0000-0003-3689-327X
Dalhousie University, Halifax, Canada
Suppose we are asked to index a text T[0..n-1] such that, given a pattern P[0..m-1], we can quickly report the maximal substrings of P that each occur in T at least k times. We first show how we can add O(r log n) bits to Rossi et al.’s recent MONI index, where r is the number of runs in the Burrows-Wheeler Transform of T, such that it supports such queries in O(k m log n) time. We then show how, if we are given k at construction time, we can reduce the query time to O(m log n).
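The query described above can be specified by a brute-force reference implementation. This sketch scans T for every candidate substring and is unrelated to how MONI answers the query; the function names `count_occurrences` and `k_mems` are hypothetical, and occurrences are counted with overlaps, which is an assumption.

```python
def count_occurrences(t: str, s: str) -> int:
    """Number of (possibly overlapping) occurrences of s in t."""
    return sum(1 for i in range(len(t) - len(s) + 1) if t.startswith(s, i))


def k_mems(t: str, p: str, k: int = 1) -> list[tuple[int, int]]:
    """(start, length) of maximal substrings of p occurring at least k times in t."""
    m = len(p)
    longest = []
    for i in range(m):
        length = 0
        # Extend to the right while the substring still occurs k times.
        while i + length < m and count_occurrences(t, p[i:i + length + 1]) >= k:
            length += 1
        longest.append(length)
    result = []
    for i, length in enumerate(longest):
        # Keep the match only if it cannot be extended to the left.
        if length > 0 and (i == 0 or longest[i - 1] < length + 1):
            result.append((i, length))
    return result
```

With T = "abracadabra" and P = "cabra", setting k = 1 reports "ca" and "abra", while k = 2 reports only "abra" because "c" occurs once in T.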
https://drops.dagstuhl.de/storage/00lipics/lipics-vol259-cpm2023/LIPIcs.CPM.2023.26/LIPIcs.CPM.2023.26.pdf
Compact data structures
Burrows-Wheeler Transform
run-length compression
maximal exact matches