Document

**Published in:** LIPIcs, Volume 296, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)

Given an increasing sequence of integers x₁,…,x_n from a universe {0,…,u-1}, the monotone minimal perfect hash function (MMPHF) for this sequence is a data structure that answers the following rank queries: rank(x) = i if x = x_i, for i ∈ {1,…,n}, and rank(x) is arbitrary otherwise. Assadi, Farach-Colton, and Kuszmaul recently presented at SODA'23 a proof of the lower bound Ω(n min{log log log u, log n}) for the bits of space required by MMPHF, provided u ≥ n 2^{2^{√{log log n}}}, which is tight since there is a data structure for MMPHF that attains this space bound (and answers the queries in O(log u) time). In this paper, we close the remaining gap by proving that, for u ≥ (1+ε)n, where ε > 0 is any constant, the tight lower bound is Ω(n min{log log log u/n, log n}), which is also attainable; we observe that, for all reasonable cases when n < u < (1+ε)n, known facts imply tight bounds, which virtually settles the problem. Along the way we substantially simplify the proof of Assadi et al. replacing a part of their heavy combinatorial machinery by trivial observations. However, an important part of the proof still remains complicated. This part of our paper repeats arguments of Assadi et al. and is not novel. Nevertheless, we include it, for completeness, offering a somewhat different perspective on these arguments.

Dmitry Kosolobov. Simplified Tight Bounds for Monotone Minimal Perfect Hashing. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 19:1-19:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{kosolobov:LIPIcs.CPM.2024.19, author = {Kosolobov, Dmitry}, title = {{Simplified Tight Bounds for Monotone Minimal Perfect Hashing}}, booktitle = {35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)}, pages = {19:1--19:13}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-326-3}, ISSN = {1868-8969}, year = {2024}, volume = {296}, editor = {Inenaga, Shunsuke and Puglisi, Simon J.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.19}, URN = {urn:nbn:de:0030-drops-201296}, doi = {10.4230/LIPIcs.CPM.2024.19}, annote = {Keywords: monotone minimal perfect hashing, lower bound, MMPHF, hash} }

Document

**Published in:** LIPIcs, Volume 296, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)

The notions of synchronizing and partitioning sets are recently introduced variants of locally consistent parsings with a great potential in problem-solving. In this paper we propose a deterministic algorithm that constructs for a given readonly string of length n over the alphabet {0,1,…,n^{𝒪(1)}} a variant of a τ-partitioning set with size 𝒪(b) and τ = n/b using 𝒪(b) space and 𝒪(1/(ε)n) time provided b ≥ n^ε, for ε > 0. As a corollary, for b ≥ n^ε and constant ε > 0, we obtain linear time construction algorithms with 𝒪(b) space on top of the string for two major small-space indexes: a sparse suffix tree, which is a compacted trie built on b chosen suffixes of the string, and a longest common extension (LCE) index, which occupies 𝒪(b) space and allows us to compute the longest common prefix for any pair of substrings in 𝒪(n/b) time. For both, the 𝒪(b) construction storage is asymptotically optimal since the tree itself takes 𝒪(b) space and any LCE index with 𝒪(n/b) query time must occupy at least 𝒪(b) space by a known trade-off (at least for b ≥ Ω(n / log n)). In case of arbitrary b ≥ Ω(log² n), we present construction algorithms for the partitioning set, sparse suffix tree, and LCE index with 𝒪(nlog_b n) running time and 𝒪(b) space, thus also improving the state of the art.

Dmitry Kosolobov and Nikita Sivukhin. Construction of Sparse Suffix Trees and LCE Indexes in Optimal Time and Space. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 20:1-20:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

Copy BibTex To Clipboard

@InProceedings{kosolobov_et_al:LIPIcs.CPM.2024.20, author = {Kosolobov, Dmitry and Sivukhin, Nikita}, title = {{Construction of Sparse Suffix Trees and LCE Indexes in Optimal Time and Space}}, booktitle = {35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)}, pages = {20:1--20:18}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-326-3}, ISSN = {1868-8969}, year = {2024}, volume = {296}, editor = {Inenaga, Shunsuke and Puglisi, Simon J.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.20}, URN = {urn:nbn:de:0030-drops-201309}, doi = {10.4230/LIPIcs.CPM.2024.20}, annote = {Keywords: (\tau,\delta)-partitioning set, longest common extension, sparse suffix tree} }

Document

**Published in:** LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require Ω(log log n) time for queries provided 𝒪(n polylog n) space is available and weights are from [0..n], where n is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an 𝒪(n)-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string s, we need a data structure that can locate in the tree any substring s[p..q] of s in 𝒪(1) time (as if one descended from the root reading s[p..q] along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values.

Djamal Belazzougui, Dmitry Kosolobov, Simon J. Puglisi, and Rajeev Raman. Weighted Ancestors in Suffix Trees Revisited. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 8:1-8:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{belazzougui_et_al:LIPIcs.CPM.2021.8, author = {Belazzougui, Djamal and Kosolobov, Dmitry and Puglisi, Simon J. and Raman, Rajeev}, title = {{Weighted Ancestors in Suffix Trees Revisited}}, booktitle = {32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)}, pages = {8:1--8:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-186-3}, ISSN = {1868-8969}, year = {2021}, volume = {191}, editor = {Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.8}, URN = {urn:nbn:de:0030-drops-139594}, doi = {10.4230/LIPIcs.CPM.2021.8}, annote = {Keywords: suffix tree, weighted ancestors, irreducible LCP, deterministic substring hashing} }

Document

**Published in:** LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

Given d strings over the alphabet {0,1,...,sigma{-}1}, the classical Aho - Corasick data structure allows us to find all occ occurrences of the strings in any text T in O(|T| + occ) time using O(m log m) bits of space, where m is the number of edges in the trie containing the strings. Fix any constant epsilon in (0, 2). We describe a compressed solution for the problem that, provided sigma <=m^delta for a constant delta < 1, works in O(|T| 1/epsilon log(1/epsilon) + occ) time, which is O(|T| + occ) since epsilon is constant, and occupies mH_k + 1.443 m + epsilon m + O(d log m/d) bits of space, for all 0 <= k <= max{0,alpha log_sigma m - 2} simultaneously, where alpha in (0,1) is an arbitrary constant and H_k is the kth-order empirical entropy of the trie. Hence, we reduce the 3.443m term in the space bounds of previously best succinct solutions to (1.443 + epsilon)m, thus solving an open problem posed by Belazzougui. Further, we notice that L = log binom{sigma (m+1)}{m} - O(log(sigma m)) is a worst-case space lower bound for any solution of the problem and, for d = o(m) and constant epsilon, our approach allows to achieve L + epsilon m bits of space, which gives an evidence that, for d = o(m), the space of our data structure is theoretically optimal up to the epsilon m additive term and it is hardly possible to eliminate the term 1.443m. In addition, we refine the space analysis of previous works by proposing a more appropriate definition for H_k. We also simplify the construction for practice adapting the fixed block compression boosting technique, then implement our data structure, and conduct a number of experiments showing that it is comparable to the state of the art in terms of time and is superior in space.

Dmitry Kosolobov and Nikita Sivukhin. Compressed Multiple Pattern Matching. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 13:1-13:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{kosolobov_et_al:LIPIcs.CPM.2019.13, author = {Kosolobov, Dmitry and Sivukhin, Nikita}, title = {{Compressed Multiple Pattern Matching}}, booktitle = {30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)}, pages = {13:1--13:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-103-0}, ISSN = {1868-8969}, year = {2019}, volume = {128}, editor = {Pisanti, Nadia and P. Pissis, Solon}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.13}, URN = {urn:nbn:de:0030-drops-104847}, doi = {10.4230/LIPIcs.CPM.2019.13}, annote = {Keywords: multiple pattern matching, compressed space, Aho--Corasick automaton} }

Document

**Published in:** LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)

Given a threshold L and a set R = {R_1, ..., R_m} of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1,n] into set P of disjoint segments such that each segment [a,b] in P has length at least L and the number d(a,b)=|{R_i[a,b] : 1 <= i <= m}| of distinct substrings at segment [a,b] is minimized over [a,b] in P. The distinct substrings in the segments represent founder blocks that can be concatenated to form max{d(a,b) : [a,b] in P} founder sequences representing the original R such that crossovers happen only at segment boundaries. We give an optimal O(mn) time algorithm to solve the problem, improving over earlier O(mn^2). This improvement enables to exploit the algorithm on a pan-genomic setting of input strings being aligned haplotype sequences of complete human chromosomes, with a goal of finding a representative set of references that can be indexed for read alignment and variant calling. We implemented the new algorithm and give some experimental evidence on the practicality of the approach on this pan-genomic setting.

Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, and Veli Mäkinen. Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 15:1-15:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{norri_et_al:LIPIcs.WABI.2018.15, author = {Norri, Tuukka and Cazaux, Bastien and Kosolobov, Dmitry and M\"{a}kinen, Veli}, title = {{Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time}}, booktitle = {18th International Workshop on Algorithms in Bioinformatics (WABI 2018)}, pages = {15:1--15:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-082-8}, ISSN = {1868-8969}, year = {2018}, volume = {113}, editor = {Parida, Laxmi and Ukkonen, Esko}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.15}, URN = {urn:nbn:de:0030-drops-93175}, doi = {10.4230/LIPIcs.WABI.2018.15}, annote = {Keywords: Pan-genome indexing, founder reconstruction, dynamic programming, positional Burrows-Wheeler transform, range minimum query} }

Document

**Published in:** LIPIcs, Volume 96, 35th Symposium on Theoretical Aspects of Computer Science (STACS 2018)

This paper investigates the size in bits of the LZ77 encoding, which is the most popular and efficient variant of the Lempel-Ziv encodings used in data compression. We prove that, for a wide natural class of variable-length encoders for LZ77 phrases, the size of the greedily constructed LZ77 encoding on constant alphabets is within a factor O(log n / log log log n) of the optimal LZ77 encoding, where n is the length of the processed string. We describe a series of examples showing that, surprisingly, this bound is tight, thus improving both the previously known upper and lower bounds. Further, we obtain a more detailed bound O(min{z, log n / log log z}), which uses the number z of phrases in the greedy LZ77 encoding as a parameter, and construct a series of examples showing that this bound is tight even for binary alphabet. We then investigate the problem on non-constant alphabets: we show that the known O(log n) bound is tight even for alphabets of logarithmic size, and provide tight bounds for some other important cases.

Dmitry Kosolobov. Relations Between Greedy and Bit-Optimal LZ77 Encodings. In 35th Symposium on Theoretical Aspects of Computer Science (STACS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 96, pp. 46:1-46:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{kosolobov:LIPIcs.STACS.2018.46, author = {Kosolobov, Dmitry}, title = {{Relations Between Greedy and Bit-Optimal LZ77 Encodings}}, booktitle = {35th Symposium on Theoretical Aspects of Computer Science (STACS 2018)}, pages = {46:1--46:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-062-0}, ISSN = {1868-8969}, year = {2018}, volume = {96}, editor = {Niedermeier, Rolf and Vall\'{e}e, Brigitte}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2018.46}, URN = {urn:nbn:de:0030-drops-84894}, doi = {10.4230/LIPIcs.STACS.2018.46}, annote = {Keywords: Lempel-Ziv, LZ77 encoding, greedy LZ77, bit optimal LZ77} }

Document

**Published in:** LIPIcs, Volume 87, 25th Annual European Symposium on Algorithms (ESA 2017)

We present a deterministic algorithm that constructs in linear time and space the LZ-End parsing (a variation of LZ77) of a given string over an integer polynomially bounded alphabet.

Dominik Kempa and Dmitry Kosolobov. LZ-End Parsing in Linear Time. In 25th Annual European Symposium on Algorithms (ESA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 87, pp. 53:1-53:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{kempa_et_al:LIPIcs.ESA.2017.53, author = {Kempa, Dominik and Kosolobov, Dmitry}, title = {{LZ-End Parsing in Linear Time}}, booktitle = {25th Annual European Symposium on Algorithms (ESA 2017)}, pages = {53:1--53:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-049-1}, ISSN = {1868-8969}, year = {2017}, volume = {87}, editor = {Pruhs, Kirk and Sohler, Christian}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2017.53}, URN = {urn:nbn:de:0030-drops-78471}, doi = {10.4230/LIPIcs.ESA.2017.53}, annote = {Keywords: LZ-End, LZ77, construction algorithm, linear time} }

Document

**Published in:** LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Palindromic length of a string is the minimum number of palindromes whose concatenation is equal to this string. The problem of finding the palindromic length drew some attention, and a few O(n log n) time online algorithms were recently designed for it. In this paper we present the first linear time online algorithm for this problem.

Kirill Borozdin, Dmitry Kosolobov, Mikhail Rubinchik, and Arseny M. Shur. Palindromic Length in Linear Time. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 23:1-23:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{borozdin_et_al:LIPIcs.CPM.2017.23, author = {Borozdin, Kirill and Kosolobov, Dmitry and Rubinchik, Mikhail and Shur, Arseny M.}, title = {{Palindromic Length in Linear Time}}, booktitle = {28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)}, pages = {23:1--23:12}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-039-2}, ISSN = {1868-8969}, year = {2017}, volume = {78}, editor = {K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.23}, URN = {urn:nbn:de:0030-drops-73389}, doi = {10.4230/LIPIcs.CPM.2017.23}, annote = {Keywords: palindrome, palindromic length, palindromic factorization, online} }

Document

**Published in:** LIPIcs, Volume 30, 32nd International Symposium on Theoretical Aspects of Computer Science (STACS 2015)

The complexity of computing the Lempel-Ziv decomposition and the set of all runs (= maximal repetitions) is studied in the decision tree model of computation over ordered alphabet. It is known that both these problems can be solved by RAM algorithms in O(n\log\sigma) time, where n is the length of the input string and \sigma is the number of distinct letters in it. We prove an \Omega(n\log\sigma) lower bound on the number of comparisons required to construct the Lempel-Ziv decomposition and thereby conclude that a popular technique of computation of runs using the Lempel-Ziv decomposition cannot achieve an o(n\log\sigma) time bound. In contrast with this, we exhibit an O(n) decision tree algorithm finding all runs in a string. Therefore, in the decision tree model the runs problem is easier than the Lempel-Ziv decomposition. Thus we support the conjecture that there is a linear RAM algorithm finding all runs.

Dmitry Kosolobov. Lempel-Ziv Factorization May Be Harder Than Computing All Runs. In 32nd International Symposium on Theoretical Aspects of Computer Science (STACS 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 30, pp. 582-593, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)

Copy BibTex To Clipboard

@InProceedings{kosolobov:LIPIcs.STACS.2015.582, author = {Kosolobov, Dmitry}, title = {{Lempel-Ziv Factorization May Be Harder Than Computing All Runs}}, booktitle = {32nd International Symposium on Theoretical Aspects of Computer Science (STACS 2015)}, pages = {582--593}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-939897-78-1}, ISSN = {1868-8969}, year = {2015}, volume = {30}, editor = {Mayr, Ernst W. and Ollinger, Nicolas}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2015.582}, URN = {urn:nbn:de:0030-drops-49438}, doi = {10.4230/LIPIcs.STACS.2015.582}, annote = {Keywords: Lempel-Ziv factorization, runs, repetitions, decision tree, lower bounds} }