DROPS

Document

DOI: 10.4230/LIPIcs.STACS.2026.36

Time-Optimal Construction of String Synchronizing Sets

Authors: Jonas Ellert and Tomasz Kociumaka

Published in: LIPIcs, Volume 364, 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)

Abstract

A powerful design principle behind many modern string algorithms is local consistency: breaking the symmetry between string positions based on their small contexts so that matching fragments are handled consistently. Among the most influential instantiations of this principle are string synchronizing sets [Kempa & Kociumaka; STOC 2019]. A τ-synchronizing set of a string of length n is a set of O(n/τ) string positions, chosen using their length-2τ contexts, such that (outside of highly periodic regions) every block of τ consecutive positions contains at least one element of the set. Synchronizing sets have found dozens of applications in diverse settings, from quantum and dynamic algorithms to fully compressed computation. In the classic word RAM model, particularly for strings over small alphabets, they enabled faster solutions to core problems in data compression, text indexing, and string similarity. In this work, we show that any string T ∈ [0 .. σ)ⁿ can be preprocessed in O(n log σ / log n) time so that, for any given integer τ ∈ [1 .. n], a τ-synchronizing set of T can be constructed in O((n log τ)/(τ log n)) time. Both bounds are optimal in the word RAM model with machine word size w = Θ(log n), matching the information-theoretic minimum for the input and output sizes, respectively. Previously, constructing a τ-synchronizing set required O(n/τ) time after an O(n)-time preprocessing [Kociumaka, Radoszewski, Rytter, and Waleń; SICOMP 2024], or, in the restricted regime of τ < 0.2 log_σ n, without any preprocessing needed [Kempa & Kociumaka; STOC 2019]. A simple instantiation of our method outputs the synchronizing set as a sorted list in O(n/τ) time, or as a bitmask in O(n/log n) time. Our optimal construction produces a compact fully indexable dictionary, supporting select queries in O(1) time and rank queries in O(log ((log τ)/(log log n))) time. The latter complexity matches known unconditional cell-probe lower bounds for τ ≤ n^{1-Ω(1)}. 
To achieve this, we introduce a general framework for efficiently processing sparse integer sequences via a custom variable-length encoding. We also augment the optimal variant of van Emde Boas trees [Pătraşcu & Thorup; STOC 2006] with a deterministic linear-time construction. When the set is represented as a bitmask under our sparse encoding, the same guarantees for select and rank queries hold after preprocessing in time proportional to the size of our encoding (in words).
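To make the definition concrete, here is a minimal sketch of a minimizer-style selection rule, in the spirit of the simple construction from [Kempa & Kociumaka; STOC 2019]. It is not the time-optimal construction described above: it runs in roughly O(nτ²) time, it compares raw substrings instead of precomputed identifiers, and it ignores the special handling of highly periodic regions. The function name `sync_set_sketch` is hypothetical.

```python
def sync_set_sketch(text: str, tau: int) -> list[int]:
    """Select position i when the smallest length-tau substring inside the
    context text[i : i + 2*tau] starts at the first or the last possible offset.

    Assumption: substrings are compared lexicographically in place of hashed
    identifiers, and periodic regions get no special treatment.
    """
    selected = []
    for i in range(len(text) - 2 * tau + 1):
        # Length-tau substrings starting at i, i+1, ..., i+tau.
        window = [text[j:j + tau] for j in range(i, i + tau + 1)]
        smallest = min(window)
        if window[0] == smallest or window[-1] == smallest:
            selected.append(i)
    return selected
```

Because the decision for position i reads only text[i : i + 2τ], two positions with identical length-2τ contexts are always treated the same way, which is the consistency property the abstract refers to.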

Cite as

Jonas Ellert and Tomasz Kociumaka. Time-Optimal Construction of String Synchronizing Sets. In 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026). Leibniz International Proceedings in Informatics (LIPIcs), Volume 364, pp. 36:1-36:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)

@InProceedings{ellert_et_al:LIPIcs.STACS.2026.36,
  author =	{Ellert, Jonas and Kociumaka, Tomasz},
  title =	{{Time-Optimal Construction of String Synchronizing Sets}},
  booktitle =	{43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)},
  pages =	{36:1--36:22},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-412-3},
  ISSN =	{1868-8969},
  year =	{2026},
  volume =	{364},
  editor =	{Mahajan, Meena and Manea, Florin and McIver, Annabelle and Thắng, Nguyễn Kim},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2026.36},
  URN =		{urn:nbn:de:0030-drops-255258},
  doi =		{10.4230/LIPIcs.STACS.2026.36},
  annote =	{Keywords: synchronizing sets, local consistency, packed strings}
}

Document

DOI: 10.4230/LIPIcs.STACS.2026.68

Dynamic Pattern Matching with Wildcards

Authors: Arshia Ataee Naeini, Amir-Parsa Mobed, Masoud Seddighin, and Saeed Seddighin

Published in: LIPIcs, Volume 364, 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)

Abstract

We study the fully dynamic pattern matching problem where the pattern may contain up to k wildcard symbols, each matching any symbol of the alphabet. Both the text and the pattern are subject to updates (insert, delete, change). We design an algorithm with 𝒪(n log² n) preprocessing and update/query time 𝒪̃(kn^{k/(k+1)} + k² log n). The bound is truly sublinear for a constant k, and sublinear when k = o(log n). We further complement our results with a conditional lower bound: assuming subquadratic preprocessing time, achieving truly sublinear update time for the case k = Ω(log n) would contradict the Strong Exponential Time Hypothesis (SETH). Finally, we develop sublinear algorithms for two special cases:
- If the pattern contains w non-wildcard symbols, we give an algorithm with preprocessing time 𝒪(nw) and update time 𝒪(w + log n), which is truly sublinear whenever w is truly sublinear.
- Using the FFT technique combined with block decomposition, we design a deterministic truly sublinear algorithm with preprocessing time 𝒪(n^{1.8}) and update time 𝒪(n^{0.8} log n) for the case that there are at most two non-wildcards.
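For orientation, the static version of the problem can be sketched as follows. This is not the dynamic algorithm from the paper: it handles no updates and runs in O(nm) time. It uses the classic sum-of-squares formulation that FFT-based wildcard matching typically relies on, evaluated directly here so that the example needs no external library. The name `wildcard_occurrences` and the choice of `?` as the wildcard symbol are assumptions made for illustration.

```python
WILDCARD = "?"  # assumed wildcard symbol; the abstract does not fix one


def wildcard_occurrences(text: str, pattern: str) -> list[int]:
    """Return all start positions where pattern matches text.

    Symbols are encoded as positive integers and the wildcard as 0, so the
    term p * (p - t)**2 vanishes exactly when the pattern symbol is a
    wildcard or equals the text symbol.
    """
    t = [ord(c) + 1 for c in text]
    p = [0 if c == WILDCARD else ord(c) + 1 for c in pattern]
    n, m = len(t), len(p)
    hits = []
    for i in range(n - m + 1):
        # Expanding this sum gives three correlations of text and pattern,
        # which is the part an FFT would speed up.
        mismatch = sum(p[j] * (p[j] - t[i + j]) ** 2 for j in range(m))
        if mismatch == 0:
            hits.append(i)
    return hits
```

For example, `wildcard_occurrences("abcabd", "ab?")` reports the occurrences at positions 0 and 3.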

Cite as

Arshia Ataee Naeini, Amir-Parsa Mobed, Masoud Seddighin, and Saeed Seddighin. Dynamic Pattern Matching with Wildcards. In 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026). Leibniz International Proceedings in Informatics (LIPIcs), Volume 364, pp. 68:1-68:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)

@InProceedings{naeini_et_al:LIPIcs.STACS.2026.68,
  author =	{Naeini, Arshia Ataee and Mobed, Amir-Parsa and Seddighin, Masoud and Seddighin, Saeed},
  title =	{{Dynamic Pattern Matching with Wildcards}},
  booktitle =	{43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)},
  pages =	{68:1--68:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-412-3},
  ISSN =	{1868-8969},
  year =	{2026},
  volume =	{364},
  editor =	{Mahajan, Meena and Manea, Florin and McIver, Annabelle and Thắng, Nguyễn Kim},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2026.68},
  URN =		{urn:nbn:de:0030-drops-255579},
  doi =		{10.4230/LIPIcs.STACS.2026.68},
  annote =	{Keywords: pattern matching, wildcards, dynamic algorithms, string algorithms, data structures}
}

Document

DOI: 10.4230/LIPIcs.IPEC.2025.20

Deterministically Counting k-Paths and Trees Parameterized by Treewidth in Single-Exponential Time

Authors: Jonne Visser and Hans L. Bodlaender

Published in: LIPIcs, Volume 358, 20th International Symposium on Parameterized and Exact Computation (IPEC 2025)

Abstract

In this paper, we give new and faster deterministic algorithms to count the number of k-paths and trees in host graphs of bounded treewidth. Our algorithms use time that is single-exponential in the treewidth, and employ the determinant method from [Hans L. Bodlaender et al., 2015]. Modifications of the algorithms count in single-exponential time the number of k-paths between specified end-points, the number of k-cycles, and the number of trees with k vertices that are a subgraph of the host graph.

Cite as

Jonne Visser and Hans L. Bodlaender. Deterministically Counting k-Paths and Trees Parameterized by Treewidth in Single-Exponential Time. In 20th International Symposium on Parameterized and Exact Computation (IPEC 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 358, pp. 20:1-20:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{visser_et_al:LIPIcs.IPEC.2025.20,
  author =	{Visser, Jonne and Bodlaender, Hans L.},
  title =	{{Deterministically Counting k-Paths and Trees Parameterized by Treewidth in Single-Exponential Time}},
  booktitle =	{20th International Symposium on Parameterized and Exact Computation (IPEC 2025)},
  pages =	{20:1--20:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-407-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{358},
  editor =	{Agrawal, Akanksha and van Leeuwen, Erik Jan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.IPEC.2025.20},
  URN =		{urn:nbn:de:0030-drops-251529},
  doi =		{10.4230/LIPIcs.IPEC.2025.20},
  annote =	{Keywords: Parameterized Complexity, Counting Subgraphs, #k-path, Dynamic Programming, Tree Decomposition, Determinant Method}
}

Document

DOI: 10.4230/LIPIcs.ESA.2025.55

Safe Sequences via Dominators in DAGs for Path-Covering Problems

Authors: Francisco Sena, Romeo Rizzi, and Alexandru I. Tomescu

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)

Abstract

A path-covering problem on a directed acyclic graph (DAG) requires finding a set of source-to-sink paths that cover all the nodes, all the arcs, or subsets thereof, and that are additionally optimal with respect to some function. In this paper we study safe sequences of nodes or arcs, namely sequences that appear in some path of every path cover of a DAG. We show that safe sequences admit a simple characterization via cutnodes. Moreover, we establish a connection between maximal safe sequences and leaf-to-root paths in the source- and sink-dominator trees of the DAG, which may be of independent interest in the extensive literature on dominators. With dominator trees, safe sequences admit an O(n)-size representation and a linear-time output-sensitive enumeration algorithm running in time O(m + o), where n and m are the number of nodes and arcs, respectively, and o is the total length of the maximal safe sequences. We then apply maximal safe sequences to simplify Integer Linear Programs (ILPs) for two path-covering problems, LeastSquares and MinPathError, which are at the core of RNA transcript assembly problems from bioinformatics. On various datasets, maximal safe sequences can be computed in under 0.1 seconds per graph, on average, and ILP solvers whose search space is reduced in this manner exhibit significant speed-ups. For example, on graphs with a large width, average speed-ups are in the range 50-250× for MinPathError and in the range 80-350× for LeastSquares. Optimizing ILPs using safe sequences can thus become a fast building block of practical RNA transcript assembly tools, and more generally, of path-covering problems.
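The dominator trees mentioned above can be illustrated with a small sketch that computes immediate dominators in a DAG. It covers only this building block, not the safe-sequence characterization or the O(m + o) enumeration from the paper, and it is not linear-time because of the naive ancestor walks. It assumes a single source that reaches every node; the names `immediate_dominators` and `succ` are hypothetical.

```python
def immediate_dominators(succ: dict, source) -> dict:
    """Map each node of a DAG to its immediate dominator (None for the source).

    succ maps every node to the list of its successors. Assumption: `source`
    is the only node without predecessors and reaches all other nodes.
    """
    preds = {v: [] for v in succ}
    for v, outs in succ.items():
        for u in outs:
            preds[u].append(v)

    # Topological order (Kahn's algorithm), starting from the source.
    remaining = {v: len(preds[v]) for v in succ}
    order, ready = [], [source]
    while ready:
        v = ready.pop()
        order.append(v)
        for u in succ[v]:
            remaining[u] -= 1
            if remaining[u] == 0:
                ready.append(u)

    idom, depth = {source: None}, {source: 0}

    def lowest_common_ancestor(a, b):
        # Walk the deeper node upwards in the partial dominator tree.
        while a != b:
            if depth[a] < depth[b]:
                a, b = b, a
            a = idom[a]
        return a

    for v in order[1:]:
        # In a DAG, idom(v) is the common tree ancestor of all predecessors.
        dom = preds[v][0]
        for p in preds[v][1:]:
            dom = lowest_common_ancestor(dom, p)
        idom[v] = dom
        depth[v] = depth[dom] + 1
    return idom
```

Running the same function on the reversed graph, starting from the sink, gives the sink-dominator tree under the analogous single-sink assumption.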

Cite as

Francisco Sena, Romeo Rizzi, and Alexandru I. Tomescu. Safe Sequences via Dominators in DAGs for Path-Covering Problems. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 55:1-55:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{sena_et_al:LIPIcs.ESA.2025.55,
  author =	{Sena, Francisco and Rizzi, Romeo and Tomescu, Alexandru I.},
  title =	{{Safe Sequences via Dominators in DAGs for Path-Covering Problems}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{55:1--55:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.55},
  URN =		{urn:nbn:de:0030-drops-245230},
  doi =		{10.4230/LIPIcs.ESA.2025.55},
  annote =	{Keywords: directed acyclic graph, path cover, dominator tree, integer linear programming, least squares, minimum path error}
}

Document

DOI: 10.4230/LIPIcs.ESA.2025.111

Hardness of Median and Center in the Ulam Metric

Authors: Nick Fischer, Elazar Goldenberg, Mursalin Habib, and Karthik C. S.

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)

Abstract

The classical rank aggregation problem seeks to combine a set X of n permutations into a single representative "consensus" permutation. In this paper, we investigate two fundamental rank aggregation tasks under the well-studied Ulam metric: computing a median permutation (which minimizes the sum of Ulam distances to X) and computing a center permutation (which minimizes the maximum Ulam distance to X) in two settings.
- Continuous Setting: In the continuous setting, the median/center is allowed to be any permutation. It is known that computing a center in the Ulam metric is NP-hard and we add to this by showing that computing a median is NP-hard as well via a simple reduction from the Max-Cut problem. While this result may not be unexpected, it had remained elusive until now and confirms a speculation by Chakraborty, Das, and Krauthgamer [SODA '21].
- Discrete Setting: In the discrete setting, the median/center must be a permutation from the input set. We fully resolve the fine-grained complexity of the discrete median and discrete center problems under the Ulam metric, proving that the naive Õ(n² L)-time algorithm (where L is the length of the permutation) is conditionally optimal. This resolves an open problem raised by Abboud, Bateni, Cohen-Addad, Karthik C. S., and Seddighin [APPROX '23].
Our reductions are inspired by the known fine-grained lower bounds for similarity measures, but we face and overcome several new highly technical challenges.
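The naive discrete algorithm referred to above can be sketched as follows. The sketch assumes the common move-based definition of the Ulam distance, under which the distance between two permutations of length L equals L minus the length of their longest common subsequence. The function names are hypothetical. Each distance costs O(L log L), so evaluating all pairs takes Õ(n² L) time.

```python
from bisect import bisect_left


def ulam_distance(p: list, q: list) -> int:
    """Length of p minus the longest common subsequence of permutations p, q.

    Assumption: one operation moves a single symbol to a new position.
    """
    position_in_q = {value: index for index, value in enumerate(q)}
    relabeled = [position_in_q[value] for value in p]
    # Longest increasing subsequence of the relabeled sequence (patience sorting).
    tails = []
    for x in relabeled:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(p) - len(tails)


def discrete_median(perms: list) -> list:
    """Input permutation minimizing the sum of distances to all inputs."""
    return min(perms, key=lambda c: sum(ulam_distance(c, x) for x in perms))


def discrete_center(perms: list) -> list:
    """Input permutation minimizing the maximum distance to any input."""
    return min(perms, key=lambda c: max(ulam_distance(c, x) for x in perms))
```

With the inputs `[1, 2, 3]`, `[1, 3, 2]`, and `[3, 2, 1]`, both functions return `[1, 3, 2]`, which is at distance 1 from each of the other two permutations.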

Cite as

Nick Fischer, Elazar Goldenberg, Mursalin Habib, and Karthik C. S. Hardness of Median and Center in the Ulam Metric. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 111:1-111:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{fischer_et_al:LIPIcs.ESA.2025.111,
  author =	{Fischer, Nick and Goldenberg, Elazar and Habib, Mursalin and {Karthik C. S.}},
  title =	{{Hardness of Median and Center in the Ulam Metric}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{111:1--111:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.111},
  URN =		{urn:nbn:de:0030-drops-245809},
  doi =		{10.4230/LIPIcs.ESA.2025.111},
  annote =	{Keywords: Ulam distance, median, center, rank aggregation, fine-grained complexity}
}

Document

DOI: 10.4230/LIPIcs.WABI.2025.22

Design of Worst-Case-Optimal Spaced Seeds

Authors: Jens Zentgraf and Sven Rahmann

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

Read mapping (and alignment) is a fundamental problem in biological sequence analysis. For speed and computational efficiency, many popular read mappers tolerate only a few differences between the read and the corresponding part of the reference genome, which leads to reference bias: Reads with too many differences are not guaranteed to be mapped correctly or at all, because to even consider a genomic position, a sufficiently long exact match (seed) must exist. While pangenomes and their graph-based representations provide one way to avoid reference bias by enlarging the reference, we explore an orthogonal approach and consider stronger substitution-tolerant primitives, namely spaced seeds or gapped k-mers. Given two integers k ≤ w, one considers k selected positions, described by a mask, from each length-w window in a sequence. In the existing literature, masks with certain probabilistic guarantees have been designed for small values of k. Here, for the first time, we take a combinatorial approach from a worst-case perspective. For any mask, using integer linear programs, we find least favorable distributions of sequence changes in two different senses: (1) minimizing the number of unchanged windows; (2) minimizing the number of positions covered by unchanged windows. Then, among all masks or all symmetric masks of a given shape (k,w), we find the set of best masks that maximize these minima. As a result, we obtain robust masks, even for large numbers of changes. We illustrate the properties of these masks by constructing a challenging set of reads that contain many approximately equidistributed substitutions (but no indels) that many existing tools cannot map, even though they are in principle easily mappable (apart from the large number of changes) because they originate from selected non-repetitive regions of the human reference genome. 
We observe that the majority of these reads can be mapped with a simple alignment-free approach using chosen spaced masks, where seeding approaches based on contiguous k-mers fail.
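The notions of a mask and of an unchanged window can be made concrete with a short sketch. The mask notation (`#` for a selected position, `_` for an ignored one) and both function names are assumptions made for illustration; the sketch does not reproduce the integer linear programs used to find worst-case distributions of changes.

```python
def gapped_kmers(sequence: str, mask: str) -> list[str]:
    """Apply a mask of length w to every length-w window of the sequence.

    Assumption: '#' marks one of the k selected positions, '_' an ignored one.
    """
    w = len(mask)
    keep = [i for i, c in enumerate(mask) if c == "#"]
    return [
        "".join(sequence[start + i] for i in keep)
        for start in range(len(sequence) - w + 1)
    ]


def unchanged_windows(length: int, mask: str, changed: list[int]) -> list[int]:
    """Start positions of windows whose selected positions avoid all changes."""
    w = len(mask)
    keep = [i for i, c in enumerate(mask) if c == "#"]
    changed_set = set(changed)
    return [
        start
        for start in range(length - w + 1)
        if all(start + i not in changed_set for i in keep)
    ]
```

In a sequence of length 6 with a single substitution at position 1, the mask `#_#` leaves the windows starting at 0, 2, and 3 unchanged, because the window at 0 skips the changed position.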

Cite as

Jens Zentgraf and Sven Rahmann. Design of Worst-Case-Optimal Spaced Seeds. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 22:1-22:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{zentgraf_et_al:LIPIcs.WABI.2025.22,
  author =	{Zentgraf, Jens and Rahmann, Sven},
  title =	{{Design of Worst-Case-Optimal Spaced Seeds}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{22:1--22:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.22},
  URN =		{urn:nbn:de:0030-drops-239488},
  doi =		{10.4230/LIPIcs.WABI.2025.22},
  annote =	{Keywords: Spaced seed, Gapped k-mer, Integer linear program (ILP), Worst-case design, Reference bias}
}

Document

DOI: 10.4230/LIPIcs.WABI.2025.9

Dolphyin: A Combinatorial Algorithm for Identifying 1-Dollo Phylogenies in Cancer

Authors: Daniel W. Feng and Mohammed El-Kebir

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

Several recent cancer phylogeny inference methods have used the k-Dollo evolutionary model for single-nucleotide variants. Specifically, in this problem one is given an m × n binary matrix B and seeks a rooted tree T with m leaves that correspond to the m rows of B, and each node of T is labeled by a binary state for each of the n characters subject to the restriction that each character is gained at most once (0-to-1 transition) and subsequently lost at most k times (1-to-0 transitions). The 1-Dollo variant, also known as the persistent perfect phylogeny where one is restricted to at most k = 1 losses per character, has been studied extensively, but its hardness remains an open question. Here, we prove that the 1-Dollo Linear Phylogeny (1DLP) problem, where we additionally require the resulting 1-Dollo phylogeny T to be linear, is equivalent to verifying whether the input matrix B adheres to the Consecutive Ones Property (C1P), which can be solved in polynomial time. Due to the equivalence, several known NP-hardness results for relevant variants of C1P carry over to 1DLP, including the minimization of false negatives (0-to-1 modifications to the input matrix B) or the allowance of 2 gains and 2 losses. We furthermore show how we can recursively decompose any, not necessarily linear, 1-Dollo phylogeny T into several 1-Dollo linear phylogenies, connected by matching branching points. We extend this characterization to matrices B that admit 1-Dollo phylogenies, giving necessary and sufficient conditions for the existence of a novel decomposition of B into several submatrices and corresponding branching points. This decomposition forms the basis of Dolphyin, a new exponential-time algorithm for inferring 1-Dollo phylogenies that efficiently leverages the determination of linear 1-Dollo phylogenies as a subroutine. Dolphyin can also be applied to input matrices B with false negatives. 
We demonstrate that Dolphyin is runtime-competitive with a previous integer linear programming based algorithm SPhyR on simulated datasets. We additionally analyze simulated datasets with false negative errors and find that in the median case, Dolphyin infers 1-Dollo phylogenies with inferred error rates at or below the ground truth rate. Finally, we apply Dolphyin to 99 acute myeloid leukemia single-cell sequencing datasets, finding that the majority of the cancers can be explained by 1-Dollo phylogenies with false negative error rates in line with the used sequencing technology. Availability. Dolphyin is available at: https://github.com/elkebir-group/Dolphyin.

Cite as

Daniel W. Feng and Mohammed El-Kebir. Dolphyin: A Combinatorial Algorithm for Identifying 1-Dollo Phylogenies in Cancer. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 9:1-9:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{feng_et_al:LIPIcs.WABI.2025.9,
  author =	{Feng, Daniel W. and El-Kebir, Mohammed},
  title =	{{Dolphyin: A Combinatorial Algorithm for Identifying 1-Dollo Phylogenies in Cancer}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{9:1--9:23},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.9},
  URN =		{urn:nbn:de:0030-drops-239356},
  doi =		{10.4230/LIPIcs.WABI.2025.9},
  annote =	{Keywords: Intra-tumor heterogeneity, persistent perfect phylogeny, consecutive ones property, combinatorics}
}

Document

DOI: 10.4230/LIPIcs.WABI.2025.5

Improved Algorithms for Bi-Partition Function Computation

Authors: John D. Bridgers, Jan Hoinka, S. Cenk Sahinalp, Salem Malikic, Teresa M. Przytycka, and Funda Ergun

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

The evolutionary history of a tumor, inferred from single-cell sequencing data, is typically represented as a tree in which each subtree corresponds to a clade of cells seeded by a specific set of mutations. Traditional methods typically identify a single most likely tree for downstream analyses, such as detecting driver mutations, studying mutation co-occurrence patterns and identifying common evolutionary trajectories. However, the reliability of such inferred trees, particularly their topology, clade composition, and mutational placements, often remains uncertain. To quantify this uncertainty, the concept of a Bi-partition Function was recently introduced, providing a probabilistic measure of how reliably a mutation seeds a given clade of cells. The single available algorithm for estimating the Bi-partition Function relies on simplifying assumptions and uses sampling for limited exploration of the tree-space. In this paper, we introduce the first exact algorithm for computing the Bi-partition Function. Our algorithm scales linearly with the number of mutations but exhibits super-exponential complexity with respect to the number of cells. Despite this complexity, it establishes crucial ground truth values, essential for accurately benchmarking and validating approximate methods. Additionally, we present a GPU-accelerated version of the available sampling-based algorithm, significantly boosting the computational performance through large-scale parallelization, enabling more accurate Bi-partition Function estimates via deeper exploration of the tree spaces. We compare our methods on synthetic datasets, demonstrating that especially when the number of mutations sufficiently exceeds the number of cells, our GPU-accelerated sampling algorithm closely approximates the exact ground truth values.

Cite as

John D. Bridgers, Jan Hoinka, S. Cenk Sahinalp, Salem Malikic, Teresa M. Przytycka, and Funda Ergun. Improved Algorithms for Bi-Partition Function Computation. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 5:1-5:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{bridgers_et_al:LIPIcs.WABI.2025.5,
  author =	{Bridgers, John D. and Hoinka, Jan and Sahinalp, S. Cenk and Malikic, Salem and Przytycka, Teresa M. and Ergun, Funda},
  title =	{{Improved Algorithms for Bi-Partition Function Computation}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{5:1--5:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.5},
  URN =		{urn:nbn:de:0030-drops-239318},
  doi =		{10.4230/LIPIcs.WABI.2025.5},
  annote =	{Keywords: Tumor Evolution, Bi-partition Function, Single-Cell Sequencing, Algorithms}
}

Document

DOI: 10.4230/OASIcs.Manzini.9

Search Schemes for Approximate Pattern Matching: An Overview

Authors: Lore Depuydt, Jan Fostier, Simon Gottlieb, Gregory Kucherov, Knut Reinert, and Luca Renders

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

We provide a brief survey of results on solving the approximate pattern matching problem using search schemes, as introduced by Kucherov et al. (2016). We demonstrate that search schemes constitute a flexible and versatile tool that enables the specification of various search strategies, including several known filtering methods. We present approaches for designing efficient search schemes and for implementing them effectively. Finally, we conclude with experimental results comparing multiple search schemes on DNA sequencing data using the Columba software by Renders et al. (2021).

Cite as

Lore Depuydt, Jan Fostier, Simon Gottlieb, Gregory Kucherov, Knut Reinert, and Luca Renders. Search Schemes for Approximate Pattern Matching: An Overview. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 9:1-9:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{depuydt_et_al:OASIcs.Manzini.9,
  author =	{Depuydt, Lore and Fostier, Jan and Gottlieb, Simon and Kucherov, Gregory and Reinert, Knut and Renders, Luca},
  title =	{{Search Schemes for Approximate Pattern Matching: An Overview}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{9:1--9:16},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.9},
  URN =		{urn:nbn:de:0030-drops-239172},
  doi =		{10.4230/OASIcs.Manzini.9},
  annote =	{Keywords: FM-index, bidirectional index, approximate pattern matching, search scheme}
}

Document

DOI: 10.4230/LIPIcs.SEA.2025.14

Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing

Authors: Diego Díaz-Domínguez

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)

Abstract

We present compression algorithms designed to process terabyte-sized datasets in parallel. Our approach builds on locally consistent grammars, a lightweight form of compression, combined with simple post-processing techniques to achieve further space reductions. Locally consistent grammar algorithms are suitable for scaling as they need minimal satellite information to compact the text, but they are not inherently parallel. To enable parallelisation, we introduce a novel concept that we call stable local consistency. A grammar algorithm ALG is stable if for any pattern P occurring in a collection 𝒯 = {T_1, T_2, …, T_k}, instances ALG(T_1), ALG(T_2), …, ALG(T_k) independently produce cores for P with the same topology. In a locally consistent grammar, the core of P is a subset of nodes and edges in the parse tree of 𝒯 that remains the same in all the occurrences of P. This feature enables compression, but it only holds if ALG defines a common set of nonterminal symbols for the strings. Stability removes this restriction, allowing us to run ALG(T_1), ALG(T_2), …, ALG(T_k) in parallel and subsequently merge their grammars into a single output equivalent to that of ALG(𝒯). We implemented our ideas and tested them on massive datasets. Our experiments showed that our method could process 7.9 TB of bacterial genomes in around nine hours, using 16 threads and 0.43 bits/symbol of working memory, achieving a compression ratio of 85x.

Cite as

Diego Díaz-Domínguez. Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing. In 23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 14:1-14:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{diazdominguez:LIPIcs.SEA.2025.14,
  author =	{D{\'\i}az-Dom{\'\i}nguez, Diego},
  title =	{{Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{14:1--14:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.14},
  URN =		{urn:nbn:de:0030-drops-232525},
  doi =		{10.4230/LIPIcs.SEA.2025.14},
  annote =	{Keywords: Grammar compression, locally consistent parsing, hashing}
}

Document

DOI: 10.4230/LIPIcs.WABI.2018.22

A Multi-labeled Tree Edit Distance for Comparing "Clonal Trees" of Tumor Progression

Authors: Nikolai Karpov, Salem Malikic, Md. Khaledur Rahman, and S. Cenk Sahinalp

Published in: LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)

Abstract

We introduce a new edit distance measure between a pair of "clonal trees", each representing the progression and mutational heterogeneity of a tumor sample, constructed by the use of single cell or bulk high throughput sequencing data. In a clonal tree, each vertex represents a specific tumor clone, and is labeled with one or more mutations in a way that each mutation is assigned to the oldest clone that harbors it. Given two clonal trees, our multi-labeled tree edit distance (MLTED) measure is defined as the minimum number of mutation/label deletions, (empty) leaf deletions, and vertex (clonal) expansions, applied in any order, to convert each of the two trees to the maximal common tree. We show that the MLTED measure can be computed efficiently in polynomial time and it captures the similarity between trees of different clonal granularity well. We have implemented our algorithm to compute MLTED exactly and applied it to a variety of data sets successfully. The source code of our method can be found in: https://github.com/khaled-rahman/leafDelTED.

Cite as

Nikolai Karpov, Salem Malikic, Md. Khaledur Rahman, and S. Cenk Sahinalp. A Multi-labeled Tree Edit Distance for Comparing "Clonal Trees" of Tumor Progression. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 22:1-22:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@InProceedings{karpov_et_al:LIPIcs.WABI.2018.22,
  author =	{Karpov, Nikolai and Malikic, Salem and Rahman, Md. Khaledur and Sahinalp, S. Cenk},
  title =	{{A Multi-labeled Tree Edit Distance for Comparing "Clonal Trees" of Tumor Progression}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{22:1--22:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.22},
  URN =		{urn:nbn:de:0030-drops-93242},
  doi =		{10.4230/LIPIcs.WABI.2018.22},
  annote =	{Keywords: Intra-tumor heterogeneity, tumor evolution, multi-labeled tree, tree edit distance, dynamic programming}
}

11 Search Results for "Sahinalp, S. Cenk"
