DROPS

Volume

LIPIcs, Volume 172

20th International Workshop on Algorithms in Bioinformatics (WABI 2020)

WABI 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference)

Editors: Carl Kingsford and Nadia Pisanti

Document

DOI: 10.4230/LIPIcs.ESA.2025.55

Safe Sequences via Dominators in DAGs for Path-Covering Problems

Authors: Francisco Sena, Romeo Rizzi, and Alexandru I. Tomescu

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)

Abstract

A path-covering problem on a directed acyclic graph (DAG) requires finding a set of source-to-sink paths that cover all the nodes, all the arcs, or subsets thereof, and additionally they are optimal with respect to some function. In this paper we study safe sequences of nodes or arcs, namely sequences that appear in some path of every path cover of a DAG. We show that safe sequences admit a simple characterization via cutnodes. Moreover, we establish a connection between maximal safe sequences and leaf-to-root paths in the source- and sink-dominator trees of the DAG, which may be of independent interest in the extensive literature on dominators. With dominator trees, safe sequences admit an O(n)-size representation and a linear-time output-sensitive enumeration algorithm running in time O(m + o), where n and m are the number of nodes and arcs, respectively, and o is the total length of the maximal safe sequences. We then apply maximal safe sequences to simplify Integer Linear Programs (ILPs) for two path-covering problems, LeastSquares and MinPathError, which are at the core of RNA transcript assembly problems from bioinformatics. On various datasets, maximal safe sequences can be computed in under 0.1 seconds per graph, on average, and ILP solvers whose search space is reduced in this manner exhibit significant speed-ups. For example on graphs with a large width, average speed-ups are in the range 50-250× for MinPathError and in the range 80-350× for LeastSquares. Optimizing ILPs using safe sequences can thus become a fast building block of practical RNA transcript assembly tools, and more generally, of path-covering problems.

Cite as

Francisco Sena, Romeo Rizzi, and Alexandru I. Tomescu. Safe Sequences via Dominators in DAGs for Path-Covering Problems. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 55:1-55:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{sena_et_al:LIPIcs.ESA.2025.55,
  author =	{Sena, Francisco and Rizzi, Romeo and Tomescu, Alexandru I.},
  title =	{{Safe Sequences via Dominators in DAGs for Path-Covering Problems}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{55:1--55:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.55},
  URN =		{urn:nbn:de:0030-drops-245230},
  doi =		{10.4230/LIPIcs.ESA.2025.55},
  annote =	{Keywords: directed acyclic graph, path cover, dominator tree, integer linear programming, least squares, minimum path error}
}

Document

DOI: 10.4230/LIPIcs.WABI.2025.20

A k-mer-Based Estimator of the Substitution Rate Between Repetitive Sequences

Authors: Haonan Wu, Antonio Blanca, and Paul Medvedev

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

K-mer-based analysis of genomic data is ubiquitous, but the presence of repetitive k-mers continues to pose problems for the accuracy of many methods. For example, the Mash tool (Ondov et al. 2016) can accurately estimate the substitution rate between two low-repetitive sequences from their k-mer sketches; however, it is inaccurate on repetitive sequences such as the centromere of a human chromosome. Follow-up work by Blanca et al. (2021) has attempted to model how mutations affect k-mer sets based on strong assumptions that the sequence is non-repetitive and that mutations do not create spurious k-mer matches. However, the theoretical foundations for extending an estimator like Mash to work in the presence of repeat sequences have been lacking. In this work, we relax the non-repetitive assumption and propose a novel estimator for the mutation rate. We derive theoretical bounds on our estimator’s bias. Our experiments show that it remains accurate for repetitive genomic sequences, such as the alpha satellite higher order repeats in centromeres. We demonstrate our estimator’s robustness across diverse datasets and various ranges of the substitution rate and k-mer size. Finally, we show how sketching can be used to avoid dealing with large k-mer sets while retaining accuracy. Our software is available at https://github.com/medvedevgroup/Repeat-Aware_Substitution_Rate_Estimator.

Cite as

Haonan Wu, Antonio Blanca, and Paul Medvedev. A k-mer-Based Estimator of the Substitution Rate Between Repetitive Sequences. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 20:1-20:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{wu_et_al:LIPIcs.WABI.2025.20,
  author =	{Wu, Haonan and Blanca, Antonio and Medvedev, Paul},
  title =	{{A k-mer-Based Estimator of the Substitution Rate Between Repetitive Sequences}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{20:1--20:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.20},
  URN =		{urn:nbn:de:0030-drops-239465},
  doi =		{10.4230/LIPIcs.WABI.2025.20},
  annote =	{Keywords: k-mers, sketching, mutation rates}
}

Document

DOI: 10.4230/LIPIcs.WABI.2025.16

Estimation of Substitution and Indel Rates via k-mer Statistics

Authors: Mahmudur Rahman Hera, Paul Medvedev, David Koslicki, and Antonio Blanca

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

Methods utilizing k-mers are widely used in bioinformatics, yet our understanding of their statistical properties under realistic mutation models remains incomplete. Previously, substitution-only mutation models have been considered to derive precise expectations and variances for mutated k-mers and intervals of mutated and non-mutated sequences. In this work, we consider a mutation model that incorporates insertions and deletions in addition to single-nucleotide substitutions. Within this framework, we derive closed-form k-mer-based estimators for the three fundamental mutation parameters: substitution, deletion rate, and insertion rates. We provide theoretical guarantees in the form of concentration inequalities, ensuring accuracy of our estimators under reasonable model assumptions. Empirical evaluations on simulated evolution of genomic sequences confirm our theoretical findings, demonstrating that accounting for insertions and deletions signals allows for accurate estimation of mutation rates and improves upon the results obtained by considering a substitution-only model. An implementation of estimating the mutation parameters from a pair of fasta files is available here: https://github.com/KoslickiLab/estimate_rates_using_mutation_model.git. The results presented in this manuscript can be reproduced using the code available here: https://github.com/KoslickiLab/est_rates_experiments.git.

Cite as

Mahmudur Rahman Hera, Paul Medvedev, David Koslicki, and Antonio Blanca. Estimation of Substitution and Indel Rates via k-mer Statistics. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 16:1-16:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{rahmanhera_et_al:LIPIcs.WABI.2025.16,
  author =	{Rahman Hera, Mahmudur and Medvedev, Paul and Koslicki, David and Blanca, Antonio},
  title =	{{Estimation of Substitution and Indel Rates via k-mer Statistics}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{16:1--16:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.16},
  URN =		{urn:nbn:de:0030-drops-239422},
  doi =		{10.4230/LIPIcs.WABI.2025.16},
  annote =	{Keywords: k-mers, mutation rate, indel, alignment-free, estimation, substitution, insertion, deletion}
}

Document

DOI: 10.4230/LIPIcs.WABI.2025.7

Sequence Similarity Estimation by Random Subsequence Sketching

Authors: Ke Chen, Vinamratha Pattar, and Mingfu Shao

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

Sequence similarity estimation is essential for many bioinformatics tasks, including functional annotation, phylogenetic analysis, and overlap graph construction. Alignment-free methods aim to solve large-scale sequence similarity estimation by mapping sequences to more easily comparable features that can approximate edit distances efficiently. Substrings or k-mers, as the dominant choice of features, face an unavoidable compromise between sensitivity and specificity when selecting the proper k-value. Recently, subsequence-based features have shown improved performance, but they are computationally demanding, and determining the ideal subsequence length remains an intricate art. In this work, we introduce SubseqSketch, a novel alignment-free scheme that maps a sequence to an integer vector, where the entries correspond to dynamic, rather than fixed, lengths of random subsequences. The cosine similarity between these vectors exhibits a strong correlation with the edit similarity between the original sequences. Through experiments on benchmark datasets, we demonstrate that SubseqSketch is both efficient and effective across various alignment-free tasks, including nearest neighbor search and phylogenetic clustering. A C++ implementation of SubseqSketch is openly available at https://github.com/Shao-Group/SubseqSketch.

Cite as

Ke Chen, Vinamratha Pattar, and Mingfu Shao. Sequence Similarity Estimation by Random Subsequence Sketching. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 7:1-7:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{chen_et_al:LIPIcs.WABI.2025.7,
  author =	{Chen, Ke and Pattar, Vinamratha and Shao, Mingfu},
  title =	{{Sequence Similarity Estimation by Random Subsequence Sketching}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{7:1--7:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.7},
  URN =		{urn:nbn:de:0030-drops-239332},
  doi =		{10.4230/LIPIcs.WABI.2025.7},
  annote =	{Keywords: Alignment-free sequence comparison, Phylogenetic clustering, Nearest neighbor search, Edit distance embedding}
}

Document

DOI: 10.4230/LIPIcs.WABI.2025.3

Approximability of Longest Run Subsequence and Complementary Minimization Problems

Authors: Yuichi Asahiro, Mingyang Gong, Jesper Jansson, Guohui Lin, Sichen Lu, Eiji Miyano, Hirotaka Ono, Toshiki Saitoh, and Shunichi Tanaka

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

We study the polynomial-time approximability of the Longest Run Subsequence problem (LRS for short) and its complementary minimization variant Minimum Run Subsequence Deletion problem (MRSD for short). For a string S = s₁ ⋯ s_n over an alphabet Σ, a subsequence S' of S is S' = s_{i₁} ⋯ s_{i_p}, such that 1 ≤ i₁ < i₂ < … < i_p ≤ |S|. A run of a symbol σ ∈ Σ in S is a maximal substring of consecutive occurrences of σ. A run subsequence S' of S is a subsequence of S in which every symbol σ ∈ Σ occurs in at most one run. The co-subsequence ̅{S'} of the subsequence S' = s_{i₁} ⋯ s_{i_p} in S is the subsequence obtained by deleting all the characters in S' from S, i.e., ̅{S'} = s_{j₁} ⋯ s_{j_{n-p}} such that j₁ < j₂ < … < j_{n-p} and {j₁, …, j_{n-p}} = {1, …, n}⧵ {i₁, …, i_p}. Given a string S, the goal of LRS (resp., MRSD) is to find a run subsequence S^* of S such that the length |S^*| is maximized (resp., the number | ̅{S^*}| of deleted symbols from S is minimized) over all the run subsequences of S. Let k be the maximum number of symbol occurrences in the input S. It is known that LRS and MRSD are APX-hard even if k = 2. In this paper, we show that LRS can be approximated in polynomial time within factors of (k+2)/3 for k = 2 or 3, and 2(k+1)/5 for every k ≥ 4. Furthermore, we show that MRSD can be approximated in linear time within a factor of (k+4)/4 if k is even and (k+3)/4 if k is odd.

Cite as

Yuichi Asahiro, Mingyang Gong, Jesper Jansson, Guohui Lin, Sichen Lu, Eiji Miyano, Hirotaka Ono, Toshiki Saitoh, and Shunichi Tanaka. Approximability of Longest Run Subsequence and Complementary Minimization Problems. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 3:1-3:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{asahiro_et_al:LIPIcs.WABI.2025.3,
  author =	{Asahiro, Yuichi and Gong, Mingyang and Jansson, Jesper and Lin, Guohui and Lu, Sichen and Miyano, Eiji and Ono, Hirotaka and Saitoh, Toshiki and Tanaka, Shunichi},
  title =	{{Approximability of Longest Run Subsequence and Complementary Minimization Problems}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{3:1--3:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.3},
  URN =		{urn:nbn:de:0030-drops-239290},
  doi =		{10.4230/LIPIcs.WABI.2025.3},
  annote =	{Keywords: Longest run subsequence, minimum run subsequence deletion, approximation algorithm}
}

@InProceedings{asahiro_et_al:LIPIcs.WABI.2025.3,
  author =	{Asahiro, Yuichi and Gong, Mingyang and Jansson, Jesper and Lin, Guohui and Lu, Sichen and Miyano, Eiji and Ono, Hirotaka and Saitoh, Toshiki and Tanaka, Shunichi},
  title =	{{Approximability of Longest Run Subsequence and Complementary Minimization Problems}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{3:1--3:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.3},
  URN =		{urn:nbn:de:0030-drops-239290},
  doi =		{10.4230/LIPIcs.WABI.2025.3},
  annote =	{Keywords: Longest run subsequence, minimum run subsequence deletion, approximation algorithm}
}

Document

DOI: 10.4230/LIPIcs.WABI.2025.9

Dolphyin: A Combinatorial Algorithm for Identifying 1-Dollo Phylogenies in Cancer

Authors: Daniel W. Feng and Mohammed El-Kebir

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

Several recent cancer phylogeny inference methods have used the k-Dollo evolutionary model for single-nucleotide variants. Specifically, in this problem one is given an m × n binary matrix B and seeks a rooted tree T with m leaves that correspond to the m rows of B, and each node of T is labeled by a binary state for each of the n characters subject to the restriction that each character is gained at most once (0-to-1 transition) and subsequently lost at most k times (1-to-0 transitions). The 1-Dollo variant, also known as the persistent perfect phylogeny where one is restricted to at most k = 1 losses per character, has been studied extensively, but its hardness remains an open question. Here, we prove that the 1-Dollo Linear Phylogeny (1DLP) problem, where we additionally require the resulting 1-Dollo phylogeny T to be linear, is equivalent to verifying whether the input matrix B adheres to the Consecutive Ones Property (C1P), which can be solved in polynomial time. Due to the equivalence, several known NP-hardness results for relevant variants of C1P carry over to 1DLP, including the minimization of false negatives (0-to-1 modifications to the input matrix B) or the allowance of 2 gains and 2 losses. We furthermore show how we can recursively decompose any, not necessarily linear, 1-Dollo phylogeny T into several 1-Dollo linear phylogenies, connected by matching branching points. We extend this characterization to matrices B that admit 1-Dollo phylogenies, giving necessary and sufficient conditions for the existence of a novel decomposition of B into several submatrices and corresponding branching points. This decomposition forms the basis of Dolphyin, a new exponential-time algorithm for inferring 1-Dollo phylogenies that efficiently leverages the determination of linear 1-Dollo phylogenies as a subroutine. Dolphyin can also be applied to input matrices B with false negatives. We demonstrate that Dolphyin is runtime-competitive with a previous integer linear programming based algorithm SPhyR on simulated datasets. We additionally analyze simulated datasets with false negative errors and find that in the median case, Dolphyin infers 1-Dollo phylogenies with inferred error rates at or below the ground truth rate. Finally, we apply Dolphyin to 99 acute myeloid leukemia single-cell sequencing datasets, finding that the majority of the cancers can be explained by 1-Dollo phylogenies with false negative error rates in line with the used sequencing technology. Availability. Dolphyin is available at: https://github.com/elkebir-group/Dolphyin.

Cite as

Daniel W. Feng and Mohammed El-Kebir. Dolphyin: A Combinatorial Algorithm for Identifying 1-Dollo Phylogenies in Cancer. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 9:1-9:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{feng_et_al:LIPIcs.WABI.2025.9,
  author =	{Feng, Daniel W. and El-Kebir, Mohammed},
  title =	{{Dolphyin: A Combinatorial Algorithm for Identifying 1-Dollo Phylogenies in Cancer}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{9:1--9:23},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.9},
  URN =		{urn:nbn:de:0030-drops-239356},
  doi =		{10.4230/LIPIcs.WABI.2025.9},
  annote =	{Keywords: Intra-tumor heterogeneity, persistent perfect phylogeny, consecutive ones property, combinatorics}
}

Document

DOI: 10.4230/LIPIcs.SEA.2025.20

SimdMinimizers: Computing Random Minimizers, fast

Authors: Ragnar Groot Koerkamp and Igor Martayan

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)

Abstract

Motivation. Because of the rapidly-growing amount of sequencing data, computing sketches of large textual datasets has become an essential preprocessing task. These sketches are typically much smaller than the input sequences, but preserve sufficient information for downstream analysis. Minimizers are an especially popular sketching technique and used in a wide variety of applications. They sample at least one out of every w consecutive k-mers. As DNA sequencers are getting more accurate, some applications can afford to use a larger w and hence sparser and smaller sketches. And as sketches get smaller, their analysis becomes faster, so the time spent sketching the full-sized input becomes more of a bottleneck. Methods. Our library simd-minimizers implements a random minimizer algorithm using SIMD instructions. It supports both AVX2 and NEON architectures. Its main novelty is two-fold. First, it splits the input into 8 chunks that are streamed over in parallel through all steps of the algorithm. This is enabled by using the completely deterministic two-stacks sliding window minimum algorithm, which seems not to have been used before for finding minimizers. Results. Our library is up to 6.8× faster than a scalar implementation of the rescan method when w = 5 is small, and 3.4× faster for larger w = 19. Computing canonical minimizers is less than 50% slower than computing forward minimizers, and over 15× faster than the existing implementation in the minimizer-iter crate. Our library finds all (canonical) minimizers of a 3.2 Gbp human genome in 5.2 (resp. 6.7) seconds.

Cite as

Ragnar Groot Koerkamp and Igor Martayan. SimdMinimizers: Computing Random Minimizers, fast. In 23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 20:1-20:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{grootkoerkamp_et_al:LIPIcs.SEA.2025.20,
  author =	{Groot Koerkamp, Ragnar and Martayan, Igor},
  title =	{{SimdMinimizers: Computing Random Minimizers, fast}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{20:1--20:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.20},
  URN =		{urn:nbn:de:0030-drops-232581},
  doi =		{10.4230/LIPIcs.SEA.2025.20},
  annote =	{Keywords: Minimizers, Randomized algorithms, Sketching, Hashing}
}

Document

DOI: 10.4230/LIPIcs.SEA.2025.4

U-Index: A Universal Indexing Framework for Matching Long Patterns

Authors: Lorraine A. K. Ayad, Gabriele Fici, Ragnar Groot Koerkamp, Grigorios Loukides, Rob Patro, Giulio Ermanno Pibiri, and Solon P. Pissis

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)

Abstract

Motivation. Text indexing is a fundamental and well-studied problem. Classic solutions to this problem either replace the original text with a compressed representation, e.g., the FM-index and its variants, or keep it uncompressed but attach some redundancy - an index - to accelerate matching, e.g., the suffix array. The former solutions thus retain excellent compressed space, but are practically slow to construct and query. The latter approaches, instead, sacrifice space efficiency but are typically faster; for example, the suffix array takes much more space than the text itself for commonly used alphabets, like ASCII or DNA, but it is very fast to construct and query. Methods. In this paper, we show that efficient text indexing can be achieved using just a small extra space on top of the original text, provided that the query patterns are sufficiently long. More specifically, we develop a new indexing paradigm in which a sketch of a query pattern is first matched against a sketch of the text. Once candidate matches are retrieved, they are verified using the original text. This paradigm is thus universal in the sense that it allows us to use any solution to index the sketched text, like a suffix array, FM-index, or r-index. Results. We explore both the theory and the practice of this universal framework. With an extensive experimental analysis, we show that, surprisingly, universal indexes can be constructed much faster than their unsketched counterparts and take a fraction of the space, as a direct consequence of (i) having a lower bound on the length of patterns and (ii) working in sketch space. Furthermore, these data structures have the potential of retaining or even improving query time, because matching against the sketched text is faster and verifying candidates can be theoretically done in constant time per occurrence (or, in practice, by short and cache-friendly scans of the text). Finally, we discuss some important applications of this novel indexing paradigm to computational biology. We hypothesize that such indexes will be particularly effective when the queries are sufficiently long, and so we demonstrate applications in long-read mapping.

Cite as

Lorraine A. K. Ayad, Gabriele Fici, Ragnar Groot Koerkamp, Grigorios Loukides, Rob Patro, Giulio Ermanno Pibiri, and Solon P. Pissis. U-Index: A Universal Indexing Framework for Matching Long Patterns. In 23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 4:1-4:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{ayad_et_al:LIPIcs.SEA.2025.4,
  author =	{Ayad, Lorraine A. K. and Fici, Gabriele and Groot Koerkamp, Ragnar and Loukides, Grigorios and Patro, Rob and Pibiri, Giulio Ermanno and Pissis, Solon P.},
  title =	{{U-Index: A Universal Indexing Framework for Matching Long Patterns}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{4:1--4:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.4},
  URN =		{urn:nbn:de:0030-drops-232420},
  doi =		{10.4230/LIPIcs.SEA.2025.4},
  annote =	{Keywords: Text Indexing, Sketching, Minimizers, Hashing}
}

Document

Track A: Algorithms, Complexity and Games

DOI: 10.4230/LIPIcs.ICALP.2025.42

Fitting Tree Metrics and Ultrametrics in Data Streams

Authors: Amir Carmel, Debarati Das, Evangelos Kipouridis, and Evangelos Pipis

Published in: LIPIcs, Volume 334, 52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025)

Abstract

Fitting distances to tree metrics and ultrametrics are two widely used methods in hierarchical clustering, primarily explored within the context of numerical taxonomy. Formally, given a positive distance function D: binom(V,2) → ℝ_{>0}, the goal is to find a tree (or an ultrametric) T including all elements of set V, such that the difference between the distances among vertices in T and those specified by D is minimized. Numerical taxonomy was first introduced by Sneath and Sokal [Nature 1962], and since then it has been studied extensively in both biology and computer science. In this paper, we initiate the study of ultrametric and tree metric fitting problems in the semi-streaming model, where the distances between pairs of elements from V (with |V| = n), defined by the function D, can arrive in an arbitrary order. We study these problems under various distance norms; namely the 𝓁₀ objective, which aims to minimize the number of modified entries in D to fit a tree-metric or an ultrametric; the 𝓁₁ objective, which seeks to minimize the total sum of distance errors across all pairs of points in V; and the 𝓁_∞ objective, which focuses on minimizing the maximum error incurred by any entries in D. - Our first result addresses the 𝓁₀ objective. We provide a single-pass polynomial-time Õ(n)-space O(1) approximation algorithm for ultrametrics and prove that no single-pass exact algorithm exists, even with exponential time. - Next, we show that the algorithm for 𝓁₀ implies an O(Δ/δ) approximation for the 𝓁₁ objective, where Δ is the maximum, and δ is the minimum absolute difference between distances in the input. This bound matches the best-known approximation for the RAM model using a combinatorial algorithm when Δ/δ = O(n). - For the 𝓁_∞ objective, we provide a complete characterization of the ultrametric fitting problem. First, we present a single-pass polynomial-time Õ(n)-space 2-approximation algorithm and show that no better than 2-approximation is possible, even with exponential time. Furthermore, we show that with an additional pass, it is possible to achieve a polynomial-time exact algorithm for ultrametrics. - Finally, we extend all these results to tree metrics by using only one additional pass through the stream and without asymptotically increasing the approximation factor.

Cite as

Amir Carmel, Debarati Das, Evangelos Kipouridis, and Evangelos Pipis. Fitting Tree Metrics and Ultrametrics in Data Streams. In 52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 334, pp. 42:1-42:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{carmel_et_al:LIPIcs.ICALP.2025.42,
  author =	{Carmel, Amir and Das, Debarati and Kipouridis, Evangelos and Pipis, Evangelos},
  title =	{{Fitting Tree Metrics and Ultrametrics in Data Streams}},
  booktitle =	{52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025)},
  pages =	{42:1--42:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-372-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{334},
  editor =	{Censor-Hillel, Keren and Grandoni, Fabrizio and Ouaknine, Jo\"{e}l and Puppis, Gabriele},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2025.42},
  URN =		{urn:nbn:de:0030-drops-234197},
  doi =		{10.4230/LIPIcs.ICALP.2025.42},
  annote =	{Keywords: Streaming, Clustering, Ultrametrics, Tree metrics, Distance fitting}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.29

Faster Approximate Elastic-Degenerate String Matching - Part B

Authors: Paweł Gawrychowski, Adam Górkiewicz, Pola Marciniak, Solon P. Pissis, and Karol Pokorski

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

We revisit the complexity of approximate pattern matching in an elastic-degenerate string. Such a string is a sequence of n finite sets of strings of total length N, and compactly describes a collection of strings obtained by first choosing exactly one string in every set, and then concatenating them together. This is motivated by the need of storing a collection of highly similar DNA sequences. The basic algorithmic question on elastic-degenerate strings is pattern matching: given such an elastic-degenerate string and a standard pattern of length m, check if the pattern occurs in one of the strings in the described collection. Bernardini et al. [SICOMP 2022] showed how to leverage fast matrix multiplication to obtain an Õ(nm^{ω-1})+𝒪(N)-time complexity for this problem, where ω is the matrix multiplication exponent. However, from the point of view of possible applications, it is more desirable to work with approximate pattern matching, where we seek approximate occurrences of the pattern. This generalization has been considered in a few papers already, but the best result so far for occurrences with k mismatches, where k is a constant, is the Õ(nm²+N)-time algorithm presented in Part A [CPM 2025]. This brings the question whether increasing the dependency on m from m^{ω-1} to quadratic is necessary when moving from k = 0 to larger (but still constant) k. We design an Õ(nm^{1.5}+N)-time algorithm for pattern matching with k mismatches in an elastic-degenerate string, for any constant k. To obtain this time bound, we leverage the structural characterization of occurrences with k mismatches of Charalampopoulos, Kociumaka, and Wellnitz [FOCS 2020] together with fast Fourier transform. We need to work with multiple patterns at the same time, instead of a single pattern, which requires refining the original characterization. This might be of independent interest.

Cite as

Paweł Gawrychowski, Adam Górkiewicz, Pola Marciniak, Solon P. Pissis, and Karol Pokorski. Faster Approximate Elastic-Degenerate String Matching - Part B. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 29:1-29:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{gawrychowski_et_al:LIPIcs.CPM.2025.29,
  author =	{Gawrychowski, Pawe{\l} and G\'{o}rkiewicz, Adam and Marciniak, Pola and Pissis, Solon P. and Pokorski, Karol},
  title =	{{Faster Approximate Elastic-Degenerate String Matching - Part B}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{29:1--29:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.29},
  URN =		{urn:nbn:de:0030-drops-231236},
  doi =		{10.4230/LIPIcs.CPM.2025.29},
  annote =	{Keywords: ED string, approximate pattern matching, Hamming distance, k mismatches}
}

Document

DOI: 10.4230/LIPIcs.WABI.2023.11

Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and Its Variants

Authors: Yutong Qiu, Yihang Shen, and Carl Kingsford

Published in: LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)

Abstract

The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly. Ebrahimpour Boroojeny et al. (2018) propose two ILP formulations for GTED and claim that GTED is polynomially solvable because the linear programming relaxation of one of the ILPs will always yield optimal integer solutions. The claim that GTED is polynomially solvable is contradictory to the complexity of existing string-to-graph matching problems. We resolve this conflict in complexity results by proving that GTED is NP-complete and showing that the ILPs proposed by Ebrahimpour Boroojeny et al. do not solve GTED but instead solve for a lower bound of GTED and are not solvable in polynomial time. In addition, we provide the first two, correct ILP formulations of GTED and evaluate their empirical efficiency. These results provide solid algorithmic foundations for comparing genome graphs and point to the direction of heuristics that estimate GTED efficiently.

Cite as

Yutong Qiu, Yihang Shen, and Carl Kingsford. Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and Its Variants. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 11:1-11:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{qiu_et_al:LIPIcs.WABI.2023.11,
  author =	{Qiu, Yutong and Shen, Yihang and Kingsford, Carl},
  title =	{{Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and Its Variants}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{11:1--11:22},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.11},
  URN =		{urn:nbn:de:0030-drops-186374},
  doi =		{10.4230/LIPIcs.WABI.2023.11},
  annote =	{Keywords: Integer Linear Programming, Genome Graphs, Flow Network, Graph Comparison}
}

Document

DOI: 10.4230/LIPIcs.WABI.2023.23

Reinforcement Learning for Robotic Liquid Handler Planning

Authors: Mohsen Ferdosi, Yuejun Ge, and Carl Kingsford

Published in: LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)

Abstract

Robotic liquid handlers play a crucial role in automating laboratory tasks such as sample preparation, high-throughput screening, and assay development. Manually designing protocols takes significant effort, and can result in inefficient protocols and involve human error. We investigates the application of reinforcement learning to automate the protocol design process resulting in reduced human labor and further automation in liquid handling. We develop a reinforcement learning agent that can automatically output the step-by-step protocol based on the initial state of the deck, reagent types and volumes, and the desired state of the reagents after the protocol is finished. We show that finding the optimal protocol for solving a liquid handler instance is NP-complete, and we present a reinforcement learning algorithm that can solve the planning problem practically for cases with a deck of up to 20 × 20 wells and four different types of reagents. We design and implement an actor-critic approach, and we train our agent using the Impala algorithm. Our findings demonstrate that reinforcement learning can be used to automatically program liquid handler robotic arms, enabling more precise and efficient planning for the liquid handler and laboratory automation.

Cite as

Mohsen Ferdosi, Yuejun Ge, and Carl Kingsford. Reinforcement Learning for Robotic Liquid Handler Planning. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 23:1-23:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{ferdosi_et_al:LIPIcs.WABI.2023.23,
  author =	{Ferdosi, Mohsen and Ge, Yuejun and Kingsford, Carl},
  title =	{{Reinforcement Learning for Robotic Liquid Handler Planning}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{23:1--23:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.23},
  URN =		{urn:nbn:de:0030-drops-186494},
  doi =		{10.4230/LIPIcs.WABI.2023.23},
  annote =	{Keywords: Liquid Handler, Reinforcement Learning, Planning}
}

Document

Complete Volume

DOI: 10.4230/LIPIcs.WABI.2020

LIPIcs, Volume 172, WABI 2020, Complete Volume

Authors: Carl Kingsford and Nadia Pisanti

Published in: LIPIcs, Volume 172, 20th International Workshop on Algorithms in Bioinformatics (WABI 2020)

Abstract

LIPIcs, Volume 172, WABI 2020, Complete Volume

Cite as

20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 1-360, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@Proceedings{kingsford_et_al:LIPIcs.WABI.2020,
  title =	{{LIPIcs, Volume 172, WABI 2020, Complete Volume}},
  booktitle =	{20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
  pages =	{1--360},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-161-0},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{172},
  editor =	{Kingsford, Carl and Pisanti, Nadia},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2020},
  URN =		{urn:nbn:de:0030-drops-127881},
  doi =		{10.4230/LIPIcs.WABI.2020},
  annote =	{Keywords: LIPIcs, Volume 172, WABI 2020, Complete Volume}
}

Document

Front Matter

DOI: 10.4230/LIPIcs.WABI.2020.0

Front Matter, Table of Contents, Preface, Conference Organization

Authors: Carl Kingsford and Nadia Pisanti

Published in: LIPIcs, Volume 172, 20th International Workshop on Algorithms in Bioinformatics (WABI 2020)

Abstract

Front Matter, Table of Contents, Preface, Conference Organization

Cite as

20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 0:i-0:x, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{kingsford_et_al:LIPIcs.WABI.2020.0,
  author =	{Kingsford, Carl and Pisanti, Nadia},
  title =	{{Front Matter, Table of Contents, Preface, Conference Organization}},
  booktitle =	{20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
  pages =	{0:i--0:x},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-161-0},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{172},
  editor =	{Kingsford, Carl and Pisanti, Nadia},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2020.0},
  URN =		{urn:nbn:de:0030-drops-127891},
  doi =		{10.4230/LIPIcs.WABI.2020.0},
  annote =	{Keywords: Front Matter, Table of Contents, Preface, Conference Organization}
}

37 Search Results for "Kingsford, Carl"

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Thanks for your feedback!

Could not send message