LIPIcs, Volume 344

25th International Conference on Algorithms for Bioinformatics (WABI 2025)




Event

WABI 2025, August 20-22, 2025, University of Maryland, College Park, MD, USA

Editors

Broňa Brejová
  • Comenius University in Bratislava, Slovakia
Rob Patro
  • University of Maryland, College Park, MD, USA

Publication Details

  • published at: 2025-08-15
  • Publisher: Schloss Dagstuhl – Leibniz-Zentrum für Informatik
  • ISBN: 978-3-95977-386-7

Documents
Document
Complete Volume
LIPIcs, Volume 344, WABI 2025, Complete Volume

Authors: Broňa Brejová and Rob Patro


Abstract
LIPIcs, Volume 344, WABI 2025, Complete Volume

Cite as

25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 1-430, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@Proceedings{brejova_et_al:LIPIcs.WABI.2025,
  title =	{{LIPIcs, Volume 344, WABI 2025, Complete Volume}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{1--430},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025},
  URN =		{urn:nbn:de:0030-drops-243498},
  doi =		{10.4230/LIPIcs.WABI.2025},
  annote =	{Keywords: LIPIcs, Volume 344, WABI 2025, Complete Volume}
}
Document
Front Matter
Front Matter, Table of Contents, Preface, Conference Organization

Authors: Broňa Brejová and Rob Patro


Abstract
Front Matter, Table of Contents, Preface, Conference Organization

Cite as

25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 0:i-0:xvi, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{brejova_et_al:LIPIcs.WABI.2025.0,
  author =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  title =	{{Front Matter, Table of Contents, Preface, Conference Organization}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{0:i--0:xvi},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.0},
  URN =		{urn:nbn:de:0030-drops-243477},
  doi =		{10.4230/LIPIcs.WABI.2025.0},
  annote =	{Keywords: Front Matter, Table of Contents, Preface, Conference Organization}
}
Document
Invited Talk
Recursive Parsing and Grammar Compression in the Era of Pangenomics (Invited Talk)

Authors: Christina Boucher


Abstract
Prefix-Free Parsing (PFP) and its recursive variant (RPFP) provide a scalable framework for compressing and indexing large genomic datasets. By enabling efficient construction of succinct data structures, these methods support fast and memory-efficient read alignment across thousands of genomes. Their deterministic and modular design makes them especially well-suited for pangenomics and large-scale sequence analysis.
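The trigger-based phrase splitting behind PFP can be sketched in a few lines of Python. This toy version is not from the talk: real PFP pads the text with sentinels and selects triggers with a Karp-Rabin rolling hash, whereas the hash rule and the parameters `w` and `p` below are arbitrary stand-ins. What it does show is the key invariant that consecutive phrases overlap by exactly w characters, so the parse losslessly determines the text:

```python
def pfp_parse(text, w=2, p=5):
    """Toy prefix-free parse: cut `text` into phrases ending at w-length
    'trigger' windows chosen by a hash rule, so consecutive phrases
    overlap by exactly w characters.  Illustrative only: real PFP pads
    with sentinels and uses a Karp-Rabin rolling hash."""
    def is_trigger(window):
        return sum(window.encode()) % p == 0  # stand-in for a rolling hash
    phrases, start = [], 0
    for i in range(len(text) - w + 1):
        if i > start and is_trigger(text[i:i + w]):
            phrases.append(text[start:i + w])  # phrase ends with the trigger
            start = i                          # next phrase begins at it
    phrases.append(text[start:])
    return phrases
```

Because each phrase ends with the trigger that starts the next one, concatenating the first phrase with every later phrase minus its w-character prefix reconstructs the input exactly.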

Cite as

Christina Boucher. Recursive Parsing and Grammar Compression in the Era of Pangenomics (Invited Talk). In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 1:1-1:2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{boucher:LIPIcs.WABI.2025.1,
  author =	{Boucher, Christina},
  title =	{{Recursive Parsing and Grammar Compression in the Era of Pangenomics}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{1:1--1:2},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.1},
  URN =		{urn:nbn:de:0030-drops-239278},
  doi =		{10.4230/LIPIcs.WABI.2025.1},
  annote =	{Keywords: Prefix-Free Parsing, Recursive Prefix-Free Parsing, Grammar-Based Compression, Succinct Data Structures, RePair Compression}
}
Document
Invited Talk
We Are What We Index; a Primer for the Wheeler Graph Era (Invited Talk)

Authors: Ben Langmead


Abstract
Since the arrival of second-generation sequencing, we have needed to build indexes over reference sequences - e.g. genomes and transcriptomes - in order to solve read alignment and classification problems efficiently [Langmead et al., 2009; Li and Durbin, 2009; Li et al., 2009]. The rule has been: what we can index determines what we can do. When indexing strings, we can use methods like suffix arrays [Manber and Myers, 1993], the Burrows-Wheeler Transform (BWT) [Burrows and Wheeler, 1994] / FM Index [Ferragina and Manzini, 2000], or k-mer indexes [Marchet et al., 2021]. What if we want to index objects more complex than strings? A pangenome, for example, is a large collection of similar strings, e.g. the hundreds of assemblies that make up the Human Pangenome Reference [Liao et al., 2023] or all the bacteria in the Refseq database [Goldfarb et al., 2025]. We may wish to combine these strings into a multiple sequence alignment (MSA) or a graph first. Can we index those efficiently? In many useful cases the answer is "yes," but in others the answer is "no." The story of how we learned exactly when the answer is "yes" versus "no" unfolded through a sequence of insights. Here we review this story, eventually arriving at the definition of Wheeler graphs as discovered and formalized by Gagie, Manzini and Sirén [Gagie et al., 2017]. We will focus on indexes based on the BWT, since these (a) are lossless full-text indexes, (b) are widely used in practice [Langmead et al., 2009; Li and Durbin, 2009], and (c) form the theoretical throughline for all the indexing strategies on the path to Wheeler graphs. We will trace the BWT-based indexing story from the early days of the FM Index, through its step-by-step gobbling up of trees (XBW-transform [Ferragina et al., 2005]) and de Bruijn Graphs (BOSS representation [Bowe et al., 2012]), and to the eventual formalization of Wheeler graphs [Gagie et al., 2017].
Along the way, we will define and update our notions of what it means to track a consecutive range of elements in the structure, and what it means for an index to be efficient. We will also connect these notions to automata [Sipser, 1996], noting how the indexability of Wheeler graphs (also called Wheeler automata) is connected to the mechanics of how to efficiently represent and simulate a finite automaton [Alanko et al., 2021]. With this context, we can imagine improved indexes for the future of genomics and pangenomics. De Bruijn graphs are extremely practical and are the most widely used among the non-string data structures that are also Wheeler graphs. But we might prefer other options. For example, de Bruijn graphs have the undesirable property that they usually encode not only the true longer-than-k substrings of the original text, but also "false" substrings that span repeats. Related to this, paths through the de Bruijn graph can "glue" substrings together that are horizontally distant in the MSA. Could other Wheeler graphs be practical alternatives to de Bruijn graphs? For instance, the original GCSA study by Sirén, Välimäki and Mäkinen proposed a way to convert a multiple alignment into an automaton that either is a Wheeler graph or can be made into one [Sirén et al., 2014]. This warrants further exploration, possibly with the help of improved tools for solving the NP-complete problem of recognizing whether a graph is a Wheeler graph [Chao et al., 2023]. The notion of BWT tunnels [Baier, 2018] gives another route: we can begin with the concatenated pangenome strings and compress them by identifying and collapsing BWT tunnels. This yields a Wheeler graph that is compressed like the de Bruijn graph, but without departing from the exact contents or coordinate systems of the original genomes.
The future might need us to explore all these Wheeler-graph indexes, along with the also highly practical and always-improving world of indexes built over collections of strings [Gagie et al., 2018].
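The "consecutive range of elements" idea at the heart of BWT-based indexing can be illustrated with a minimal FM-index-style backward search in Python. This is a textbook sketch, not code from the talk; naive O(n) rank queries stand in for the compressed rank structures used in practice:

```python
def bwt(s):
    """Burrows-Wheeler Transform via sorted rotations of s + '$'."""
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def backward_search(bwt_str, pattern):
    """Count occurrences of `pattern` by maintaining a consecutive range
    [lo, hi) of suffix-array rows, shrinking it one character at a time
    from the end of the pattern -- the core range-tracking idea."""
    # C[c] = number of characters in the text strictly smaller than c
    C = {c: sum(1 for x in bwt_str if x < c) for c in set(bwt_str)}
    def rank(c, i):  # occurrences of c in bwt_str[:i] (naive O(n))
        return bwt_str[:i].count(c)
    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

Each loop iteration is one LF-mapping step: the range of rows prefixed by the current pattern suffix stays consecutive, which is precisely the property that Wheeler graphs generalize beyond strings.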

Cite as

Ben Langmead. We Are What We Index; a Primer for the Wheeler Graph Era (Invited Talk). In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 2:1-2:2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{langmead:LIPIcs.WABI.2025.2,
  author =	{Langmead, Ben},
  title =	{{We Are What We Index; a Primer for the Wheeler Graph Era}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{2:1--2:2},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.2},
  URN =		{urn:nbn:de:0030-drops-239288},
  doi =		{10.4230/LIPIcs.WABI.2025.2},
  annote =	{Keywords: Indexing, Burrows-Wheeler Transform}
}
Document
Approximability of Longest Run Subsequence and Complementary Minimization Problems

Authors: Yuichi Asahiro, Mingyang Gong, Jesper Jansson, Guohui Lin, Sichen Lu, Eiji Miyano, Hirotaka Ono, Toshiki Saitoh, and Shunichi Tanaka


Abstract
We study the polynomial-time approximability of the Longest Run Subsequence problem (LRS for short) and its complementary minimization variant, the Minimum Run Subsequence Deletion problem (MRSD for short). For a string S = s₁ ⋯ s_n over an alphabet Σ, a subsequence S' of S is S' = s_{i₁} ⋯ s_{i_p} such that 1 ≤ i₁ < i₂ < … < i_p ≤ |S|. A run of a symbol σ ∈ Σ in S is a maximal substring of consecutive occurrences of σ. A run subsequence S' of S is a subsequence of S in which every symbol σ ∈ Σ occurs in at most one run. The co-subsequence S̄' of the subsequence S' = s_{i₁} ⋯ s_{i_p} in S is the subsequence obtained by deleting all the characters of S' from S, i.e., S̄' = s_{j₁} ⋯ s_{j_{n-p}} such that j₁ < j₂ < … < j_{n-p} and {j₁, …, j_{n-p}} = {1, …, n} ⧵ {i₁, …, i_p}. Given a string S, the goal of LRS (resp., MRSD) is to find a run subsequence S^* of S such that the length |S^*| is maximized (resp., the number |S̄^*| of deleted symbols from S is minimized) over all the run subsequences of S. Let k be the maximum number of symbol occurrences in the input S. It is known that LRS and MRSD are APX-hard even if k = 2. In this paper, we show that LRS can be approximated in polynomial time within factors of (k+2)/3 for k = 2 or 3, and 2(k+1)/5 for every k ≥ 4. Furthermore, we show that MRSD can be approximated in linear time within a factor of (k+4)/4 if k is even and (k+3)/4 if k is odd.
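To make the definitions above concrete, here is a small Python sketch (not from the paper) that tests the run-subsequence property and solves LRS by exhaustive search on tiny inputs. The paper's contribution is polynomial-time approximation; this exponential baseline merely illustrates the objective:

```python
from itertools import combinations

def is_run_subsequence(t):
    """True iff every symbol of t occurs in at most one run
    (a run = maximal block of equal consecutive symbols)."""
    seen = set()
    for i, c in enumerate(t):
        if i == 0 or t[i - 1] != c:  # a new run of c starts here
            if c in seen:
                return False
            seen.add(c)
    return True

def brute_force_lrs(s):
    """Exact LRS by enumerating subsequences, longest first.
    Exponential time: only for tiny inputs."""
    for p in range(len(s), 0, -1):
        for idx in combinations(range(len(s)), p):
            t = "".join(s[i] for i in idx)
            if is_run_subsequence(t):
                return t
    return ""
```

For example, "aba" is not a run subsequence (the symbol a spans two runs), while deleting one character of "abab" yields the valid run subsequence "abb".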

Cite as

Yuichi Asahiro, Mingyang Gong, Jesper Jansson, Guohui Lin, Sichen Lu, Eiji Miyano, Hirotaka Ono, Toshiki Saitoh, and Shunichi Tanaka. Approximability of Longest Run Subsequence and Complementary Minimization Problems. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 3:1-3:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{asahiro_et_al:LIPIcs.WABI.2025.3,
  author =	{Asahiro, Yuichi and Gong, Mingyang and Jansson, Jesper and Lin, Guohui and Lu, Sichen and Miyano, Eiji and Ono, Hirotaka and Saitoh, Toshiki and Tanaka, Shunichi},
  title =	{{Approximability of Longest Run Subsequence and Complementary Minimization Problems}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{3:1--3:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.3},
  URN =		{urn:nbn:de:0030-drops-239290},
  doi =		{10.4230/LIPIcs.WABI.2025.3},
  annote =	{Keywords: Longest run subsequence, minimum run subsequence deletion, approximation algorithm}
}
Document
Haplotype-Aware Long-Read Error Correction

Authors: Parvesh Barak, Daniel Gibney, and Chirag Jain


Abstract
Error correction of long reads is an important initial step in genome assembly workflows. For organisms with ploidy greater than one, it is important to preserve haplotype-specific variation during read correction. This challenge has driven the development of several haplotype-aware correction methods. However, existing methods are based on either ad-hoc heuristics or deep learning approaches. In this paper, we introduce a rigorous formulation for this problem. Our approach builds on the minimum error correction framework used in reference-based haplotype phasing. We prove that the proposed formulation for error correction of reads in the de novo setting, i.e., without using a reference genome, is NP-hard. To make our exact algorithm scale to large datasets, we introduce practical heuristics. Experiments using PacBio HiFi sequencing datasets from human and plant genomes show that our approach achieves accuracy comparable to state-of-the-art methods. The software is freely available at https://github.com/at-cg/HALE.
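The minimum error correction (MEC) objective that the formulation builds on can be illustrated with a brute-force Python sketch (illustrative only, not the paper's algorithm): partition the reads between two haplotypes and count the entries that must be flipped so that each side agrees with its per-site consensus:

```python
from itertools import product

def mec_cost(reads, assignment):
    """MEC cost of one bipartition: at each site, the minority entries
    on each side must be corrected (flipped).  -1 marks missing data."""
    cost = 0
    n_sites = len(reads[0])
    for side in (0, 1):
        group = [r for r, a in zip(reads, assignment) if a == side]
        for j in range(n_sites):
            col = [r[j] for r in group if r[j] != -1]
            if col:
                ones = sum(col)
                cost += min(ones, len(col) - ones)
    return cost

def brute_force_mec(reads):
    """Exact MEC by enumerating all 2^n read bipartitions (tiny inputs;
    the problem is NP-hard in general)."""
    return min(mec_cost(reads, a)
               for a in product((0, 1), repeat=len(reads)))
```

With two clean haplotypes the optimum is 0; every sequencing error in the matrix adds at least one forced flip, which is the quantity the correction step tries to minimize.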

Cite as

Parvesh Barak, Daniel Gibney, and Chirag Jain. Haplotype-Aware Long-Read Error Correction. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 4:1-4:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{barak_et_al:LIPIcs.WABI.2025.4,
  author =	{Barak, Parvesh and Gibney, Daniel and Jain, Chirag},
  title =	{{Haplotype-Aware Long-Read Error Correction}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{4:1--4:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.4},
  URN =		{urn:nbn:de:0030-drops-239300},
  doi =		{10.4230/LIPIcs.WABI.2025.4},
  annote =	{Keywords: Genome assembly, phasing, clustering, overlap graph, consensus}
}
Document
Improved Algorithms for Bi-Partition Function Computation

Authors: John D. Bridgers, Jan Hoinka, S. Cenk Sahinalp, Salem Malikic, Teresa M. Przytycka, and Funda Ergun


Abstract
The evolutionary history of a tumor, inferred from single-cell sequencing data, is typically represented as a tree in which each subtree corresponds to a clade of cells seeded by a specific set of mutations. Traditional methods usually identify a single most likely tree for downstream analyses, such as detecting driver mutations, studying mutation co-occurrence patterns, and identifying common evolutionary trajectories. However, the reliability of such inferred trees, particularly their topology, clade composition, and mutational placements, often remains uncertain. To quantify this uncertainty, the concept of a Bi-partition Function was recently introduced, providing a probabilistic measure of how reliably a mutation seeds a given clade of cells. The single available algorithm for estimating the Bi-partition Function relies on simplifying assumptions and uses sampling for limited exploration of the tree-space. In this paper, we introduce the first exact algorithm for computing the Bi-partition Function. Our algorithm scales linearly with the number of mutations but exhibits super-exponential complexity with respect to the number of cells. Despite this complexity, it establishes crucial ground truth values, essential for accurately benchmarking and validating approximate methods. Additionally, we present a GPU-accelerated version of the available sampling-based algorithm, significantly boosting the computational performance through large-scale parallelization, enabling more accurate Bi-partition Function estimates via deeper exploration of the tree spaces. We compare our methods on synthetic datasets, demonstrating that especially when the number of mutations sufficiently exceeds the number of cells, our GPU-accelerated sampling algorithm closely approximates the exact ground truth values.

Cite as

John D. Bridgers, Jan Hoinka, S. Cenk Sahinalp, Salem Malikic, Teresa M. Przytycka, and Funda Ergun. Improved Algorithms for Bi-Partition Function Computation. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 5:1-5:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{bridgers_et_al:LIPIcs.WABI.2025.5,
  author =	{Bridgers, John D. and Hoinka, Jan and Sahinalp, S. Cenk and Malikic, Salem and Przytycka, Teresa M. and Ergun, Funda},
  title =	{{Improved Algorithms for Bi-Partition Function Computation}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{5:1--5:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.5},
  URN =		{urn:nbn:de:0030-drops-239318},
  doi =		{10.4230/LIPIcs.WABI.2025.5},
  annote =	{Keywords: Tumor Evolution, Bi-partition Function, Single-Cell Sequencing, Algorithms}
}
Document
Fast Pseudoalignment Queries on Compressed Colored de Bruijn Graphs

Authors: Alessio Campanelli, Giulio Ermanno Pibiri, and Rob Patro


Abstract
Motivation. Indexes for the colored de Bruijn graph (c-dBG) play a crucial role in computational biology by facilitating complex tasks such as read mapping and assembly. These indexes map k-mers (substrings of length k) appearing in a large collection of reference strings to the set of identifiers of the strings where they appear. These sets, colloquially referred to as color sets, tend to occupy large quantities of memory, especially for large pangenomes. Our previous work thus focused on leveraging the repetitiveness of the color sets to improve the space effectiveness of the resulting index. As a matter of fact, repetition-aware indexes can be up to one order of magnitude smaller on large pangenomes compared to indexes that do not exploit such repetitiveness. Such improved space effectiveness, on the other hand, imposes an overhead at query time when performing tasks such as pseudoalignment that require the collection and processing of multiple related color sets. Methods. In this paper, we show how to avoid this overhead. We devise novel query algorithms tailored to the specific repetition-aware representations adopted by the Fulgor index, a state-of-the-art c-dBG index, to significantly improve its pseudoalignment efficiency without consuming additional space. Results. Our results indicate that with increasing redundancy in the pangenomes, the compression factor provided by the Fulgor index increases, while the relative query time actually decreases. For example, while the space of the Fulgor index improves by 2.5× with repetition-aware compression and its query time improves by 1.6× on a collection of 5,000 Salmonella enterica genomes, these factors become (6.1×, 2.8×) and (11.2×, 3.2×) for 50,000 and 150,000 genomes respectively. For an even larger collection of 300,000 genomes, we obtained an index that is 22.3× smaller and 2.2× faster.
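The color-set queries that dominate pseudoalignment cost can be sketched with a naive Python index: a plain hash map stands in for the compressed c-dBG, and the full-intersection rule is assumed for illustration (Fulgor supports other threshold rules as well):

```python
def build_color_map(references, k):
    """Map each k-mer to its color set: the ids of the references
    containing it.  (Hash map stand-in for a compressed c-dBG index.)"""
    colors = {}
    for rid, ref in enumerate(references):
        for i in range(len(ref) - k + 1):
            colors.setdefault(ref[i:i + k], set()).add(rid)
    return colors

def pseudoalign(read, colors, k):
    """Full-intersection pseudoalignment: references that contain every
    k-mer of the read that is present in the index."""
    result = None
    for i in range(len(read) - k + 1):
        cs = colors.get(read[i:i + k])
        if cs is None:
            continue  # k-mer absent from the index: skip it
        result = set(cs) if result is None else result & cs
    return result or set()
```

The repeated set intersections over related color sets are exactly the work that the paper's repetition-aware query algorithms reorganize.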

Cite as

Alessio Campanelli, Giulio Ermanno Pibiri, and Rob Patro. Fast Pseudoalignment Queries on Compressed Colored de Bruijn Graphs. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 6:1-6:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{campanelli_et_al:LIPIcs.WABI.2025.6,
  author =	{Campanelli, Alessio and Pibiri, Giulio Ermanno and Patro, Rob},
  title =	{{Fast Pseudoalignment Queries on Compressed Colored de Bruijn Graphs}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{6:1--6:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.6},
  URN =		{urn:nbn:de:0030-drops-239327},
  doi =		{10.4230/LIPIcs.WABI.2025.6},
  annote =	{Keywords: Colored de Bruijn graphs, Pseudoalignment, Repetition-aware compression}
}
Document
Sequence Similarity Estimation by Random Subsequence Sketching

Authors: Ke Chen, Vinamratha Pattar, and Mingfu Shao


Abstract
Sequence similarity estimation is essential for many bioinformatics tasks, including functional annotation, phylogenetic analysis, and overlap graph construction. Alignment-free methods aim to solve large-scale sequence similarity estimation by mapping sequences to more easily comparable features that can approximate edit distances efficiently. Substrings or k-mers, as the dominant choice of features, face an unavoidable compromise between sensitivity and specificity when selecting the proper k-value. Recently, subsequence-based features have shown improved performance, but they are computationally demanding, and determining the ideal subsequence length remains an intricate art. In this work, we introduce SubseqSketch, a novel alignment-free scheme that maps a sequence to an integer vector, where the entries correspond to dynamic, rather than fixed, lengths of random subsequences. The cosine similarity between these vectors exhibits a strong correlation with the edit similarity between the original sequences. Through experiments on benchmark datasets, we demonstrate that SubseqSketch is both efficient and effective across various alignment-free tasks, including nearest neighbor search and phylogenetic clustering. A C++ implementation of SubseqSketch is openly available at https://github.com/Shao-Group/SubseqSketch.
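A hedged guess at the flavor of the scheme: the sketch below maps a sequence to an integer vector where entry i records how far the i-th random test sequence can be greedily matched as a subsequence. This is only one plausible reading of "dynamic lengths of random subsequences" and the published SubseqSketch construction may differ; cosine similarity then compares the vectors as in the abstract:

```python
import random

def sketch(s, tests):
    """Entry i = length of the longest prefix of tests[i] occurring as a
    subsequence of s (greedy left-to-right matching is optimal for this).
    An illustrative guess, not the paper's exact construction."""
    vec = []
    for t in tests:
        it = iter(s)
        matched = 0
        for c in t:
            if c in it:  # consumes `it` up to and including the match
                matched += 1
            else:
                break
        vec.append(matched)
    return vec

def cosine(u, v):
    """Cosine similarity between two sketch vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0
```

On this toy version, sketches of near-identical strings are almost parallel, while a dissimilar string scores noticeably lower, mirroring the claimed correlation with edit similarity.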

Cite as

Ke Chen, Vinamratha Pattar, and Mingfu Shao. Sequence Similarity Estimation by Random Subsequence Sketching. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 7:1-7:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{chen_et_al:LIPIcs.WABI.2025.7,
  author =	{Chen, Ke and Pattar, Vinamratha and Shao, Mingfu},
  title =	{{Sequence Similarity Estimation by Random Subsequence Sketching}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{7:1--7:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.7},
  URN =		{urn:nbn:de:0030-drops-239332},
  doi =		{10.4230/LIPIcs.WABI.2025.7},
  annote =	{Keywords: Alignment-free sequence comparison, Phylogenetic clustering, Nearest neighbor search, Edit distance embedding}
}
Document
Lossless Pangenome Indexing Using Tag Arrays

Authors: Parsa Eskandar, Benedict Paten, and Jouni Sirén


Abstract
Pangenome graphs represent genomic variation by encoding multiple haplotypes within a unified graph structure. However, efficient and lossless indexing of such structures remains challenging due to the scale and complexity of pangenomic data. We present a practical and scalable indexing framework based on tag arrays, which annotate positions in the Burrows-Wheeler transform (BWT) with graph coordinates. Our method extends the FM-index with a run-length compressed tag structure that enables efficient retrieval of all unique graph locations where a query pattern appears. We introduce a novel construction algorithm that combines unique k-mers, graph-based extensions, and haplotype traversal to compute the tag array in a memory-efficient manner. To support large genomes, we process each chromosome independently and then merge the results into a unified index using properties of the multi-string BWT and r-index. Our evaluation on the HPRC graphs demonstrates that the tag array structure compresses effectively, scales well with added haplotypes, and preserves accurate mapping information across diverse regions of the genome. This indexing method enables lossless and haplotype-aware querying in complex pangenomes and offers a practical indexing layer for developing scalable aligners and downstream graph-based analysis tools.
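The run-length compression that makes per-position tags affordable can be sketched directly. This is a generic RLE toy, not the paper's data structure, which couples the runs to the BWT and the r-index:

```python
def rle(tags):
    """Run-length encode a tag array as (tag, run_length) pairs -- the
    kind of compression that exploits long runs of equal graph
    coordinates along the BWT."""
    runs = []
    for t in tags:
        if runs and runs[-1][0] == t:
            runs[-1] = (t, runs[-1][1] + 1)
        else:
            runs.append((t, 1))
    return runs

def tag_at(runs, i):
    """Recover tags[i] from the run-length encoding."""
    for t, length in runs:
        if i < length:
            return t
        i -= length
    raise IndexError(i)
```

Space is proportional to the number of runs rather than the number of BWT positions, which is why the structure "compresses effectively" and "scales well with added haplotypes" as similar haplotypes lengthen the runs.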

Cite as

Parsa Eskandar, Benedict Paten, and Jouni Sirén. Lossless Pangenome Indexing Using Tag Arrays. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 8:1-8:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{eskandar_et_al:LIPIcs.WABI.2025.8,
  author =	{Eskandar, Parsa and Paten, Benedict and Sir\'{e}n, Jouni},
  title =	{{Lossless Pangenome Indexing Using Tag Arrays}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{8:1--8:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.8},
  URN =		{urn:nbn:de:0030-drops-239348},
  doi =		{10.4230/LIPIcs.WABI.2025.8},
  annote =	{Keywords: pangenome indexing, Burrows-Wheeler transform, tag arrays}
}
Document
Dolphyin: A Combinatorial Algorithm for Identifying 1-Dollo Phylogenies in Cancer

Authors: Daniel W. Feng and Mohammed El-Kebir


Abstract
Several recent cancer phylogeny inference methods have used the k-Dollo evolutionary model for single-nucleotide variants. Specifically, in this problem one is given an m × n binary matrix B and seeks a rooted tree T with m leaves that correspond to the m rows of B, where each node of T is labeled by a binary state for each of the n characters subject to the restriction that each character is gained at most once (0-to-1 transition) and subsequently lost at most k times (1-to-0 transitions). The 1-Dollo variant (k = 1), also known as the persistent perfect phylogeny, in which each character may be lost at most once, has been studied extensively, but its hardness remains an open question. Here, we prove that the 1-Dollo Linear Phylogeny (1DLP) problem, where we additionally require the resulting 1-Dollo phylogeny T to be linear, is equivalent to verifying whether the input matrix B adheres to the Consecutive Ones Property (C1P), which can be solved in polynomial time. Due to this equivalence, several known NP-hardness results for relevant variants of C1P carry over to 1DLP, including the minimization of false negatives (0-to-1 modifications to the input matrix B) or the allowance of 2 gains and 2 losses. We furthermore show how to recursively decompose any, not necessarily linear, 1-Dollo phylogeny T into several 1-Dollo linear phylogenies, connected by matching branching points. We extend this characterization to matrices B that admit 1-Dollo phylogenies, giving necessary and sufficient conditions for the existence of a novel decomposition of B into several submatrices and corresponding branching points. This decomposition forms the basis of Dolphyin, a new exponential-time algorithm for inferring 1-Dollo phylogenies that efficiently leverages the determination of linear 1-Dollo phylogenies as a subroutine. Dolphyin can also be applied to input matrices B with false negatives.
We demonstrate that Dolphyin is runtime-competitive with a previous integer linear programming based algorithm SPhyR on simulated datasets. We additionally analyze simulated datasets with false negative errors and find that in the median case, Dolphyin infers 1-Dollo phylogenies with inferred error rates at or below the ground truth rate. Finally, we apply Dolphyin to 99 acute myeloid leukemia single-cell sequencing datasets, finding that the majority of the cancers can be explained by 1-Dollo phylogenies with false negative error rates in line with the used sequencing technology. Availability. Dolphyin is available at: https://github.com/elkebir-group/Dolphyin.
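The C1P side of the equivalence is easy to state as code. The brute-force Python check below (illustrative only; actual C1P recognition runs in linear time, e.g. with PQ-trees) asks whether some column permutation makes every row's 1s contiguous:

```python
from itertools import permutations

def has_consecutive_ones(row):
    """True iff the 1s of `row` occupy a contiguous block."""
    ones = [j for j, x in enumerate(row) if x]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)

def is_c1p(matrix):
    """Brute-force Consecutive Ones Property test over all column
    permutations.  Exponential; for tiny matrices only."""
    n_cols = len(matrix[0])
    return any(
        all(has_consecutive_ones([row[j] for j in perm]) for row in matrix)
        for perm in permutations(range(n_cols))
    )
```

The second test case below is the edge-vertex incidence matrix of a 4-cycle: each of its four rows needs its two 1s adjacent, but a linear order of four columns has only three adjacent pairs, so no permutation works.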

Cite as

Daniel W. Feng and Mohammed El-Kebir. Dolphyin: A Combinatorial Algorithm for Identifying 1-Dollo Phylogenies in Cancer. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 9:1-9:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{feng_et_al:LIPIcs.WABI.2025.9,
  author =	{Feng, Daniel W. and El-Kebir, Mohammed},
  title =	{{Dolphyin: A Combinatorial Algorithm for Identifying 1-Dollo Phylogenies in Cancer}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{9:1--9:23},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.9},
  URN =		{urn:nbn:de:0030-drops-239356},
  doi =		{10.4230/LIPIcs.WABI.2025.9},
  annote =	{Keywords: Intra-tumor heterogeneity, persistent perfect phylogeny, consecutive ones property, combinatorics}
}
Document
DiVerG: Scalable Distance Index for Validation of Paired-End Alignments in Sequence Graphs

Authors: Ali Ghaffaari, Alexander Schönhuth, and Tobias Marschall


Abstract
Determining the distance between two loci within a genomic region is a recurrent operation in various tasks in computational genomics. A notable example of this task arises in paired-end read mapping as a form of validation of distances between multiple alignments. While straightforward for a single genome, graph-based reference structures render the operation considerably more involved. Given the sheer number of such queries in a typical read mapping experiment, an efficient algorithm for answering distance queries is crucial. In this paper, we introduce DiVerG, a compact data structure as well as a fast and scalable algorithm, for constructing distance indexes for general sequence graphs on multi-core CPU and many-core GPU architectures. DiVerG is based on PairG [Jain et al., 2019], but overcomes the limitations of PairG by exploiting the extensive potential for improvements in terms of scalability and space efficiency. As a consequence, DiVerG can process substantially larger datasets, such as whole human genomes, which are unmanageable by PairG. DiVerG offers faster index construction time and consistently faster query time, with gains proportional to the size of the underlying compact data structure. We demonstrate that our method performs favorably on multiple real datasets at various scales. DiVerG achieves superior performance over PairG, e.g., a 2.5-4x speed-up in query time, a 44-340x smaller index size, and 3-50x faster construction time for the genome graph of the MHC region, a particularly variable region of the human genome. The implementation is available at: https://github.com/cartoonist/diverg
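The query such a distance index answers can be stated as a naive BFS sketch in Python: is there a path from u to v whose length lies within a given range? This is the per-query baseline that index structures like PairG and DiVerG replace with a precomputed sparse-matrix reachability index:

```python
from collections import deque

def within_distance(adj, u, v, lo, hi):
    """True iff some path from u to v has length d with lo <= d <= hi.
    Plain BFS over (vertex, depth) pairs; `adj` maps a vertex to its
    successor list.  Naive baseline, not the indexed approach."""
    seen = {(u, 0)}
    frontier = deque([(u, 0)])
    while frontier:
        x, d = frontier.popleft()
        if x == v and lo <= d <= hi:
            return True
        if d == hi:
            continue  # never extend past the upper distance bound
        for y in adj.get(x, []):
            if (y, d + 1) not in seen:
                seen.add((y, d + 1))
                frontier.append((y, d + 1))
    return False
```

In paired-end validation, u and v are the alignment loci of the two mates and [lo, hi] is the admissible insert-size range; answering millions of such queries is what motivates a precomputed index.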

Cite as

Ali Ghaffaari, Alexander Schönhuth, and Tobias Marschall. DiVerG: Scalable Distance Index for Validation of Paired-End Alignments in Sequence Graphs. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 10:1-10:24, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{ghaffaari_et_al:LIPIcs.WABI.2025.10,
  author =	{Ghaffaari, Ali and Sch\"{o}nhuth, Alexander and Marschall, Tobias},
  title =	{{DiVerG: Scalable Distance Index for Validation of Paired-End Alignments in Sequence Graphs}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{10:1--10:24},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.10},
  URN =		{urn:nbn:de:0030-drops-239369},
  doi =		{10.4230/LIPIcs.WABI.2025.10},
  annote =	{Keywords: Sequence graph, distance index, read mapping, sparse matrix}
}
Document
Mutational Signature Refitting on Sparse Pan-Cancer Data

Authors: Gal Gilad, Teresa M. Przytycka, and Roded Sharan


Abstract
Mutational processes shape cancer genomes, leaving characteristic marks that are termed signatures. The level of activity of each such process, or its signature exposure, provides important information on the disease, improving patient stratification and the prediction of drug response. Thus, there is growing interest in developing refitting methods that decipher those exposures. Previous work in this domain was unsupervised in nature, employing algebraic decomposition and probabilistic inference methods. Here we provide a supervised approach to the problem of signature refitting and show its superiority over current methods. Our method, SuRe, leverages a neural network model to capture correlations between signature exposures in real data. We show that SuRe outperforms previous methods on sparse mutation data from tumor-type-specific data sets, as well as pan-cancer data sets, with an increasing advantage as the data become sparser. We further demonstrate its utility in clinical settings.

Cite as

Gal Gilad, Teresa M. Przytycka, and Roded Sharan. Mutational Signature Refitting on Sparse Pan-Cancer Data. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 11:1-11:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{gilad_et_al:LIPIcs.WABI.2025.11,
  author =	{Gilad, Gal and Przytycka, Teresa M. and Sharan, Roded},
  title =	{{Mutational Signature Refitting on Sparse Pan-Cancer Data}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{11:1--11:23},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.11},
  URN =		{urn:nbn:de:0030-drops-239374},
  doi =		{10.4230/LIPIcs.WABI.2025.11},
  annote =	{Keywords: mutational signatures, signature refitting, cancer genomics, genomic data analysis, somatic mutations}
}
Document
Extension of Partial Atom-To-Atom Maps: Uniqueness and Algorithms

Authors: Marcos E. González Laffitte, Tieu-Long Phan, and Peter F. Stadler


Abstract
Chemical reaction databases typically report the molecular structures of reactant and product compounds, as well as their stoichiometry, but lack information, in particular, on the correspondence of reactant and product atoms. These atom-to-atom maps (AAM), however, are crucial for applications including chemical synthesis planning in organic chemistry and the analysis of isotope labeling experiments in modern metabolomics. AAMs therefore need to be reconstructed computationally. The situation is further aggravated by the fact that chemically correct AAMs are, fundamentally, determined by quantum-mechanical phenomena and thus cannot be reliably computed by solving graph-theoretical optimization problems defined by the reactant and product structures. A viable solution for this problem is to shift the focus to first identifying a partial AAM containing the reaction center, i.e., covering the atoms incident with all bonds that change during a reaction. This then leads to the problem of extending the partial map to the full reaction. The AAM of a reaction is faithfully represented by the Imaginary Transition State (ITS) graph, providing a convenient graph-theoretic framework to address the questions of when and how a partial AAM can be extended. We show that a unique extension exists if and only if the partial AAM covers the reaction center. In this case the extension can be computed by solving a constrained graph-isomorphism search between specific subgraphs of ITS graphs. We close by benchmarking different tools for this task.

Cite as

Marcos E. González Laffitte, Tieu-Long Phan, and Peter F. Stadler. Extension of Partial Atom-To-Atom Maps: Uniqueness and Algorithms. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 12:1-12:26, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{gonzalezlaffitte_et_al:LIPIcs.WABI.2025.12,
  author =	{Gonz\'{a}lez Laffitte, Marcos E. and Phan, Tieu-Long and Stadler, Peter F.},
  title =	{{Extension of Partial Atom-To-Atom Maps: Uniqueness and Algorithms}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{12:1--12:26},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.12},
  URN =		{urn:nbn:de:0030-drops-239410},
  doi =		{10.4230/LIPIcs.WABI.2025.12},
  annote =	{Keywords: atom-to-atom maps, imaginary transition state (ITS) graphs, condensed graph of the reaction (CGR), chemical reaction mechanisms, molecular graphs, metabolic networks, chemical synthesis planning, constrained graph isomorphism}
}
Document
Spark: Sparsified Hierarchical Energy Minimization of RNA Pseudoknots

Authors: Mateo Gray, Sebastian Will, and Hosna Jabbari


Abstract
Motivation. Determining RNA structure is essential for understanding RNA function and interaction networks. Although experimental techniques yield high‑accuracy structures, they are costly and time‑consuming; thus, computational approaches - especially minimum‑free‑energy (MFE) prediction algorithms - are indispensable. Accurately predicting pseudoknots, however, remains challenging because their inclusion usually leads to prohibitive computational complexity. Recent work demonstrated that sparsification can improve the efficiency of complex pseudoknot prediction algorithms such as Knotty. This finding suggests similar gains are possible for already efficient algorithms like HFold, which targets a complementary class of hierarchically constrained pseudoknots. Results. We introduce Spark, an exact, fully sparsified algorithm for predicting pseudoknotted RNA structures. Like its non‑sparsified predecessor HFold, Spark searches for the minimum‑energy structure under the HotKnots 2.0 energy model, a pseudoknot extension of the Turner model. Because the sparsification is non‑heuristic, Spark preserves the asymptotic time‑ and space‑complexity guarantees of HFold while greatly reducing the constant factors. We benchmarked the performance of Spark against HFold and, as a pseudoknot‑free baseline, RNAfold. Compared with HFold, Spark substantially lowers both run time and memory usage, while achieving run‑time figures close to those of RNAfold. Across all tested sequence lengths, Spark used the least memory and consistently ran faster than HFold. Conclusion. By extending non‑heuristic sparsification to hierarchical pseudoknot prediction, Spark delivers an exceptionally fast and memory‑efficient tool for accurate prediction of pseudoknotted RNA structures, enabling routine analysis of long sequences. The algorithm broadens the practical scope of computational RNA biology and provides a solid foundation for future advances in structure‑based functional annotation. Availability.
Spark’s implementation and detailed results are available at https://github.com/TheCOBRALab/Spark.
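For background on the interval dynamic programming that MFE folding algorithms refine, the classic Nussinov recursion maximizes the number of nested (pseudoknot-free) base pairs. This is a minimal sketch of that textbook baseline only, not Spark's sparsified energy model or the HotKnots 2.0 energy function:

```python
def nussinov(seq, min_loop=3):
    """Nussinov DP: maximum number of nested base pairs in an RNA sequence,
    with a minimum hairpin loop of min_loop unpaired bases."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):        # interval length j - i
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])   # i or j unpaired
            if (seq[i], seq[j]) in pairs:            # pair i with j
                best = max(best, dp[i + 1][j - 1] + 1)
            for k in range(i + 1, j):                # bifurcation
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]
```

Energy-based predictors such as HFold and Spark replace the pair count with Turner-model free energies and add further recursion cases, but keep this interval structure.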

Cite as

Mateo Gray, Sebastian Will, and Hosna Jabbari. Spark: Sparsified Hierarchical Energy Minimization of RNA Pseudoknots. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 13:1-13:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{gray_et_al:LIPIcs.WABI.2025.13,
  author =	{Gray, Mateo and Will, Sebastian and Jabbari, Hosna},
  title =	{{Spark: Sparsified Hierarchical Energy Minimization of RNA Pseudoknots}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{13:1--13:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.13},
  URN =		{urn:nbn:de:0030-drops-239383},
  doi =		{10.4230/LIPIcs.WABI.2025.13},
  annote =	{Keywords: RNA, MFE, Secondary Structure Prediction, Pseudoknot, Sparsification, Space Complexity, Time Complexity}
}
Document
Human Readable Compression of GFA Paths Using Grammar-Based Code

Authors: Peter Heringer and Daniel Doerr


Abstract
Pangenome graphs offer a compact and comprehensive representation of genomic diversity, improving tasks such as variant calling, genotyping, and other downstream analyses. Although the underlying graph structures scale sublinearly with the number of haplotypes, the widely used GFA file format suffers from rapidly growing file sizes due to the explicit and repetitive encoding of haplotype paths. In this work, we introduce an extension to the GFA format that enables efficient grammar-based compression of haplotype paths while retaining human readability. In addition, grammar-based encoding provides an efficient in-memory data structure that does not require decompression and even improves the runtime of many computational tasks that involve haplotype comparisons. We present sqz, a method that makes use of the proposed format extension to encode haplotype paths using byte pair encoding, a grammar-based compression scheme. We evaluate sqz on recent human pangenome graphs from Heumos et al. and the Human Pangenome Reference Consortium (HPRC), comparing it to the existing compressors bgzip, gbz, and sequitur. sqz scales sublinearly with the number of haplotypes in a pangenome graph and consistently achieves higher compression ratios than sequitur, up to 5 times better compression than bgzip on the HPRC graphs, and up to 10 times better on the graph from Heumos et al. When combined with bgzip, sqz matches or exceeds the compression ratio of gbz across all our datasets. These results demonstrate the potential of our proposed extension of the GFA format in reducing haplotype path redundancy and improving storage efficiency for pangenome graphs.
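Byte pair encoding, the grammar-based scheme sqz builds on, can be illustrated on a toy path of segment symbols: repeatedly replace the most frequent adjacent symbol pair with a fresh nonterminal, recording a grammar rule for each replacement. This is a generic sketch of the idea only, not sqz's actual format or implementation; all names are illustrative:

```python
from collections import Counter

def byte_pair_encode(seq, min_count=2):
    """Greedy byte pair encoding of a symbol sequence into a shorter
    sequence plus a straight-line grammar (nonterminal -> symbol pair)."""
    rules, next_id, seq = {}, 0, list(seq)
    while True:
        pair_counts = Counter(zip(seq, seq[1:]))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < min_count:
            break
        nt = f"R{next_id}"
        next_id += 1
        rules[nt] = pair
        out, i = [], 0            # replace non-overlapping occurrences
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(sym, rules):
    """Decode one symbol back to the original segment symbols."""
    if sym not in rules:
        return [sym]
    left, right = rules[sym]
    return expand(left, rules) + expand(right, rules)
```

Because the grammar can be traversed without decompressing, haplotype comparisons can operate directly on the encoded form, which is the property the abstract highlights.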

Cite as

Peter Heringer and Daniel Doerr. Human Readable Compression of GFA Paths Using Grammar-Based Code. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 14:1-14:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{heringer_et_al:LIPIcs.WABI.2025.14,
  author =	{Heringer, Peter and Doerr, Daniel},
  title =	{{Human Readable Compression of GFA Paths Using Grammar-Based Code}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{14:1--14:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.14},
  URN =		{urn:nbn:de:0030-drops-239395},
  doi =		{10.4230/LIPIcs.WABI.2025.14},
  annote =	{Keywords: pangenomics, pangenome graphs, compression, grammar-based code, byte pair encoding}
}
Document
Average-Tree Phylogenetic Diversity of Networks

Authors: Leo van Iersel, Mark Jones, Jannik Schestag, Celine Scornavacca, and Mathias Weller


Abstract
Phylogenetic diversity is a measure used to quantify the biodiversity of a set of species. Here, we introduce the "average-tree" phylogenetic diversity score in rooted binary phylogenetic networks and consider algorithms for computing and maximizing the score on a given network. Basically, the score is the weighted average of the phylogenetic diversity scores in all trees displayed by the network, with the weights determined by the inheritance probabilities on the reticulation edges used in the embeddings. We show that computing the score of a given set of taxa in a given network is #P-hard, directly implying #P-hardness of finding a subset of k taxa achieving maximum diversity score and, thereby, ruling out polynomial-time algorithms for these problems unless the polynomial hierarchy collapses. However, we show that both problems can be solved efficiently if the input network is close to being a tree in the sense that its reticulation number is small. More precisely, we prove that we can solve the optimization problem in networks with n leaves and r reticulations in 2^{𝒪(r)}⋅ n⋅ k time. Using experiments on data produced by simulating a reticulate-evolution process, we show that our algorithm runs efficiently on networks with hundreds of taxa and tens of reticulations.
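The network score above averages the classical phylogenetic diversity of the displayed trees. For background, a minimal sketch of tree phylogenetic diversity (the total branch length on the union of root-to-taxon paths), assuming a simple parent-pointer representation; this is the base measure, not the authors' network algorithm:

```python
def phylogenetic_diversity(parent, length, taxa):
    """Sum of branch lengths on the union of root-to-taxon paths.
    parent: child -> parent (root has no entry); length: child -> branch length."""
    used = set()
    for taxon in taxa:
        v = taxon
        while v in parent and v not in used:  # stop at root or a shared path
            used.add(v)
            v = parent[v]
    return sum(length[v] for v in used)
```

The average-tree score weights such per-tree values by the inheritance probabilities of the reticulation edges used in each embedding.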

Cite as

Leo van Iersel, Mark Jones, Jannik Schestag, Celine Scornavacca, and Mathias Weller. Average-Tree Phylogenetic Diversity of Networks. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 15:1-15:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{vaniersel_et_al:LIPIcs.WABI.2025.15,
  author =	{van Iersel, Leo and Jones, Mark and Schestag, Jannik and Scornavacca, Celine and Weller, Mathias},
  title =	{{Average-Tree Phylogenetic Diversity of Networks}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{15:1--15:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.15},
  URN =		{urn:nbn:de:0030-drops-239405},
  doi =		{10.4230/LIPIcs.WABI.2025.15},
  annote =	{Keywords: phylogenetic diversity, phylogenetic networks, network phylogenetic diversity, algorithms, computational complexity}
}
Document
Estimation of Substitution and Indel Rates via k-mer Statistics

Authors: Mahmudur Rahman Hera, Paul Medvedev, David Koslicki, and Antonio Blanca


Abstract
Methods utilizing k-mers are widely used in bioinformatics, yet our understanding of their statistical properties under realistic mutation models remains incomplete. Previously, substitution-only mutation models have been considered to derive precise expectations and variances for mutated k-mers and intervals of mutated and non-mutated sequences. In this work, we consider a mutation model that incorporates insertions and deletions in addition to single-nucleotide substitutions. Within this framework, we derive closed-form k-mer-based estimators for the three fundamental mutation parameters: the substitution, deletion, and insertion rates. We provide theoretical guarantees in the form of concentration inequalities, ensuring the accuracy of our estimators under reasonable model assumptions. Empirical evaluations on simulated evolution of genomic sequences confirm our theoretical findings, demonstrating that accounting for insertion and deletion signals allows for accurate estimation of mutation rates and improves upon the results obtained by considering a substitution-only model. An implementation estimating the mutation parameters from a pair of FASTA files is available here: https://github.com/KoslickiLab/estimate_rates_using_mutation_model.git. The results presented in this manuscript can be reproduced using the code available here: https://github.com/KoslickiLab/est_rates_experiments.git.
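The substitution-only baseline that this work extends can be stated compactly: if each position mutates independently with probability p, a k-mer survives unchanged with probability (1-p)^k, giving the point estimate p = 1 - q^(1/k) from the retained fraction q of k-mers. A sketch of that baseline only (not the paper's indel-aware estimators; function names are illustrative):

```python
def kmer_set(s, k):
    """Distinct k-mers of a string."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def estimate_substitution_rate(ref, mutated, k):
    """Substitution-only point estimate: under i.i.d. substitutions at rate p,
    a k-mer survives with probability (1-p)^k, so p = 1 - q^(1/k), where q is
    the fraction of reference k-mers retained in the mutated sequence."""
    ref_kmers = kmer_set(ref, k)
    q = len(ref_kmers & kmer_set(mutated, k)) / len(ref_kmers)
    return 1.0 - q ** (1.0 / k)
```

Indels both destroy k-mers and shift the surviving ones, which is why the paper derives separate closed-form estimators for the substitution, deletion, and insertion rates rather than using this single relation.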

Cite as

Mahmudur Rahman Hera, Paul Medvedev, David Koslicki, and Antonio Blanca. Estimation of Substitution and Indel Rates via k-mer Statistics. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 16:1-16:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{rahmanhera_et_al:LIPIcs.WABI.2025.16,
  author =	{Rahman Hera, Mahmudur and Medvedev, Paul and Koslicki, David and Blanca, Antonio},
  title =	{{Estimation of Substitution and Indel Rates via k-mer Statistics}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{16:1--16:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.16},
  URN =		{urn:nbn:de:0030-drops-239422},
  doi =		{10.4230/LIPIcs.WABI.2025.16},
  annote =	{Keywords: k-mers, mutation rate, indel, alignment-free, estimation, substitution, insertion, deletion}
}
Document
An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT

Authors: Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang


Abstract
String matching problems in bioinformatics typically involve finding exact substring matches between a query and a reference text. Previous formulations often focus on maximum exact matches (MEMs). However, multiple occurrences of substrings of the query in the text that are long enough but not maximal may not be captured by MEMs. Such long matches can be informative, especially when the text is a collection of similar sequences such as genomes. In this paper, we describe a new type of match between a pattern and a text that is not necessarily maximal in the query but still contains useful matching information: locally maximal exact matches (LEMs). There is usually a large number of LEMs, so we consider only those above some length threshold ℒ. These are referred to as long LEMs. The purpose of long LEMs is to capture substring matches between a query and a text that are not necessarily maximal in the pattern but still long enough to be important. Therefore, efficient algorithms for finding long LEMs are desired for such datasets. However, these datasets are too large to query on traditional string indexes. Fortunately, these datasets are very repetitive. Recently, compressed string indexes that take advantage of the redundancy in the data but retain efficient querying capability have been proposed as a solution. We therefore give an efficient algorithm for computing all the long LEMs of a query and a text in a BWT runs compressed string index. We describe an O(m+occ) expected time algorithm that relies on an O(r)-word string index for outputting all long LEMs of a pattern with respect to a text given the matching statistics of the pattern with respect to the text. Here m is the length of the query, occ is the number of long LEMs outputted, and r is the number of runs in the BWT of the text. The O(r)-word string index we describe relies on an adaptation of the move data structure by Nishimoto and Tabei.
We are able to support LCP[i] queries in constant time given SA[i]. In other words, we answer PLCP[i] queries in constant time. These PLCP queries enable the efficient long LEM query. Long LEMs may provide useful similarity information between a pattern and a text that MEMs may ignore. This information is particularly useful in pangenome and biobank scale haplotype panel contexts.

Cite as

Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang. An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 17:1-17:25, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{sanaullah_et_al:LIPIcs.WABI.2025.17,
  author =	{Sanaullah, Ahsan and Zhi, Degui and Zhang, Shaojie},
  title =	{{An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{17:1--17:25},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.17},
  URN =		{urn:nbn:de:0030-drops-239433},
  doi =		{10.4230/LIPIcs.WABI.2025.17},
  annote =	{Keywords: BWT, LEM, Long LEM, MEM, Run Length Compressed BWT, Move Data Structure, Pangenome}
}
Document
Identifying Breakpoint Median Genomes: A Branching Algorithm Approach

Authors: Poly H. da Silva, Arash Jamshidpey, and David Sankoff


Abstract
Genome comparison often involves quantifying dissimilarities between genomes with identical gene sets, commonly using breakpoints - points where adjacent genes in one genome are not adjacent in another. The concept of a median genome, used for comparison of multiple genomes, aims to find a genome that minimizes the total distance to all genomes in a given set. While median genomes are useful for extracting common genomic information and estimating ancestral traits, the existence of multiple divergent medians raises concerns about their accuracy in reflecting the true ancestor. The median problem is known to be NP-hard, particularly for unichromosomal genomes, and solving it becomes increasingly challenging under different genome distance models. In this work, we introduce a novel branching algorithm to efficiently find all breakpoint medians of k linear unichromosomal genomes, represented as unsigned permutations. This algorithm constructs a rooted labeled tree, where the sequence of labels along each complete ray defines a genome, providing a structured and efficient way to explore the space of candidate medians by narrowing the search to a well-defined and significantly smaller subset of the permutation space. We validate our approach with experiments on randomly generated sets of three permutations. The results show that our method successfully finds the exact medians and also identifies many near-optimal approximations. Our experiments further show that most medians lie relatively close to the input permutations, in agreement with prior theoretical results.
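A breakpoint, as used above, is an adjacency of one genome that is absent from the other. A minimal sketch of the pairwise breakpoint distance for unsigned unichromosomal genomes, treating adjacencies as unordered pairs and omitting telomere caps for simplicity; this illustrates the distance being minimized, not the authors' branching algorithm for medians:

```python
def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 absent from g2 (unsigned genomes as
    permutations; {a, b} matches both orders, so a reversed genome is
    at distance 0). Telomere caps are omitted in this simple version."""
    adjacencies2 = {frozenset(p) for p in zip(g2, g2[1:])}
    return sum(1 for p in zip(g1, g1[1:]) if frozenset(p) not in adjacencies2)
```

A breakpoint median of genomes g1, ..., gk is then a permutation minimizing the sum of this distance to all k inputs, the quantity the branching algorithm searches for.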

Cite as

Poly H. da Silva, Arash Jamshidpey, and David Sankoff. Identifying Breakpoint Median Genomes: A Branching Algorithm Approach. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 18:1-18:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{dasilva_et_al:LIPIcs.WABI.2025.18,
  author =	{da Silva, Poly H. and Jamshidpey, Arash and Sankoff, David},
  title =	{{Identifying Breakpoint Median Genomes: A Branching Algorithm Approach}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{18:1--18:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.18},
  URN =		{urn:nbn:de:0030-drops-239447},
  doi =		{10.4230/LIPIcs.WABI.2025.18},
  annote =	{Keywords: Breakpoint distance, median genomes, phylogeny reconstruction, random permutations}
}
Document
Which Phylogenetic Networks Are Level-k Networks with Additional Arcs? Structure and Algorithms

Authors: Takatora Suzuki and Momoko Hayamizu


Abstract
Reticulate evolution gives rise to complex phylogenetic networks, making their interpretation challenging. A typical approach is to extract trees within such networks. Since Francis and Steel’s seminal paper, "Which Phylogenetic Networks are Merely Trees with Additional Arcs?" (2015), tree-based phylogenetic networks and their support trees (spanning trees with the same root and leaf-set as a given network) have been extensively studied. However, not all phylogenetic networks are tree-based, and for the study of reticulate evolution, it is often more biologically relevant to identify support networks rather than trees. This study generalizes Hayamizu’s structure theorem, which yielded optimal algorithms for various computational problems on support trees of rooted almost-binary phylogenetic networks, to extend the theoretical framework for support trees to support networks. This allows us to obtain a direct-product characterization of each of three sets: all, minimal, and minimum support networks, for a given network. Each characterization yields optimal algorithms for counting and generating the support networks of each type. Applications include a linear-time algorithm for finding a support network with the fewest reticulations (i.e., the minimum tier). We also provide exact and heuristic algorithms for finding a support network with the minimum level, both running in exponential time but practical across a reasonably wide range of reticulation numbers.

Cite as

Takatora Suzuki and Momoko Hayamizu. Which Phylogenetic Networks Are Level-k Networks with Additional Arcs? Structure and Algorithms. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 19:1-19:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{suzuki_et_al:LIPIcs.WABI.2025.19,
  author =	{Suzuki, Takatora and Hayamizu, Momoko},
  title =	{{Which Phylogenetic Networks Are Level-k Networks with Additional Arcs? Structure and Algorithms}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{19:1--19:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.19},
  URN =		{urn:nbn:de:0030-drops-239454},
  doi =		{10.4230/LIPIcs.WABI.2025.19},
  annote =	{Keywords: Phylogenetic networks, Support networks, Level-k networks, Tier-k networks, Structure theorem, Enumeration, Optimization}
}
Document
A k-mer-Based Estimator of the Substitution Rate Between Repetitive Sequences

Authors: Haonan Wu, Antonio Blanca, and Paul Medvedev


Abstract
K-mer-based analysis of genomic data is ubiquitous, but the presence of repetitive k-mers continues to pose problems for the accuracy of many methods. For example, the Mash tool (Ondov et al. 2016) can accurately estimate the substitution rate between two low-repetitive sequences from their k-mer sketches; however, it is inaccurate on repetitive sequences such as the centromere of a human chromosome. Follow-up work by Blanca et al. (2021) has attempted to model how mutations affect k-mer sets based on strong assumptions that the sequence is non-repetitive and that mutations do not create spurious k-mer matches. However, the theoretical foundations for extending an estimator like Mash to work in the presence of repeat sequences have been lacking. In this work, we relax the non-repetitive assumption and propose a novel estimator for the mutation rate. We derive theoretical bounds on our estimator’s bias. Our experiments show that it remains accurate for repetitive genomic sequences, such as the alpha satellite higher order repeats in centromeres. We demonstrate our estimator’s robustness across diverse datasets and various ranges of the substitution rate and k-mer size. Finally, we show how sketching can be used to avoid dealing with large k-mer sets while retaining accuracy. Our software is available at https://github.com/medvedevgroup/Repeat-Aware_Substitution_Rate_Estimator.

Cite as

Haonan Wu, Antonio Blanca, and Paul Medvedev. A k-mer-Based Estimator of the Substitution Rate Between Repetitive Sequences. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 20:1-20:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)



@InProceedings{wu_et_al:LIPIcs.WABI.2025.20,
  author =	{Wu, Haonan and Blanca, Antonio and Medvedev, Paul},
  title =	{{A k-mer-Based Estimator of the Substitution Rate Between Repetitive Sequences}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{20:1--20:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.20},
  URN =		{urn:nbn:de:0030-drops-239465},
  doi =		{10.4230/LIPIcs.WABI.2025.20},
  annote =	{Keywords: k-mers, sketching, mutation rates}
}
Document
Linear-Space Subquadratic-Time String Alignment Algorithm for Arbitrary Scoring Matrices

Authors: Ryosuke Yamano and Tetsuo Shibuya


Abstract
Theoretically, the fastest algorithm by Crochemore et al. for computing the alignment of two given strings of size n over a constant alphabet takes O(n²/log n) time. The algorithm uses Lempel–Ziv parsing to divide the dynamic programming matrix into blocks and utilizes the repetitive structure. It is the only previously known subquadratic-time algorithm that can handle scoring matrices of arbitrary weights. However, this algorithm takes O(n²/log n) space, and reducing the space while preserving the time complexity has been an open problem for more than 20 years. We present a solution to this issue by achieving an O(n) space algorithm that maintains O(n²/log n) time. The classical refinement by Hirschberg reduces the space complexity of the textbook O(n²) algorithm to O(n) while preserving the quadratic time. However, applying this technique to the algorithm of Crochemore et al. has been considered challenging because their method requires O(n² / log n) space even when computing only the alignment score. Our modification enables the application of Hirschberg’s refinement, allowing traceback computation in O(n) space while preserving the O(n² / log n) overall time complexity. Our algorithm can be applied to both global and local string alignment problems.
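The linear-space score computation that Hirschberg's refinement builds on can be sketched as follows. This is the textbook O(n²)-time, O(n)-space dynamic program with an arbitrary scoring matrix, not the O(n²/log n) algorithm of Crochemore et al. or the paper's modification; the function name and the dict-based scoring-matrix representation are illustrative.

```python
def align_score_linear_space(x, y, score, gap):
    """Global alignment (Needleman-Wunsch) score using only two DP rows,
    i.e. O(min-side) space instead of a full O(n^2) table.
    `score` maps character pairs to arbitrary weights; `gap` is the
    (typically negative) per-character gap penalty."""
    prev = [j * gap for j in range(len(y) + 1)]  # DP row for i - 1
    for i in range(1, len(x) + 1):
        cur = [i * gap] + [0] * len(y)           # DP row for i
        for j in range(1, len(y) + 1):
            cur[j] = max(prev[j - 1] + score[x[i - 1], y[j - 1]],  # (mis)match
                         prev[j] + gap,                            # gap in y
                         cur[j - 1] + gap)                         # gap in x
        prev = cur
    return prev[-1]
```

Hirschberg's refinement recovers the alignment itself (not just the score) in linear space by running this forward pass and a symmetric backward pass to find a midpoint of the optimal path, then recursing on the two halves.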

Cite as

Ryosuke Yamano and Tetsuo Shibuya. Linear-Space Subquadratic-Time String Alignment Algorithm for Arbitrary Scoring Matrices. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 21:1-21:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{yamano_et_al:LIPIcs.WABI.2025.21,
  author =	{Yamano, Ryosuke and Shibuya, Tetsuo},
  title =	{{Linear-Space Subquadratic-Time String Alignment Algorithm for Arbitrary Scoring Matrices}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{21:1--21:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.21},
  URN =		{urn:nbn:de:0030-drops-239479},
  doi =		{10.4230/LIPIcs.WABI.2025.21},
  annote =	{Keywords: String alignment, dynamic programming, linear space algorithms}
}
Document
Design of Worst-Case-Optimal Spaced Seeds

Authors: Jens Zentgraf and Sven Rahmann


Abstract
Read mapping (and alignment) is a fundamental problem in biological sequence analysis. For speed and computational efficiency, many popular read mappers tolerate only a few differences between the read and the corresponding part of the reference genome, which leads to reference bias: Reads with too many differences are not guaranteed to be mapped correctly or at all, because to even consider a genomic position, a sufficiently long exact match (seed) must exist. While pangenomes and their graph-based representations provide one way to avoid reference bias by enlarging the reference, we explore an orthogonal approach and consider stronger substitution-tolerant primitives, namely spaced seeds or gapped k-mers. Given two integers k ≤ w, one considers k selected positions, described by a mask, from each length-w window in a sequence. In the existing literature, masks with certain probabilistic guarantees have been designed for small values of k. Here, for the first time, we take a combinatorial approach from a worst-case perspective. For any mask, using integer linear programs, we find least favorable distributions of sequence changes in two different senses: (1) minimizing the number of unchanged windows; (2) minimizing the number of positions covered by unchanged windows. Then, among all masks or all symmetric masks of a given shape (k,w), we find the set of best masks that maximize these minima. As a result, we obtain robust masks, even for large numbers of changes. We illustrate the properties of these masks by constructing a challenging set of reads that contain many approximately equidistributed substitutions (but no indels) that many existing tools cannot map, even though they are in principle easily mappable (apart from the large number of changes) because they originate from selected non-repetitive regions of the human reference genome. 
We observe that the majority of these reads can be mapped with a simple alignment-free approach using the chosen spaced masks, whereas seeding approaches based on contiguous k-mers fail.
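The spaced-seed primitive itself can be sketched in a few lines: a mask selects k positions from each length-w window, so a substitution at an unselected position leaves that window's seed intact. This is only an illustration of the gapped k-mer extraction step; the mask and function names are hypothetical, not those of the paper.

```python
def spaced_seeds(seq, mask):
    """Extract gapped k-mers: for each length-w window of seq, keep the
    characters at the mask's selected positions ('1' = keep, '0' = skip).
    Substitutions at '0' positions do not change the extracted seed,
    which is what makes spaced seeds substitution-tolerant."""
    w = len(mask)
    keep = [i for i, c in enumerate(mask) if c == "1"]
    return [
        "".join(seq[i + p] for p in keep)
        for i in range(len(seq) - w + 1)
    ]
```

For example, with mask "101" a substitution at the middle position of a window changes the window but not its seed, so that window still matches the reference.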

Cite as

Jens Zentgraf and Sven Rahmann. Design of Worst-Case-Optimal Spaced Seeds. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 22:1-22:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{zentgraf_et_al:LIPIcs.WABI.2025.22,
  author =	{Zentgraf, Jens and Rahmann, Sven},
  title =	{{Design of Worst-Case-Optimal Spaced Seeds}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{22:1--22:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.22},
  URN =		{urn:nbn:de:0030-drops-239488},
  doi =		{10.4230/LIPIcs.WABI.2025.22},
  annote =	{Keywords: Spaced seed, Gapped k-mer, Integer linear program (ILP), Worst-case design, Reference bias}
}
Document
Extended Abstract
Partitioned Multi-MUM Finding for Scalable Pangenomics (Extended Abstract)

Authors: Vikram S. Shivakumar and Ben Langmead


Abstract
Pangenome collections continue to grow and proliferate to hundreds of high-quality genomes, for example, the expanded v2 version of the Human Pangenome Reference Consortium (HPRC) dataset spanning 474 human haplotypes [Liao et al., 2023]. As the size and complexity of these collections grow, it is increasingly important that our methods for studying and indexing pangenomes be scalable and updateable. Maximal Unique Matches (multi-MUMs), exact substring matches present exactly once in all sequences in a pangenome collection, represent conserved anchor sequences that can comprise a common coordinate system. We previously proposed a framework and tool called Mumemto for rapidly identifying multi-MUMs during construction of a compressed pangenome index [Shivakumar and Langmead, 2025]. Using prefix-free parsing (PFP) [Boucher et al., 2019], a compressed-space method for computing full-text indexes, Mumemto outperforms existing methods for identifying multi-MUMs. However, two drawbacks remain: updateability and scalability. Mumemto can become memory-intensive for large pangenomes (> 300 human genomes), and as newly assembled genomes are added to a pangenome collection, Mumemto requires re-running on the entire updated collection. To address this, we developed a partition-merging approach to compute multi-MUMs with Mumemto. We introduce two strategies for merging multi-MUMs computed across different collections (see Figure 1), enabling parallelization across partitions and simple computation of multi-MUMs for incrementally-updated collections. The first strategy requires a common sequence in each partition (which we call "anchor-based merging"), which serves as a coordinate system to identify multi-MUM overlaps between partitions. By tracking the next longest match for all multi-MUMs and unique matches (UMs) in an auxiliary data structure, intersections between matches can be filtered out if they are no longer unique in the union collection.
The second strategy identifies overlaps directly from the multi-MUM substrings (called "string-based merging"). The overlaps are identified by running Mumemto over the extracted multi-MUM sequences and are similarly filtered out if they are too short to be considered unique. Lastly, we propose an extension to anchor-based merging to enable the computation of partial multi-MUMs, present in only a subset of sequences in the union set. The partition-merging framework introduces a tradeoff space in Mumemto between running time and memory, depending on partition size and the number of threads. Running parallel, per-partition Mumemto processes and merging the results reduces the running time but increases the peak memory footprint, while running a single Mumemto thread over each partition serially yields a longer running time but a smaller memory footprint. To evaluate this tradeoff, we computed multi-MUMs across 474 haplotypes of chr19 from the HPRC v2 dataset [Liao et al., 2023] and 69 assemblies of A. thaliana [Lian et al., 2024] (Table 1). The string-based method also enables merging multi-MUMs between disjoint collections, for example subclades in a phylogenetic tree. By merging multi-MUMs along the shape of the tree, we can compute matches at internal nodes of the tree along with the root, revealing clade-specific conservation and structural variation. Multi-MUM merging also enables interspecific match computation, which was previously infeasible with Mumemto due to high memory usage for highly-diverse input sequence collections. We used partition-merging to compute multi-MUMs across 29 primate assemblies and found a correspondence to ultraconserved elements previously found across mammalian genomes [Cummins et al., 2024]. We show that a partitioned Mumemto enables scalability to growing pangenome collections and expands the applicability of Mumemto to larger, more diverse datasets.
As a result, Mumemto is the only method capable of computing exact matches across the entire HPRC v2 dataset (474 haplotypes [Liao et al., 2023]), and can easily incorporate future releases of assemblies without recomputation. This increases the scope for exploration of genomic conservation and variation and highlights the potential for Mumemto as a core method for future pangenomics and comparative genomics research. The partitioned Mumemto framework is implemented in v1.3.0 and is available open-source at https://github.com/vikshiv/mumemto.
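For intuition about the objects being merged, a multi-MUM can be characterized by a brute-force reference implementation on small inputs: a substring occurring exactly once in every sequence, kept only if it is not contained in a longer such substring. This sketch is illustrative only (names are hypothetical); Mumemto itself computes multi-MUMs in compressed space via prefix-free parsing rather than by enumeration.

```python
def multi_mums(seqs, min_len=1):
    """Brute-force multi-MUMs: substrings that occur exactly once in every
    sequence, filtered to maximal ones. Exponentially slower than Mumemto's
    PFP-based approach; usable only as a tiny reference on toy inputs."""
    first = seqs[0]

    def count(s, sub):
        # Count (possibly overlapping) occurrences of sub in s.
        return sum(1 for i in range(len(s) - len(sub) + 1)
                   if s.startswith(sub, i))

    candidates = set()
    n = len(first)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            sub = first[i:j]
            if all(count(s, sub) == 1 for s in seqs):
                candidates.add(sub)
    # Keep only maximal matches: since each candidate is unique in every
    # sequence, a candidate contained in a longer candidate is extendable.
    return sorted(m for m in candidates
                  if not any(m != o and m in o for o in candidates))
```

The partition-merging strategies described above start from per-partition outputs of this kind and discard matches that lose uniqueness (or maximality) once the partitions are combined.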

Cite as

Vikram S. Shivakumar and Ben Langmead. Partitioned Multi-MUM Finding for Scalable Pangenomics (Extended Abstract). In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 23:1-23:4, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{shivakumar_et_al:LIPIcs.WABI.2025.23,
  author =	{Shivakumar, Vikram S. and Langmead, Ben},
  title =	{{Partitioned Multi-MUM Finding for Scalable Pangenomics}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{23:1--23:4},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.23},
  URN =		{urn:nbn:de:0030-drops-239490},
  doi =		{10.4230/LIPIcs.WABI.2025.23},
  annote =	{Keywords: Pangenomics, Comparative genomics, Compressed indexing}
}