74 Search Results for "Boucher, Christina"


Volume

LIPIcs, Volume 242

22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)

WABI 2022, September 5-7, 2022, Potsdam, Germany

Editors: Christina Boucher and Sven Rahmann

Document
Time-Optimal Construction of String Synchronizing Sets

Authors: Jonas Ellert and Tomasz Kociumaka

Published in: LIPIcs, Volume 364, 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)


Abstract
A powerful design principle behind many modern string algorithms is local consistency: breaking the symmetry between string positions based on their small contexts so that matching fragments are handled consistently. Among the most influential instantiations of this principle are string synchronizing sets [Kempa & Kociumaka; STOC 2019]. A τ-synchronizing set of a string of length n is a set of O(n/τ) string positions, chosen using their length-2τ contexts, such that (outside of highly periodic regions) every block of τ consecutive positions contains at least one element of the set. Synchronizing sets have found dozens of applications in diverse settings, from quantum and dynamic algorithms to fully compressed computation. In the classic word RAM model, particularly for strings over small alphabets, they enabled faster solutions to core problems in data compression, text indexing, and string similarity. In this work, we show that any string T ∈ [0 .. σ)ⁿ can be preprocessed in O(n log σ / log n) time so that, for any given integer τ ∈ [1 .. n], a τ-synchronizing set of T can be constructed in O((n log τ)/(τ log n)) time. Both bounds are optimal in the word RAM model with machine word size w = Θ(log n), matching the information-theoretic minimum for the input and output sizes, respectively. Previously, constructing a τ-synchronizing set required O(n/τ) time after an O(n)-time preprocessing [Kociumaka, Radoszewski, Rytter, and Waleń; SICOMP 2024], or, in the restricted regime of τ < 0.2 log_σ n, without any preprocessing needed [Kempa & Kociumaka; STOC 2019]. A simple instantiation of our method outputs the synchronizing set as a sorted list in O(n/τ) time, or as a bitmask in O(n/log n) time. Our optimal construction produces a compact fully indexable dictionary, supporting select queries in O(1) time and rank queries in O(log ((log τ)/(log log n))) time. The latter complexity matches known unconditional cell-probe lower bounds for τ ≤ n^{1-Ω(1)}. To achieve this, we introduce a general framework for efficiently processing sparse integer sequences via a custom variable-length encoding. We also augment the optimal variant of van Emde Boas trees [Pătraşcu & Thorup; STOC 2006] with a deterministic linear-time construction. When the set is represented as a bitmask under our sparse encoding, the same guarantees for select and rank queries hold after preprocessing in time proportional to the size of our encoding (in words).

Cite as

Jonas Ellert and Tomasz Kociumaka. Time-Optimal Construction of String Synchronizing Sets. In 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026). Leibniz International Proceedings in Informatics (LIPIcs), Volume 364, pp. 36:1-36:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)


Copy BibTex To Clipboard

@InProceedings{ellert_et_al:LIPIcs.STACS.2026.36,
  author =	{Ellert, Jonas and Kociumaka, Tomasz},
  title =	{{Time-Optimal Construction of String Synchronizing Sets}},
  booktitle =	{43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)},
  pages =	{36:1--36:22},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-412-3},
  ISSN =	{1868-8969},
  year =	{2026},
  volume =	{364},
  editor =	{Mahajan, Meena and Manea, Florin and McIver, Annabelle and Thắng, Nguy\~{ê}n Kim},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2026.36},
  URN =		{urn:nbn:de:0030-drops-255258},
  doi =		{10.4230/LIPIcs.STACS.2026.36},
  annote =	{Keywords: synchronizing sets, local consistency, packed strings}
}
Document
Fast Computation of k-Runs, Parameterized Squares, and Other Generalised Squares

Authors: Yuto Nakashima, Jakub Radoszewski, and Tomasz Waleń

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)


Abstract
A k-mismatch square is a string of the form XY where X and Y are two equal-length strings that have at most k mismatches. Kolpakov and Kucherov [Theor. Comput. Sci., 2003] defined two notions of k-mismatch repeats, called k-repetitions and k-runs, each representing a sequence of consecutive k-mismatch squares of equal length. They proposed algorithms for computing k-repetitions and k-runs working in 𝒪(nklog k+output) time for a string of length n over an integer alphabet, where output is the number of the reported repeats. We show that output = 𝒪(nk log k), both in case of k-repetitions and k-runs, which implies that the complexity of their algorithms is actually 𝒪(nk log k). We apply this result to computing parameterized squares. A parameterized square is a string of the form XY such that X and Y parameterized-match, i.e., there exists a bijection f on the alphabet such that f(X) = Y. Two parameterized squares XY and X'Y' are equivalent if they parameterized match. Recently Hamai et al. [SPIRE 2024] showed that a string of length n over an alphabet of size σ contains less than nσ non-equivalent parameterized squares, improving an earlier bound by Kociumaka et al. [Theor. Comput. Sci., 2016]. We apply our bound for k-mismatch repeats to propose an algorithm that reports all non-equivalent parameterized squares in 𝒪(nσ log σ) time. We also show that the number of non-equivalent parameterized squares can be computed in 𝒪(n log n) time. This last algorithm applies to squares under any substring compatible equivalence relation and also to counting squares that are distinct as strings. In particular, this improves upon the 𝒪(nσ)-time algorithm of Gawrychowski et al. [CPM 2023] for counting order-preserving squares that are distinct as strings if σ = ω(log n).

Cite as

Yuto Nakashima, Jakub Radoszewski, and Tomasz Waleń. Fast Computation of k-Runs, Parameterized Squares, and Other Generalised Squares. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 8:1-8:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{nakashima_et_al:LIPIcs.ESA.2025.8,
  author =	{Nakashima, Yuto and Radoszewski, Jakub and Wale\'{n}, Tomasz},
  title =	{{Fast Computation of k-Runs, Parameterized Squares, and Other Generalised Squares}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{8:1--8:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.8},
  URN =		{urn:nbn:de:0030-drops-244768},
  doi =		{10.4230/LIPIcs.ESA.2025.8},
  annote =	{Keywords: string algorithm, k-mismatch square, parameterized square, order-preserving square, maximum gapped repeat}
}
Document
Efficiency of Learned Indexes on Genome Spectra

Authors: Md. Hasin Abrar, Paul Medvedev, and Giorgio Vinciguerra

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)


Abstract
Data structures on a multiset of genomic k-mers are at the heart of many bioinformatic tools. As genomic datasets grow in scale, the efficiency of these data structures increasingly depends on how well they leverage the inherent patterns in the data. One recent and effective approach is the use of learned indexes that approximate the rank function of a multiset using a piecewise linear function with very few segments. However, theoretical worst-case analysis struggles to predict the practical performance of these indexes. We address this limitation by developing a novel measure of piecewise-linear approximability of the data, called CaPLa (Canonical Piecewise Linear approximability). CaPLa builds on the empirical observation that a power-law model often serves as a reasonable proxy for piecewise linear-approximability, while explicitly accounting for deviations from a true power-law fit. We prove basic properties of CaPLa and present an efficient algorithm to compute it. We then demonstrate that CaPLa can accurately predict space bounds for data structures on real data. Empirically, we analyze over 500 genomes through the lens of CaPLa, revealing that it varies widely across the tree of life and even within individual genomes. Finally, we study the robustness of CaPLa as a measure and the factors that make genomic k-mer multisets different from random ones.

Cite as

Md. Hasin Abrar, Paul Medvedev, and Giorgio Vinciguerra. Efficiency of Learned Indexes on Genome Spectra. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 18:1-18:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{abrar_et_al:LIPIcs.ESA.2025.18,
  author =	{Abrar, Md. Hasin and Medvedev, Paul and Vinciguerra, Giorgio},
  title =	{{Efficiency of Learned Indexes on Genome Spectra}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{18:1--18:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.18},
  URN =		{urn:nbn:de:0030-drops-244865},
  doi =		{10.4230/LIPIcs.ESA.2025.18},
  annote =	{Keywords: Genome spectra, piecewise linear approximation, learned index, k-mers}
}
Document
Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars

Authors: Jannik Olbrich

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)


Abstract
The Burrows-Wheeler Transform (BWT) serves as the basis for many important sequence indexes. On very large datasets (e.g. genomic databases), classical BWT construction algorithms are often infeasible because they usually need to have the entire dataset in main memory. Fortunately, such large datasets are often highly repetitive. It can thus be beneficial to compute the BWT from a compressed representation. We propose an algorithm for computing the BWT via the Lyndon straight-line program, a grammar based on the standard factorization of Lyndon words. Our algorithm can also be used to compute the extended BWT (eBWT) of a multiset of sequences. We empirically evaluate our implementation and find that we can compute the BWT and eBWT of very large datasets faster and/or with less memory than competing methods.

Cite as

Jannik Olbrich. Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 60:1-60:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{olbrich:LIPIcs.ESA.2025.60,
  author =	{Olbrich, Jannik},
  title =	{{Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{60:1--60:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.60},
  URN =		{urn:nbn:de:0030-drops-245286},
  doi =		{10.4230/LIPIcs.ESA.2025.60},
  annote =	{Keywords: Burrows-Wheeler Transform, Grammar compression}
}
Document
MorphisHash: Improving Space Efficiency of ShockHash for Minimal Perfect Hashing

Authors: Stefan Hermann

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)


Abstract
A minimal perfect hash function (MPHF) maps a set of n keys to unique positions {1, …, n}. Representing an MPHF requires at least log₂(e)≈ 1.443 bits per key. ShockHash is a technique to construct an MPHF and requires just slightly more space. It gives each key two random candidate positions. If each key can be mapped to one of its two candidate positions such that there is exactly one key mapped to each position, then an MPHF is found. If not, ShockHash repeats the process with a new set of random candidate positions. ShockHash has to store how many repetitions were required and for each key to which of the two candidate positions it is mapped. However, when a given set of candidate positions can be used as MPHF then there is not only one but multiple ways of mapping the keys to one of their candidate positions such that the mapping results in an MPHF. This redundancy makes up for the majority of the remaining space overhead in ShockHash. In this paper, we present MorphisHash which almost completely eliminates this redundancy. Our theoretical result is that MorphisHash saves Θ(ln(n)) bits in expectation compared to ShockHash. This corresponds to a factor of 20 less space overhead in practice. Just like ShockHash, MorphisHash can be used as a building block within RecSplit to obtain MorphisHash-RS. When compared for same space consumption, MorphisHash-RS can be constructed up to 21 times faster than ShockHash-RS. The technique to accomplish this might be of a more general interest to compress data structures.

Cite as

Stefan Hermann. MorphisHash: Improving Space Efficiency of ShockHash for Minimal Perfect Hashing. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 9:1-9:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{hermann:LIPIcs.ESA.2025.9,
  author =	{Hermann, Stefan},
  title =	{{MorphisHash: Improving Space Efficiency of ShockHash for Minimal Perfect Hashing}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{9:1--9:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.9},
  URN =		{urn:nbn:de:0030-drops-244779},
  doi =		{10.4230/LIPIcs.ESA.2025.9},
  annote =	{Keywords: compressed data structure, perfect hashing, random graph, pseudoforest, component}
}
Document
An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT

Authors: Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)


Abstract
String matching problems in bioinformatics are typically for finding exact substring matches between a query and a reference text. Previous formulations often focus on maximum exact matches (MEMs). However, multiple occurrences of substrings of the query in the text that are long enough but not maximal may not be captured by MEMs. Such long matches can be informative, especially when the text is a collection of similar sequences such as genomes. In this paper, we describe a new type of match between a pattern and a text that aren't necessarily maximal in the query, but still contain useful matching information: locally maximal exact matches (LEMs). There are usually a large amount of LEMs, so we only consider those above some length threshold ℒ. These are referred to as long LEMs. The purpose of long LEMs is to capture substring matches between a query and a text that are not necessarily maximal in the pattern but still long enough to be important. Therefore efficient long LEMs finding algorithms are desired for these datasets. However, these datasets are too large to query on traditional string indexes. Fortunately, these datasets are very repetitive. Recently, compressed string indexes that take advantage of the redundancy in the data but retain efficient querying capability have been proposed as a solution. We therefore give an efficient algorithm for computing all the long LEMs of a query and a text in a BWT runs compressed string index. We describe an O(m+occ) expected time algorithm that relies on an O(r) words space string index for outputting all long LEMs of a pattern with respect to a text given the matching statistics of the pattern with respect to the text. Here m is the length of the query, occ is the number of long LEMs outputted, and r is the number of runs in the BWT of the text. The O(r) space string index we describe relies on an adaptation of the move data structure by Nishimoto and Tabei. We are able to support LCP[i] queries in constant time given SA[i]. In other words, we answer PLCP[i] queries in constant time. These PLCP queries enable the efficient long LEM query. Long LEMs may provide useful similarity information between a pattern and a text that MEMs may ignore. This information is particularly useful in pangenome and biobank scale haplotype panel contexts.

Cite as

Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang. An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 17:1-17:25, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{sanaullah_et_al:LIPIcs.WABI.2025.17,
  author =	{Sanaullah, Ahsan and Zhi, Degui and Zhang, Shaojie},
  title =	{{An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{17:1--17:25},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.17},
  URN =		{urn:nbn:de:0030-drops-239433},
  doi =		{10.4230/LIPIcs.WABI.2025.17},
  annote =	{Keywords: BWT, LEM, Long LEM, MEM, Run Length Compressed BWT, Move Data Structure, Pangenome}
}
Document
Design of Worst-Case-Optimal Spaced Seeds

Authors: Jens Zentgraf and Sven Rahmann

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)


Abstract
Read mapping (and alignment) is a fundamental problem in biological sequence analysis. For speed and computational efficiency, many popular read mappers tolerate only a few differences between the read and the corresponding part of the reference genome, which leads to reference bias: Reads with too many differences are not guaranteed to be mapped correctly or at all, because to even consider a genomic position, a sufficiently long exact match (seed) must exist. While pangenomes and their graph-based representations provide one way to avoid reference bias by enlarging the reference, we explore an orthogonal approach and consider stronger substitution-tolerant primitives, namely spaced seeds or gapped k-mers. Given two integers k ≤ w, one considers k selected positions, described by a mask, from each length-w window in a sequence. In the existing literature, masks with certain probabilistic guarantees have been designed for small values of k. Here, for the first time, we take a combinatorial approach from a worst-case perspective. For any mask, using integer linear programs, we find least favorable distributions of sequence changes in two different senses: (1) minimizing the number of unchanged windows; (2) minimizing the number of positions covered by unchanged windows. Then, among all masks or all symmetric masks of a given shape (k,w), we find the set of best masks that maximize these minima. As a result, we obtain robust masks, even for large numbers of changes. We illustrate the properties of these masks by constructing a challenging set of reads that contain many approximately equidistributed substitutions (but no indels) that many existing tools cannot map, even though they are in principle easily mappable (apart from the large number of changes) because they originate from selected non-repetitive regions of the human reference genome. We observe that the majority of these reads can be mapped with a simple alignment-free approach using chosen spaced masks, where seeding approaches based on contiguous k-mers fail.

Cite as

Jens Zentgraf and Sven Rahmann. Design of Worst-Case-Optimal Spaced Seeds. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 22:1-22:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{zentgraf_et_al:LIPIcs.WABI.2025.22,
  author =	{Zentgraf, Jens and Rahmann, Sven},
  title =	{{Design of Worst-Case-Optimal Spaced Seeds}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{22:1--22:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.22},
  URN =		{urn:nbn:de:0030-drops-239488},
  doi =		{10.4230/LIPIcs.WABI.2025.22},
  annote =	{Keywords: Spaced seed, Gapped k-mer, Integer linear program (ILP), Worst-case design, Reference bias}
}
Document
Extended Abstract
Partitioned Multi-MUM Finding for Scalable Pangenomics (Extended Abstract)

Authors: Vikram S. Shivakumar and Ben Langmead

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)


Abstract
Pangenome collections continue to grow and proliferate to hundreds of high-quality genomes, for example, the expanded v2 version of the Human Pangenome Reference Consortium (HPRC) dataset spanning 474 human haplotypes [Liao et al., 2023]. As the size and complexity of these collections grow, it is increasingly important that our methods for studying and indexing pangenomes be scalable and updateable. Maximal Unique Matches (multi-MUMs), exact substring matches present exactly once in all sequences in a pangenome collection, represent conserved anchor sequences that can comprise a common coordinate system. We previously proposed a framework and tool called Mumemto for rapidly identifying multi-MUMs during construction of a compressed pangenome index [Shivakumar and Langmead, 2025]. Using prefix-free parsing (PFP) [Boucher et al., 2019], a compressed-space method for computing full-text indexes, Mumemto outperforms existing methods for identifying multi-MUMs. However, one drawback remains updateability and scalability. Mumemto can become memory-intensive for large pangenomes (> 300 human genomes), and as newly assembled genomes are added to a pangenome collection, Mumemto requires re-running on the entire updated collection. To address this, we developed a partition-merging approach to compute multi-MUMs with Mumemto. We introduce two strategies for merging of multi-MUMs computed across different collections (see Figure 1), enabling parallelization across partitions and simple computation of multi-MUMs for incrementally-updated collections. The first strategy requires a common sequence in each partition (which we call "anchor-based merging"), which serves as a coordinate system to identify multi-MUM overlaps between partitions. By tracking the next longest match for all multi-MUMs and unique matches (UMs) in an auxiliary data structure, intersections between matches can be filtered out if no longer unique in the union collection. The second strategy identifies overlaps directly from the multi-MUM substrings (called "string-based merging"). The overlaps are identified by running Mumemto over the extracted multi-MUM sequences and are similarly filtered out if they are too short to considered unique. Lastly, we propose an extension to anchor-based merging to enable the computation of partial multi-MUMs, present in only a subset of sequences in the union set. The partition-merging framework introduces a tradeoff space in Mumemto between running time and memory, depending on partition size and the number of threads. Running parallel, per-partition Mumemto processes and merging the results reduces the running time but increases the peak memory footprint, while running a single Mumemto thread over each partition serially yields longer running time but a smaller memory footprint. To evaluate this tradeoff, we computed multi-MUMs across 474 haplotypes of chr19 from the HPRC v2 dataset [Liao et al., 2023] and 69 assemblies of A. thaliana [Lian et al., 2024] (Table 1). The string-based method also enables merging multi-MUMs between disjoint collections, for example subclades in a phylogenetic tree. By merging multi-MUMs along the shape of the tree, we can compute matches at internal nodes of the tree along with the root, revealing clade-specific conservation and structural variation. Multi-MUM merging also enables interspecific match computation, which was previously infeasible with Mumemto due to high memory usage for highly-diverse input sequence collections. We use partition-merging to compute multi-MUMs across 29 primate assemblies, and found a correspondence to ultraconserved elements previously found across mammalian genomes [Cummins et al., 2024]. We show that a partitioned Mumemto enables scalability to growing pangenome collections and expands the applicability of Mumemto to larger, more diverse datasets. As a result, Mumemto is the only method capable of computing exact matches across the entire HPRC v2 dataset (474 haplotypes [Liao et al., 2023]), and can easily incorporate future releases of assemblies without recomputation. This increases the scope for exploration of genomic conservation and variation and highlights the potential for Mumemto as a core method for future pangenomics and comparative genomics research. The partitioned Mumemto framework is implemented in v1.3.0 and is available open-source at https://github.com/vikshiv/mumemto.

Cite as

Vikram S. Shivakumar and Ben Langmead. Partitioned Multi-MUM Finding for Scalable Pangenomics (Extended Abstract). In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 23:1-23:4, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{shivakumar_et_al:LIPIcs.WABI.2025.23,
  author =	{Shivakumar, Vikram S. and Langmead, Ben},
  title =	{{Partitioned Multi-MUM Finding for Scalable Pangenomics}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{23:1--23:4},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.23},
  URN =		{urn:nbn:de:0030-drops-239490},
  doi =		{10.4230/LIPIcs.WABI.2025.23},
  annote =	{Keywords: Pangenomics, Comparative genomics, Compressed indexing}
}
Document
Invited Talk
We Are What We Index; a Primer for the Wheeler Graph Era (Invited Talk)

Authors: Ben Langmead

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)


Abstract
Since the arrival of second-generation sequencing, we have needed to build indexes over reference sequences - e.g. genomes and transcriptomes - in order to solve read alignment and classification problems efficiently [Langmead et al., 2009; Li and Durbin, 2009; Li et al., 2009]. The rule has been: what we can index determines what we can do. When indexing strings, we can use methods like suffix arrays [Manber and Myers, 1993], the Burrows-Wheeler Transform (BWT) [Burrows and Wheeler, 1994] / FM Index [Ferragina and Manzini, 2000], or k-mer indexes [Marchet et al., 2021]. What if we want to index objects more complex than strings? A pangenome, for example, is a large collection of similar strings, e.g. the hundreds of assemblies that make up the Human Pangenome Reference [Liao et al., 2023] or all the bacteria in the Refseq database [Goldfarb et al., 2025]. We may wish to combine these strings into a multiple sequence alignment (MSA) or a graph first. Can we index those efficiently? In many useful cases the answer is "yes," but in others the answer is "no." The story of how we learned exactly when the answer is "yes" versus "no" unfolded through a sequence of insights. Here we review this story, eventually arriving at the definition of Wheeler graphs as discovered and formalized by Gagie, Manzini and Sirén [Gagie et al., 2017]. We will focus on indexes based on the BWT, since these (a) are lossless full-text indexes, (b) are widely used in practice [Langmead et al., 2009; Li and Durbin, 2009], and (c) form the theoretical throughline for all the indexing strategies on the path to Wheeler graphs. We will trace the BWT-based indexing story from the early days of the FM Index, though its step-by-step gobbling up of trees (XBW-transform [Ferragina et al., 2005]) and de Bruijn Graphs (BOSS representation [Bowe et al., 2012]), and to the eventual formalization of Wheeler graphs [Gagie et al., 2017]. Along the way, we will define and update our notions of what it means to track a consecutive range of elements in the structure, and what it means for an index to be efficient. We will also connect these notions to automata [Sipser, 1996], noting how the indexability of Wheeler graphs (also called Wheeler automata) is connected to the mechanics of how to efficiently represent and simulate a finite automaton [Alanko et al., 2021]. With this context, we can imagine improved indexes for the future of genomics and pangenomics. De Bruijn are extremely practical and are the most widely used among the non-string data structures that are also Wheeler graphs. But we might prefer other options. For example, de Bruijn graphs have the undesirable property that they usually encode not only the true longer-than-k substrings of the original text, but also "false" substrings that span repeats. Related to this, paths through the de Bruijn graph can "glue" substrings together that are horizontally distant in the MSA. Could other Wheeler graphs be practical alternatives to de Bruijn graphs? For instance, the original GCSA study by Sirén, Välimäki and Mäkinen proposed a way to convert a multiple alignment into an automaton that either is a Wheeler graph or can be made into one [Sirén et al., 2014]. This warrants further exploration, possibly with the help of improved tools for solving the NP-complete problem of recognizing whether a graph is a Wheeler graph [Chao et al., 2023]. The notion of BWT tunnels [Baier, 2018] gives another route: we can begin with a concatenated pangenome strings and compress it by identifying and collapsing BWT tunnels. This yields a Wheeler graph that is compressed like the de Bruijn graph, but without departing from the exact contents or coordinate systems of the original genomes. The future might need us to explore all these Wheeler-graph indexes, along with the also highly practical and always-improving world of indexes buiover collections of strings [Gagie et al., 2018].

Cite as

Ben Langmead. We Are What We Index; a Primer for the Wheeler Graph Era (Invited Talk). In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 2:1-2:2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{langmead:LIPIcs.WABI.2025.2,
  author =	{Langmead, Ben},
  title =	{{We Are What We Index; a Primer for the Wheeler Graph Era}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{2:1--2:2},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.2},
  URN =		{urn:nbn:de:0030-drops-239288},
  doi =		{10.4230/LIPIcs.WABI.2025.2},
  annote =	{Keywords: Indexing, Burrows-Wheeler Transform}
}
Document
Invited Talk
Recursive Parsing and Grammar Compression in the Era of Pangenomics (Invited Talk)

Authors: Christina Boucher

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)


Abstract
Prefix-Free Parsing (PFP) and its recursive variant (RPFP) provide a scalable framework for compressing and indexing large genomic datasets. By enabling efficient construction of succinct data structures, these methods support fast and memory-efficient read alignment across thousands of genomes. Their deterministic and modular design makes them especially well-suited for pangenomics and large-scale sequence analysis.

Cite as

Christina Boucher. Recursive Parsing and Grammar Compression in the Era of Pangenomics (Invited Talk). In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 1:1-1:2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{boucher:LIPIcs.WABI.2025.1,
  author =	{Boucher, Christina},
  title =	{{Recursive Parsing and Grammar Compression in the Era of Pangenomics}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{1:1--1:2},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.1},
  URN =		{urn:nbn:de:0030-drops-239278},
  doi =		{10.4230/LIPIcs.WABI.2025.1},
  annote =	{Keywords: Prefix-Free Parsing, Recursive Prefix-Free Parsing, Grammar-Based Compression, Succinct Data Structures, RePair Compression}
}
Document
Research
Faster Run-Length Compressed Suffix Arrays

Authors: Nathaniel K. Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Marinella Sciortino

Published in: OASIcs, Volume 132, From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday (2025)


Abstract
We first review how we can store a run-length compressed suffix array (RLCSA) for a text T of length n over an alphabet of size σ whose Burrows-Wheeler Transform (BWT) consists of r runs in O (r log (n / r) + r log σ + σ) bits such that later, given character a and the suffix-array (SA) interval for P, we can find the SA interval for a P in O (log r_a + log log n) time, where r_a is the number of runs of copies of a in the BWT. We then show how to modify the RLCSA such that we find the SA interval for a P in only O (log r_a) time, without increasing its asymptotic space bound. Our key idea is applying a result by Nishimoto and Tabei (ICALP 2021) and then replacing rank queries on sparse bitvectors by a constant number of select queries. We also review two-level indexing and discuss how our faster RLCSA may be useful in improving it. Finally, we briefly discuss how two-level indexing may speed up a recent heuristic for finding maximal exact matches of a pattern with respect to an indexed text.

Cite as

Nathaniel K. Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Marinella Sciortino. Faster Run-Length Compressed Suffix Arrays. In From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 132, pp. 10:1-10:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{brown_et_al:OASIcs.Grossi.10,
  author =	{Brown, Nathaniel K. and Gagie, Travis and Manzini, Giovanni and Navarro, Gonzalo and Sciortino, Marinella},
  title =	{{Faster Run-Length Compressed Suffix Arrays}},
  booktitle =	{From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday},
  pages =	{10:1--10:15},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-391-1},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{132},
  editor =	{Conte, Alessio and Marino, Andrea and Rosone, Giovanna and Vitter, Jeffrey Scott},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Grossi.10},
  URN =		{urn:nbn:de:0030-drops-238095},
  doi =		{10.4230/OASIcs.Grossi.10},
  annote =	{Keywords: Run-length compressed suffix arrays, interpolative coding, two-level indexing}
}
Document
Research
Subsequence-Based Indices for Genome Sequence Analysis

Authors: Giovanni Buzzega, Alessio Conte, Veronica Guerrini, Giulia Punzi, Giovanna Rosone, and Lorenzo Tattini

Published in: OASIcs, Volume 132, From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday (2025)


Abstract
Compact indices are a fundamental tool in string analysis, even more so in bioinformatics, where genomic sequences can reach billions in length. This paper presents some recent results in which Roberto Grossi has been involved, showing how some of these indices do more than just efficiently represent data, but rather are able to bring out salient information within it, which can be exploited for their downstream analysis. Specifically, we first review a recently-introduced method [Guerrini et al., 2023] that employs the Burrows-Wheeler Transform to build reasonably accurate phylogenetic trees in an assembly-free scenario. We then describe a recent practical tool [Buzzega et al., 2025] for indexing Maximal Common Subsequences between strings, which can enable analysis of genomic sequence similarity. Experimentally, we show that the results produced by the one index are consistent with the expectations about the results of the other index.

Cite as

Giovanni Buzzega, Alessio Conte, Veronica Guerrini, Giulia Punzi, Giovanna Rosone, and Lorenzo Tattini. Subsequence-Based Indices for Genome Sequence Analysis. In From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 132, pp. 20:1-20:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{buzzega_et_al:OASIcs.Grossi.20,
  author =	{Buzzega, Giovanni and Conte, Alessio and Guerrini, Veronica and Punzi, Giulia and Rosone, Giovanna and Tattini, Lorenzo},
  title =	{{Subsequence-Based Indices for Genome Sequence Analysis}},
  booktitle =	{From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday},
  pages =	{20:1--20:21},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-391-1},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{132},
  editor =	{Conte, Alessio and Marino, Andrea and Rosone, Giovanna and Vitter, Jeffrey Scott},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Grossi.20},
  URN =		{urn:nbn:de:0030-drops-238199},
  doi =		{10.4230/OASIcs.Grossi.20},
  annote =	{Keywords: String Indices, Burrows-Wheeler Transform, Maximal Common Subsequences, Sequence Analysis, Phylogeny}
}
Document
BWT for String Collections

Authors: Davide Cenzato, Zsuzsanna Lipták, Nadia Pisanti, Giovanna Rosone, and Marinella Sciortino

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)


Abstract
We survey the different methods used for extending the BWT to collections of strings, following largely [Cenzato and Lipták, CPM 2022, Bioinformatics 2024]. We analyze the specific aspects and combinatorial properties of the resulting BWT variants and give a categorization of publicly available tools for computing the BWT of string collections. We show how the specific method used impacts on the resulting transform, including the number of runs, and on the dynamicity of the transform with respect to adding or removing strings from the collection. We then focus on the number of runs of these BWT variants and present the optimal BWT introduced in [Cenzato et al., DCC 2023], which implements an algorithm originally proposed by [Bentley et al., ESA 2020] to minimize the number of BWT-runs. We also discuss several recent heuristics and study their impact on the compression of biological sequences. We conclude with an overview of the applications and the impact of the BWT of string collections in bioinformatics.

Cite as

Davide Cenzato, Zsuzsanna Lipták, Nadia Pisanti, Giovanna Rosone, and Marinella Sciortino. BWT for String Collections. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 3:1-3:29, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{cenzato_et_al:OASIcs.Manzini.3,
  author =	{Cenzato, Davide and Lipt\'{a}k, Zsuzsanna and Pisanti, Nadia and Rosone, Giovanna and Sciortino, Marinella},
  title =	{{BWT for String Collections}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{3:1--3:29},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.3},
  URN =		{urn:nbn:de:0030-drops-239113},
  doi =		{10.4230/OASIcs.Manzini.3},
  annote =	{Keywords: Burrows-Wheeler transform, Extended Burrows-Wheeler transform, compressed text indexes, text compression, string collections, bioinformatics}
}
Document
BWT and Combinatorics on Words

Authors: Gabriele Fici, Sabrina Mantaci, Antonio Restivo, Giuseppe Romana, Giovanna Rosone, and Marinella Sciortino

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)


Abstract
The Burrows-Wheeler Transform (BWT) is a reversible transformation on words (strings) introduced in 1994 in the context of data compression, which is a permutation of the characters in the word. Its clustering effect, i.e., the remarkable property of grouping identical characters (BWT runs) when they share common contexts, has made it a powerful tool for boosting compression performances and enabling efficient pattern searching in highly repetitive string collections. In this chapter, we analyze the Burrows-Wheeler transform under the combinatorial point of view, and we survey known properties and connections with different aspects of combinatorics on words. In particular, we focus on the properties of words in relation to the number of their BWT runs. The value r, which counts the number of BWT runs, impacts both compression performance and indexing efficiency, and is considered a measure to evaluate the above-mentioned clustering effect and, consequently, the repetitiveness of a word. We give an overview of the results relating r to other combinatorial repetitiveness measures related to the factor complexity. The chapter also explores extremal cases of the clustering effect. Finally, some results on the sensitivity of the measure r are considered, where the effects of combinatorial operations are studied, such as reversal, edits, and the application of morphisms.

Cite as

Gabriele Fici, Sabrina Mantaci, Antonio Restivo, Giuseppe Romana, Giovanna Rosone, and Marinella Sciortino. BWT and Combinatorics on Words. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 1:1-1:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{fici_et_al:OASIcs.Manzini.1,
  author =	{Fici, Gabriele and Mantaci, Sabrina and Restivo, Antonio and Romana, Giuseppe and Rosone, Giovanna and Sciortino, Marinella},
  title =	{{BWT and Combinatorics on Words}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{1:1--1:23},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.1},
  URN =		{urn:nbn:de:0030-drops-239090},
  doi =		{10.4230/OASIcs.Manzini.1},
  annote =	{Keywords: Burrows-Wheeler Transform, Combinatorics on Words, Clustering Effect, BWT Runs}
}
  • Refine by Type
  • 72 Document/PDF
  • 35 Document/HTML
  • 1 Artifact
  • 1 Volume

  • Refine by Publication Year
  • 1 2026
  • 34 2025
  • 4 2024
  • 2 2023
  • 29 2022
  • Show More...

  • Refine by Author
  • 12 Boucher, Christina
  • 6 Gagie, Travis
  • 6 Rosone, Giovanna
  • 5 Langmead, Ben
  • 5 Rahmann, Sven
  • Show More...

  • Refine by Series/Journal
  • 59 LIPIcs
  • 13 OASIcs

  • Refine by Classification
  • 18 Applied computing → Bioinformatics
  • 13 Theory of computation → Data structures design and analysis
  • 12 Theory of computation → Design and analysis of algorithms
  • 11 Theory of computation → Pattern matching
  • 10 Theory of computation → Data compression
  • Show More...

  • Refine by Keyword
  • 10 Burrows-Wheeler Transform
  • 5 Burrows-Wheeler transform
  • 4 Comparative genomics
  • 3 FM-index
  • 3 k-mers
  • Show More...

Any Issues?
X

Feedback on the Current Page

CAPTCHA

Thanks for your feedback!

Feedback submitted to Dagstuhl Publishing

Could not send message

Please try again later or send an E-mail