33 Search Results for "Gagie, Travis"


Document
b-move: Faster Bidirectional Character Extensions in a Run-Length Compressed Index

Authors: Lore Depuydt, Luca Renders, Simon Van de Vyver, Lennart Veys, Travis Gagie, and Jan Fostier

Published in: LIPIcs, Volume 312, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024)


Abstract
Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.’s r-index and Nishimoto and Tabei’s move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.’s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns. We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient bidirectional character extensions in run-length compressed space. It achieves bidirectional character extensions up to 8 times faster than the br-index, closing the performance gap with FM-index-based alternatives, while maintaining the br-index’s favorable memory characteristics. For example, all available complete E. coli genomes on NCBI’s RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Thus, b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.

Cite as

Lore Depuydt, Luca Renders, Simon Van de Vyver, Lennart Veys, Travis Gagie, and Jan Fostier. b-move: Faster Bidirectional Character Extensions in a Run-Length Compressed Index. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 10:1-10:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{depuydt_et_al:LIPIcs.WABI.2024.10,
  author =	{Depuydt, Lore and Renders, Luca and Van de Vyver, Simon and Veys, Lennart and Gagie, Travis and Fostier, Jan},
  title =	{{b-move: Faster Bidirectional Character Extensions in a Run-Length Compressed Index}},
  booktitle =	{24th International Workshop on Algorithms in Bioinformatics (WABI 2024)},
  pages =	{10:1--10:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-340-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{312},
  editor =	{Pissis, Solon P. and Sung, Wing-Kin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.10},
  URN =		{urn:nbn:de:0030-drops-206546},
  doi =		{10.4230/LIPIcs.WABI.2024.10},
  annote =	{Keywords: Pan-genomics, FM-index, r-index, Move Structure, Bidirectional Search, Approximate Pattern Matching, Lossless Alignment, Cache Efficiency}
}
Document
Algorithms and Complexity for Path Covers of Temporal DAGs

Authors: Dibyayan Chakraborty, Antoine Dailly, Florent Foucaud, and Ralf Klasing

Published in: LIPIcs, Volume 306, 49th International Symposium on Mathematical Foundations of Computer Science (MFCS 2024)


Abstract
A path cover of a digraph is a collection of paths collectively containing its vertex set. A path cover with minimum cardinality for a directed acyclic graph can be found in polynomial time [Fulkerson, AMS'56; Cáceres et al., SODA'22]. Moreover, Dilworth’s celebrated theorem on chain coverings of partially ordered sets equivalently states that the minimum size of a path cover of a DAG is equal to the maximum size of a set of mutually unreachable vertices. In this paper, we examine how far these classic results can be extended to a dynamic setting. A temporal digraph has an arc set that changes over discrete time-steps; if the underlying digraph is acyclic, then it is a temporal DAG. A temporal path is a directed path in the underlying digraph, such that the time-steps of arcs are strictly increasing along the path. Two temporal paths are temporally disjoint if they do not occupy any vertex at the same time. A temporal path cover is a collection 𝒞 of temporal paths that covers all vertices, and 𝒞 is temporally disjoint if all its temporal paths are pairwise temporally disjoint. We study the computational complexities of the problems of finding a minimum-size temporal (disjoint) path cover (denoted as Temporal Path Cover and Temporally Disjoint Path Cover). On the negative side, we show that both Temporal Path Cover and Temporally Disjoint Path Cover are NP-hard even when the underlying DAG is planar, bipartite, subcubic, and there are only two arc-disjoint time-steps. Moreover, Temporally Disjoint Path Cover remains NP-hard even on temporal oriented trees. We also observe that natural temporal analogues of Dilworth’s theorem on these classes of temporal DAGs do not hold. In contrast, we show that Temporal Path Cover is polynomial-time solvable on temporal oriented trees by a reduction to Clique Cover for (static undirected) weakly chordal graphs (a subclass of perfect graphs for which Clique Cover admits an efficient algorithm). This highlights an interesting algorithmic difference between the two problems. Although it is NP-hard on temporal oriented trees, Temporally Disjoint Path Cover becomes polynomial-time solvable on temporal oriented lines and temporal rooted directed trees. Motivated by the hardness result on trees, we show that, in contrast, Temporal Path Cover admits an XP time algorithm with respect to parameter t_max + tw, where t_max is the maximum time-step and tw is the treewidth of the underlying static undirected graph; moreover, Temporally Disjoint Path Cover admits an FPT algorithm with respect to the same parameterization.

Cite as

Dibyayan Chakraborty, Antoine Dailly, Florent Foucaud, and Ralf Klasing. Algorithms and Complexity for Path Covers of Temporal DAGs. In 49th International Symposium on Mathematical Foundations of Computer Science (MFCS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 306, pp. 38:1-38:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{chakraborty_et_al:LIPIcs.MFCS.2024.38,
  author =	{Chakraborty, Dibyayan and Dailly, Antoine and Foucaud, Florent and Klasing, Ralf},
  title =	{{Algorithms and Complexity for Path Covers of Temporal DAGs}},
  booktitle =	{49th International Symposium on Mathematical Foundations of Computer Science (MFCS 2024)},
  pages =	{38:1--38:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-335-5},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{306},
  editor =	{Kr\'{a}lovi\v{c}, Rastislav and Ku\v{c}era, Anton{\'\i}n},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2024.38},
  URN =		{urn:nbn:de:0030-drops-205940},
  doi =		{10.4230/LIPIcs.MFCS.2024.38},
  annote =	{Keywords: Temporal Graphs, Dilworth’s Theorem, DAGs, Path Cover, Temporally Disjoint Paths, Algorithms, Oriented Trees, Treewidth}
}
Document
Move-r: Optimizing the r-index

Authors: Nico Bertram, Johannes Fischer, and Lukas Nalbach

Published in: LIPIcs, Volume 301, 22nd International Symposium on Experimental Algorithms (SEA 2024)


Abstract
We present a static text index called Move-r, which is a highly optimized version of the r-index ([Travis Gagie et al., 2020] Gagie et al., 2020) that encorporates recent theoretical developments of the move data structure ([Takaaki Nishimoto and Yasuo Tabei, 2021] Nishimoto and Tabei, 2021). The r-index is the method of choice for indexing highly repetitive texts, such as different versions of a text document or DNA from the same species, as it exploits the compressibilty of the underlying data. With Move-r, we can answer count- and locate queries 2-35 (typically 15) times as fast as with any other r-index supporting locate queries while being 0.8-2.5 (typically 2) times as large. A Move-r index can be constructed 0.9-2 (typically 2) times as fast while using 1/3-1 (typically 1/2) times as much space.

Cite as

Nico Bertram, Johannes Fischer, and Lukas Nalbach. Move-r: Optimizing the r-index. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 1:1-1:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{bertram_et_al:LIPIcs.SEA.2024.1,
  author =	{Bertram, Nico and Fischer, Johannes and Nalbach, Lukas},
  title =	{{Move-r: Optimizing the r-index}},
  booktitle =	{22nd International Symposium on Experimental Algorithms (SEA 2024)},
  pages =	{1:1--1:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-325-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{301},
  editor =	{Liberti, Leo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2024.1},
  URN =		{urn:nbn:de:0030-drops-203662},
  doi =		{10.4230/LIPIcs.SEA.2024.1},
  annote =	{Keywords: Compressed Text Index, Burrows-Wheeler Transform}
}
Document
Practical Minimum Path Cover

Authors: Manuel Cáceres, Brendan Mumey, Santeri Toivonen, and Alexandru I. Tomescu

Published in: LIPIcs, Volume 301, 22nd International Symposium on Experimental Algorithms (SEA 2024)


Abstract
Computing a minimum path cover (MPC) of a directed acyclic graph (DAG) is a fundamental problem with a myriad of applications, including reachability. Although it is known how to solve the problem by a simple reduction to minimum flow, recent theoretical advances exploit this idea to obtain algorithms parameterized by the number of paths of an MPC, known as the width. These results obtain fast [Mäkinen et al., TALG 2019] and even linear time [Cáceres et al., SODA 2022] algorithms in the small-width regime. In this paper, we present the first publicly available high-performance implementation of state-of-the-art MPC algorithms, including the parameterized approaches. Our experiments on random DAGs show that parameterized algorithms are orders-of-magnitude faster on dense graphs. Additionally, we present new fast pre-processing heuristics based on transitive edge sparsification. We show that our heuristics improve MPC-solvers by orders of magnitude.

Cite as

Manuel Cáceres, Brendan Mumey, Santeri Toivonen, and Alexandru I. Tomescu. Practical Minimum Path Cover. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 3:1-3:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{caceres_et_al:LIPIcs.SEA.2024.3,
  author =	{C\'{a}ceres, Manuel and Mumey, Brendan and Toivonen, Santeri and Tomescu, Alexandru I.},
  title =	{{Practical Minimum Path Cover}},
  booktitle =	{22nd International Symposium on Experimental Algorithms (SEA 2024)},
  pages =	{3:1--3:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-325-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{301},
  editor =	{Liberti, Leo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2024.3},
  URN =		{urn:nbn:de:0030-drops-203687},
  doi =		{10.4230/LIPIcs.SEA.2024.3},
  annote =	{Keywords: minimum path cover, directed acyclic graph, maximum flow, parameterized algorithms, edge sparsification, algorithm engineering}
}
Document
Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests

Authors: Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, and Gonzalo Navarro

Published in: LIPIcs, Volume 301, 22nd International Symposium on Experimental Algorithms (SEA 2024)


Abstract
For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can - build an augmented FM-index over the the genomes in the tree concatenated in left-to-right order; - for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM’s occurrences in those genomes; - find the minimum and maximum values stored in that interval; - take the lowest common ancestor (LCA) of the genomes containing the characters at those positions. This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation: - a KATKA kernel, which discards characters that are not in the first or last occurrence of any k_max-tuple, for a parameter k_max; - a minimizer digest; - a KATKA kernel of a minimizer digest. With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.

Cite as

Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, and Gonzalo Navarro. Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 10:1-10:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{draesslerova_et_al:LIPIcs.SEA.2024.10,
  author =	{Draesslerov\'{a}, Dominika and Ahmed, Omar and Gagie, Travis and Holub, Jan and Langmead, Ben and Manzini, Giovanni and Navarro, Gonzalo},
  title =	{{Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests}},
  booktitle =	{22nd International Symposium on Experimental Algorithms (SEA 2024)},
  pages =	{10:1--10:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-325-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{301},
  editor =	{Liberti, Leo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2024.10},
  URN =		{urn:nbn:de:0030-drops-203756},
  doi =		{10.4230/LIPIcs.SEA.2024.10},
  annote =	{Keywords: Taxonomic classification, metagenomics, KATKA, maximal exact matches, string kernels, minimizer digests}
}
Document
Accelerating ILP Solvers for Minimum Flow Decompositions Through Search Space and Dimensionality Reductions

Authors: Andreas Grigorjew, Fernando H. C. Dias, Andrea Cracco, Romeo Rizzi, and Alexandru I. Tomescu

Published in: LIPIcs, Volume 301, 22nd International Symposium on Experimental Algorithms (SEA 2024)


Abstract
Given a flow network, the Minimum Flow Decomposition (MFD) problem is finding the smallest possible set of weighted paths whose superposition equals the flow. It is a classical, strongly NP-hard problem that is proven to be useful in RNA transcript assembly and applications outside of Bioinformatics. We improve an existing ILP (Integer Linear Programming) model by Dias et al. [RECOMB 2022] for DAGs by decreasing the solver’s search space using solution safety and several other optimizations. This results in a significant speedup compared to the original ILP, of up to 34× on average on the hardest instances. Moreover, we show that our optimizations apply also to MFD problem variants, resulting in speedups that go up to 219× on the hardest instances. We also developed an ILP model of reduced dimensionality for an MFD variant in which the solution path weights are restricted to a given set. This model can find an optimal MFD solution for most instances, and overall, its accuracy significantly outperforms that of previous greedy algorithms while being up to an order of magnitude faster than our optimized ILP.

Cite as

Andreas Grigorjew, Fernando H. C. Dias, Andrea Cracco, Romeo Rizzi, and Alexandru I. Tomescu. Accelerating ILP Solvers for Minimum Flow Decompositions Through Search Space and Dimensionality Reductions. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 14:1-14:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{grigorjew_et_al:LIPIcs.SEA.2024.14,
  author =	{Grigorjew, Andreas and Dias, Fernando H. C. and Cracco, Andrea and Rizzi, Romeo and Tomescu, Alexandru I.},
  title =	{{Accelerating ILP Solvers for Minimum Flow Decompositions Through Search Space and Dimensionality Reductions}},
  booktitle =	{22nd International Symposium on Experimental Algorithms (SEA 2024)},
  pages =	{14:1--14:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-325-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{301},
  editor =	{Liberti, Leo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2024.14},
  URN =		{urn:nbn:de:0030-drops-203792},
  doi =		{10.4230/LIPIcs.SEA.2024.14},
  annote =	{Keywords: Flow decomposition, Integer Linear Programming, Safety, RNA-seq, RNA transcript assembly, isoform}
}
Document
Top-k Frequent Patterns in Streams and Parameterized-Space LZ Compression

Authors: Patrick Dinklage, Johannes Fischer, and Nicola Prezza

Published in: LIPIcs, Volume 301, 22nd International Symposium on Experimental Algorithms (SEA 2024)


Abstract
We present novel online approximations of the Lempel-Ziv 77 (LZ77) and Lempel-Ziv 78 (LZ78) compression schemes [Lempel & Ziv, 1977/1978] with parameterizable space usage based on estimating which k patterns occur the most frequently in the streamed input for parameter k. This new approach overcomes the issue of finding only local repetitions, which is a natural limitation of algorithms that compress using a sliding window or by partitioning the input into blocks. For this, we introduce the top-k trie, a summary for maintaining online the top-k frequent consecutive patterns in a stream of characters based on a combination of the Lempel-Ziv 78 compression scheme and the Misra-Gries algorithm for frequent item estimation in streams. Using straightforward encoding, our implementations yield compression ratios (output over input size) competitive with established general-purpose LZ-based compression utilities such as gzip or xz.

Cite as

Patrick Dinklage, Johannes Fischer, and Nicola Prezza. Top-k Frequent Patterns in Streams and Parameterized-Space LZ Compression. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 9:1-9:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{dinklage_et_al:LIPIcs.SEA.2024.9,
  author =	{Dinklage, Patrick and Fischer, Johannes and Prezza, Nicola},
  title =	{{Top-k Frequent Patterns in Streams and Parameterized-Space LZ Compression}},
  booktitle =	{22nd International Symposium on Experimental Algorithms (SEA 2024)},
  pages =	{9:1--9:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-325-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{301},
  editor =	{Liberti, Leo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2024.9},
  URN =		{urn:nbn:de:0030-drops-203748},
  doi =		{10.4230/LIPIcs.SEA.2024.9},
  annote =	{Keywords: compression, streaming, heavy hitters, algorithm engineering}
}
Document
Track A: Algorithms, Complexity and Games
Breaking a Barrier in Constructing Compact Indexes for Parameterized Pattern Matching

Authors: Kento Iseri, Tomohiro I, Diptarama Hendrian, Dominik Köppl, Ryo Yoshinaka, and Ayumi Shinohara

Published in: LIPIcs, Volume 297, 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024)


Abstract
A parameterized string (p-string) is a string over an alphabet (Σ_s ∪ Σ_p), where Σ_s and Σ_p are disjoint alphabets for static symbols (s-symbols) and for parameter symbols (p-symbols), respectively. Two p-strings x and y are said to parameterized match (p-match) if and only if x can be transformed into y by applying a bijection on Σ_p to every occurrence of p-symbols in x. The indexing problem for p-matching is to preprocess a p-string T of length n so that we can efficiently find the occurrences of substrings of T that p-match with a given pattern. Let σ_s and respectively σ_p be the numbers of distinct s-symbols and p-symbols that appear in T and σ = σ_s + σ_p. Extending the Burrows-Wheeler Transform (BWT) based index for exact string pattern matching, Ganguly et al. [SODA 2017] proposed parameterized BWTs (pBWTs) to design the first compact index for p-matching, and posed an open problem on how to construct the pBWT-based index in compact space, i.e., in O(n lg |Σ_s ∪ Σ_p|) bits of space. Hashimoto et al. [SPIRE 2022] showed how to construct the pBWT for T, under the assumption that Σ_s ∪ Σ_p = [0..O(σ)], in O(n lg σ) bits of space and O(n (σ_p lg n)/(lg lg n)) time in an online manner while reading the symbols of T from right to left. In this paper, we refine Hashimoto et al.’s algorithm to work in O(n lg σ) bits of space and O(n (lg σ_p lg n)/(lg lg n)) time in a more general assumption that Σ_s ∪ Σ_p = [0..n^{O(1)}]. Our result has an immediate application to constructing parameterized suffix arrays in O(n (lg σ_p lg n)/(lg lg n)) time and O(n lg σ) bits of working space. We also show that our data structure can support backward search, a core procedure of BWT-based indexes, at any stage of the online construction, making it the first compact index for p-matching that can be constructed in compact space and even in an online manner.

Cite as

Kento Iseri, Tomohiro I, Diptarama Hendrian, Dominik Köppl, Ryo Yoshinaka, and Ayumi Shinohara. Breaking a Barrier in Constructing Compact Indexes for Parameterized Pattern Matching. In 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 297, pp. 89:1-89:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{iseri_et_al:LIPIcs.ICALP.2024.89,
  author =	{Iseri, Kento and I, Tomohiro and Hendrian, Diptarama and K\"{o}ppl, Dominik and Yoshinaka, Ryo and Shinohara, Ayumi},
  title =	{{Breaking a Barrier in Constructing Compact Indexes for Parameterized Pattern Matching}},
  booktitle =	{51st International Colloquium on Automata, Languages, and Programming (ICALP 2024)},
  pages =	{89:1--89:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-322-5},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{297},
  editor =	{Bringmann, Karl and Grohe, Martin and Puppis, Gabriele and Svensson, Ola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2024.89},
  URN =		{urn:nbn:de:0030-drops-202324},
  doi =		{10.4230/LIPIcs.ICALP.2024.89},
  annote =	{Keywords: Index for parameterized pattern matching, Parameterized Burrows-Wheeler Transform, Online construction}
}
Document
Solving the Minimal Positional Substring Cover Problem in Sublinear Space

Authors: Paola Bonizzoni, Christina Boucher, Davide Cozzi, Travis Gagie, and Yuri Pirola

Published in: LIPIcs, Volume 296, 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)


Abstract
Within the field of haplotype analysis, the Positional Burrows-Wheeler Transform (PBWT) stands out as a key innovation, addressing numerous challenges in genomics. For example, Sanaullah et al. introduced a PBWT-based method that addresses the haplotype threading problem, which involves representing a query haplotype through a minimal set of substrings. To solve this problem using the PBWT data structure, they formulate the Minimal Positional Substring Cover (MPSC) problem, and then, subsequently present a solution for it. Additionally, they present and solve several variants of this problem: k-MPSC, leftmost MPSC, rightmost MPSC, and length-maximal MPSC. Yet, a full PBWT is required for each of their solutions, which yields a significant memory usage requirement. Here, we take advantage of the latest results on run-length encoding the PBWT, to solve the MPSC in a sublinear amount of space. Our methods involve demonstrating that k-Set Maximal Exact Matches (k-SMEMs) can be computed in a sublinear amount of space via efficient computation of k-Matching Statistics (k-MS). This leads to a solution that requires sublinear space for, not only the MPSC problem, but for all its variations proposed by Sanaullah et al. Most importantly, we present experimental results on haplotype panels from the 1000 Genomes Project data that show the utility of these theoretical results. We conclusively demonstrate that our approach markedly decreases the memory required to solve the MPSC problem, achieving a reduction of at least two orders of magnitude compared to the method proposed by Sanaullah et al. This efficiency allows us to solve the problem on large versions of the problem, where other methods are unable to scale to. In summary, the creation of {μ}-PBWT paves the way for new possibilities in conducting in-depth genetic research and analysis on a large scale. All source code is publicly available at https://github.com/dlcgold/muPBWT/tree/k-smem.

Cite as

Paola Bonizzoni, Christina Boucher, Davide Cozzi, Travis Gagie, and Yuri Pirola. Solving the Minimal Positional Substring Cover Problem in Sublinear Space. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 12:1-12:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{bonizzoni_et_al:LIPIcs.CPM.2024.12,
  author =	{Bonizzoni, Paola and Boucher, Christina and Cozzi, Davide and Gagie, Travis and Pirola, Yuri},
  title =	{{Solving the Minimal Positional Substring Cover Problem in Sublinear Space}},
  booktitle =	{35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024)},
  pages =	{12:1--12:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-326-3},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{296},
  editor =	{Inenaga, Shunsuke and Puglisi, Simon J.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2024.12},
  URN =		{urn:nbn:de:0030-drops-201225},
  doi =		{10.4230/LIPIcs.CPM.2024.12},
  annote =	{Keywords: Positional Burrows-Wheeler Transform, r-index, minimal position substring cover, set-maximal exact matches}
}
Document
Acceleration of FM-Index Queries Through Prefix-Free Parsing

Authors: Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, and Travis Gagie

Published in: LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)


Abstract
FM-indexes are a crucial data structure in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [Ferragina and Fischer, 2007] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. Last year, Deng et al. [Deng et al., 2022] proposed parsing genomic data by induced suffix sorting, and showed the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing - which takes parameters that let us tune the average length of the phrases - instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38. And was consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it is very clear that our method accelerates the performance of count over all state-of-the-art methods with a minor increase in the memory.

Cite as

Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, and Travis Gagie. Acceleration of FM-Index Queries Through Prefix-Free Parsing. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 13:1-13:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


Copy BibTex To Clipboard

@InProceedings{hong_et_al:LIPIcs.WABI.2023.13,
  author =	{Hong, Aaron and Oliva, Marco and K\"{o}ppl, Dominik and Bannai, Hideo and Boucher, Christina and Gagie, Travis},
  title =	{{Acceleration of FM-Index Queries Through Prefix-Free Parsing}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{13:1--13:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.13},
  URN =		{urn:nbn:de:0030-drops-186390},
  doi =		{10.4230/LIPIcs.WABI.2023.13},
  annote =	{Keywords: FM-index, pangenomics, scalability, word-based indexing, random access}
}
Document
Merging Sorted Lists of Similar Strings

Authors: Gene Myers

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)


Abstract
Merging T sorted, non-redundant lists containing M elements into a single sorted, non-redundant result of size N ≥ M/T is a classic problem typically solved practically in O(M log T) time with a priority-queue data structure the most basic of which is the simple heap. We revisit this problem in the situation where the list elements are strings and the lists contain many identical or nearly identical elements. By keeping simple auxiliary information with each heap node, we devise an O(M log T+S) worst-case method that performs no more character comparisons than the sum of the lengths of all the strings S, and another O(M log (T/e¯)+S) method that becomes progressively more efficient as a function of the fraction of equal elements e¯ = M/N between input lists, reaching linear time when the lists are all identical. The methods perform favorably in practice versus an alternate formulation based on a trie.

Cite as

Gene Myers. Merging Sorted Lists of Similar Strings. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 22:1-22:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


Copy BibTex To Clipboard

@InProceedings{myers:LIPIcs.CPM.2023.22,
  author =	{Myers, Gene},
  title =	{{Merging Sorted Lists of Similar Strings}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{22:1--22:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.22},
  URN =		{urn:nbn:de:0030-drops-179763},
  doi =		{10.4230/LIPIcs.CPM.2023.22},
  annote =	{Keywords: heap, trie, longest common prefix}
}
Document
MONI Can Find k-MEMs

Authors: Igor Tatarnikov, Ardavan Shahrabi Farahani, Sana Kashgouli, and Travis Gagie

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)


Abstract
Suppose we are asked to index a text T [0..n - 1] such that, given a pattern P [0..m - 1], we can quickly report the maximal substrings of P that each occur in T at least k times. We first show how we can add O (r log n) bits to Rossi et al.’s recent MONI index, where r is the number of runs in the Burrows-Wheeler Transform of T, such that it supports such queries in O (k m log n) time. We then show how, if we are given k at construction time, we can reduce the query time to O (m log n).

Cite as

Igor Tatarnikov, Ardavan Shahrabi Farahani, Sana Kashgouli, and Travis Gagie. MONI Can Find k-MEMs. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 26:1-26:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


Copy BibTex To Clipboard

@InProceedings{tatarnikov_et_al:LIPIcs.CPM.2023.26,
  author =	{Tatarnikov, Igor and Shahrabi Farahani, Ardavan and Kashgouli, Sana and Gagie, Travis},
  title =	{{MONI Can Find k-MEMs}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{26:1--26:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.26},
  URN =		{urn:nbn:de:0030-drops-179802},
  doi =		{10.4230/LIPIcs.CPM.2023.26},
  annote =	{Keywords: Compact data structures, Burrows-Wheeler Transform, run-length compression, maximal exact matches}
}
Document
Simple Worst-Case Optimal Adaptive Prefix-Free Coding

Authors: Travis Gagie

Published in: LIPIcs, Volume 244, 30th Annual European Symposium on Algorithms (ESA 2022)


Abstract
We give a new and simple worst-case optimal algorithm for adaptive prefix-free coding that matches Gagie and Nekrich’s (2009) bounds except for lower-order terms, and uses no data structures more complicated than a lookup table.

Cite as

Travis Gagie. Simple Worst-Case Optimal Adaptive Prefix-Free Coding. In 30th Annual European Symposium on Algorithms (ESA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 244, pp. 57:1-57:5, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)


Copy BibTex To Clipboard

@InProceedings{gagie:LIPIcs.ESA.2022.57,
  author =	{Gagie, Travis},
  title =	{{Simple Worst-Case Optimal Adaptive Prefix-Free Coding}},
  booktitle =	{30th Annual European Symposium on Algorithms (ESA 2022)},
  pages =	{57:1--57:5},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-247-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{244},
  editor =	{Chechik, Shiri and Navarro, Gonzalo and Rotenberg, Eva and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2022.57},
  URN =		{urn:nbn:de:0030-drops-169959},
  doi =		{10.4230/LIPIcs.ESA.2022.57},
  annote =	{Keywords: Adaptive prefix-free coding, Shannon coding, Lookup tables}
}
Document
Prefix-Free Parsing for Building Large Tunnelled Wheeler Graphs

Authors: Adrián Goga and Andrej Baláž

Published in: LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)


Abstract
We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019). Wheeler graphs are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation which can be further compacted by employing the idea of tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths called blocks that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e. the one that minimizes the size of the resulting Wheeler graph, is known to be NP-complete and remains the most computationally challenging part of the tunnelling process. To find an adequate set of blocks in less time, we propose a new method based on the prefix-free parsing (PFP). The idea of PFP is to divide the input text into phrases of roughly equal sizes that overlap by a fixed number of characters. The phrases are then sorted lexicographically. The original text is represented by a sequence of phrase ranks (the parse) and a list of all used phrases (the dictionary). In repetitive texts, the PFP representation of the text is generally much shorter than the original since individual phrases are used many times in the parse, thus reducing the size of the dictionary. To speed up the block selection for tunnelling, we apply the PFP to obtain the parse and the dictionary of the original text, tunnel the Wheeler graph of the parse using existing heuristics and subsequently use this tunnelled parse to construct a compact Wheeler graph of the original text. Compared with constructing a Wheeler graph from the original text without PFP, our method is much faster and uses less memory on collections of pangenomic sequences. Therefore, our method enables the use of Wheeler graphs as a pangenomic reference for real-world pangenomic datasets.

Cite as

Adrián Goga and Andrej Baláž. Prefix-Free Parsing for Building Large Tunnelled Wheeler Graphs. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 18:1-18:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)


Copy BibTex To Clipboard

@InProceedings{goga_et_al:LIPIcs.WABI.2022.18,
  author =	{Goga, Adri\'{a}n and Bal\'{a}\v{z}, Andrej},
  title =	{{Prefix-Free Parsing for Building Large Tunnelled Wheeler Graphs}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{18:1--18:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.18},
  URN =		{urn:nbn:de:0030-drops-170529},
  doi =		{10.4230/LIPIcs.WABI.2022.18},
  annote =	{Keywords: Wheeler graphs, BWT tunnelling, prefix-free parsing, pangenomic graphs}
}
Document
Pangenomic Genotyping with the Marker Array

Authors: Taher Mun, Naga Sai Kavya Vaddadi, and Ben Langmead

Published in: LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)


Abstract
We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect from large panels like the 1000 Genomes Project while avoiding the reference bias that results when aligning to a single linear reference. rowbowt can infer accurate genotypes in less time and memory compared to existing graph-based methods.

Cite as

Taher Mun, Naga Sai Kavya Vaddadi, and Ben Langmead. Pangenomic Genotyping with the Marker Array. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 19:1-19:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)


Copy BibTex To Clipboard

@InProceedings{mun_et_al:LIPIcs.WABI.2022.19,
  author =	{Mun, Taher and Vaddadi, Naga Sai Kavya and Langmead, Ben},
  title =	{{Pangenomic Genotyping with the Marker Array}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{19:1--19:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.19},
  URN =		{urn:nbn:de:0030-drops-170530},
  doi =		{10.4230/LIPIcs.WABI.2022.19},
  annote =	{Keywords: Sequence alignment indexing genotyping}
}
  • Refine by Author
  • 18 Gagie, Travis
  • 6 Manzini, Giovanni
  • 6 Navarro, Gonzalo
  • 3 Boucher, Christina
  • 3 Rossi, Massimiliano
  • Show More...

  • Refine by Classification
  • 9 Theory of computation → Pattern matching
  • 5 Theory of computation → Data compression
  • 3 Theory of computation → Data structures design and analysis
  • 3 Theory of computation → Design and analysis of algorithms
  • 2 Applied computing → Bioinformatics
  • Show More...

  • Refine by Keyword
  • 7 Burrows-Wheeler Transform
  • 4 r-index
  • 3 FM-index
  • 2 Compact data structures
  • 2 algorithm engineering
  • Show More...

  • Refine by Type
  • 33 document

  • Refine by Publication Year
  • 9 2024
  • 6 2022
  • 4 2017
  • 3 2018
  • 3 2020
  • Show More...