DROPS

Document

DOI: 10.4230/LIPIcs.WABI.2025.6

Fast Pseudoalignment Queries on Compressed Colored de Bruijn Graphs

Authors: Alessio Campanelli, Giulio Ermanno Pibiri, and Rob Patro

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

Motivation. Indexes for the colored de Bruijn graph (c-dBG) play a crucial role in computational biology by facilitating complex tasks such as read mapping and assembly. These indexes map k-mers (substrings of length k) appearing in a large collection of reference strings to the set of identifiers of the strings where they appear. These sets, colloquially referred to as color sets, tend to occupy large quantities of memory, especially for large pangenomes. Our previous work thus focused on leveraging the repetitiveness of the color sets to improve the space effectiveness of the resulting index. As a matter of fact, repetition-aware indexes can be up to one order of magnitude smaller on large pangenomes compared to indexes that do not exploit such repetitiveness. Such improved space effectiveness, on the other hand, imposes an overhead at query time when performing tasks such as pseudoalignment that require the collection and processing of multiple related color sets. Methods. In this paper, we show how to avoid this overhead. We devise novel query algorithms tailored for the specific repetition-aware representations adopted by the Fulgor index, a state-of-the-art c-dBG index, to significantly improve its pseudoalignment efficiency and without consuming additional space. Results. Our results indicate that with increasing redundancy in the pangenomes, the compression factor provided by the Fulgor index increases, while the relative query time actually reduces. For example, while the space of the Fulgor index improves by 2.5× with repetition-aware compression and its query time improves by 1.6× on a collection of 5,000 Salmonella Enterica genomes, these factors become (6.1×,2.8×) and (11.2×,3.2×) for 50,000 and 150,000 genomes respectively. For an even larger collection of 300,000 genomes, we obtained an index that is 22.3× smaller and 2.2× faster.

Cite as

Alessio Campanelli, Giulio Ermanno Pibiri, and Rob Patro. Fast Pseudoalignment Queries on Compressed Colored de Bruijn Graphs. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 6:1-6:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{campanelli_et_al:LIPIcs.WABI.2025.6,
  author =	{Campanelli, Alessio and Pibiri, Giulio Ermanno and Patro, Rob},
  title =	{{Fast Pseudoalignment Queries on Compressed Colored de Bruijn Graphs}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{6:1--6:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.6},
  URN =		{urn:nbn:de:0030-drops-239327},
  doi =		{10.4230/LIPIcs.WABI.2025.6},
  annote =	{Keywords: Colored de Bruijn graphs, Pseudoalignment, Repetition-aware compression}
}

Document

Research

DOI: 10.4230/OASIcs.Grossi.6

Compact Data Structures for Collections of Sets

Authors: Jarno N. Alanko, Philip Bille, Inge Li Gørtz, Gonzalo Navarro, and Simon J. Puglisi

Published in: OASIcs, Volume 132, From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday (2025)

Abstract

We define a new entropy measure L(𝒮), called the containment entropy, for a set 𝒮 of sets, which considers the fact that some sets can be contained in others. We show how to represent 𝒮 within space close to L(𝒮) so that any element of any set can be retrieved in logarithmic time. We extend the result to predecessor and successor queries and show how some common set operations can be implemented efficiently.

Cite as

Jarno N. Alanko, Philip Bille, Inge Li Gørtz, Gonzalo Navarro, and Simon J. Puglisi. Compact Data Structures for Collections of Sets. In From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 132, pp. 6:1-6:7, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{alanko_et_al:OASIcs.Grossi.6,
  author =	{Alanko, Jarno N. and Bille, Philip and G{\o}rtz, Inge Li and Navarro, Gonzalo and Puglisi, Simon J.},
  title =	{{Compact Data Structures for Collections of Sets}},
  booktitle =	{From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday},
  pages =	{6:1--6:7},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-391-1},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{132},
  editor =	{Conte, Alessio and Marino, Andrea and Rosone, Giovanna and Vitter, Jeffrey Scott},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Grossi.6},
  URN =		{urn:nbn:de:0030-drops-238051},
  doi =		{10.4230/OASIcs.Grossi.6},
  annote =	{Keywords: Compressed data structures, entropy of sets, data compression}
}

Document

DOI: 10.4230/OASIcs.Manzini.13

Graph Indexing Beyond Wheeler Graphs

Authors: Jarno N. Alanko, Elena Biagi, Massimo Equi, Veli Mäkinen, Simon J. Puglisi, Nicola Rizzo, Kunihiko Sadakane, and Jouni Sirén

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

After the discovery of the FM index, which linked the Burrows-Wheeler transform (BWT) to pattern matching on strings, several contemporaneous strands of research began on indexing more complex structures with the BWT, such as tries, finite languages, de Bruijn graphs, and aligned sequences. These directions can now be viewed as culminating in the theory of Wheeler Graphs, but sometimes they go beyond. This chapter reviews the significant body of "proto Wheeler Graph" indexes, many of which exploit characteristics of their specific case to outperform Wheeler graphs, especially in practice.

Cite as

Jarno N. Alanko, Elena Biagi, Massimo Equi, Veli Mäkinen, Simon J. Puglisi, Nicola Rizzo, Kunihiko Sadakane, and Jouni Sirén. Graph Indexing Beyond Wheeler Graphs. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 13:1-13:29, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{alanko_et_al:OASIcs.Manzini.13,
  author =	{Alanko, Jarno N. and Biagi, Elena and Equi, Massimo and M\"{a}kinen, Veli and Puglisi, Simon J. and Rizzo, Nicola and Sadakane, Kunihiko and Sir\'{e}n, Jouni},
  title =	{{Graph Indexing Beyond Wheeler Graphs}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{13:1--13:29},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.13},
  URN =		{urn:nbn:de:0030-drops-239215},
  doi =		{10.4230/OASIcs.Manzini.13},
  annote =	{Keywords: indexing, compression, compressed data structures, string algorithms, pattern matching}
}

Document

DOI: 10.4230/LIPIcs.SEA.2025.15

Succinct Rank Dictionaries Revisited

Authors: Saska Dönges and Simon J. Puglisi

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)

Abstract

We study data structures for representing sets of m elements drawn from the universe [0..n-1] that support access and rank queries. A classical approach to this problem, foundational to the fields of succinct and compact data structures, is to represent the set as a bitvector X of n bits, where X[i] = 1 iff i is a member of the set. Our particular focus in this paper is on structures taking log₂{n choose m} + o(n) bits, which stem from the so-called RRR bitvector scheme (Raman et al., ACM Trans. Alg., 2007). In RRR bitvectors, X is conceptually divided into n/b blocks of b bits each. A block containing c 1 bits is then encoded using log₂ b + log₂{b choose c} bits, where log b bits are used to encode c, and log₂{b choose c} bits are used to say which of the {b choose c} possible combinations the block represents. In all existing RRR implementations the code assigned to a block is its lexicographical rank amongst the {b choose c} combinations of its class. In this paper we explore alternative non-lexicographical assignments of codes to blocks. We show these approaches can lead to faster query times and offer relevant space-time trade-offs in practice compared to state-of-the-art implementations (Gog and Petri, Software, Prac. & Exp., 2014) from the Succinct Data Structures Library.

Cite as

Saska Dönges and Simon J. Puglisi. Succinct Rank Dictionaries Revisited. In 23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 15:1-15:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{donges_et_al:LIPIcs.SEA.2025.15,
  author =	{D\"{o}nges, Saska and Puglisi, Simon J.},
  title =	{{Succinct Rank Dictionaries Revisited}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{15:1--15:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.15},
  URN =		{urn:nbn:de:0030-drops-232530},
  doi =		{10.4230/LIPIcs.SEA.2025.15},
  annote =	{Keywords: data structures, data compression, succinct data structures, compressed data structures, weighted de Bruijn sequence, text indexing, string algorithms}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.19

The Trie Measure, Revisited

Authors: Jarno N. Alanko, Ruben Becker, Davide Cenzato, Travis Gagie, Sung-Hwan Kim, Bojana Kodric, and Nicola Prezza

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

In this paper, we study the following problem: given n subsets S₁, … , S_n of an integer universe U = {0,… , u-1}, having total cardinality N = ∑_{i = 1}ⁿ |S_i|, find a prefix-free encoding enc : U → {0,1}^+ minimizing the so-called trie measure, i.e., the total number of edges in the n binary tries T₁, … , T_n, where T_i is the trie packing the encoded integers {enc(x):x ∈ S_i}. We first observe that this problem is equivalent to that of merging u sets with the cheapest sequence of binary unions, a problem which in [Ghosh et al., ICDCS 2015] is shown to be NP-hard. Motivated by the hardness of the general problem, we focus on particular families of prefix-free encodings. We start by studying the fixed-length shifted encoding of [Gupta et al., Theoretical Computer Science 2007]. Given a parameter 0 ≤ a < u, this encoding sends each x ∈ U to (x + a) mod u, interpreted as a bit-string of log u bits. We develop the first efficient algorithms that find the value of a minimizing the trie measure when this encoding is used. Our two algorithms run in O(u + Nlog u) and O(Nlog² u) time, respectively. We proceed by studying ordered encodings (a.k.a. monotone or alphabetic), and describe an algorithm finding the optimal such encoding in O(N+u³) time. Within the same running time, we show how to compute the best shifted ordered encoding, provably no worse than both the optimal shifted and optimal ordered encodings. We provide implementations of our algorithms and discuss how these encodings perform in practice.

Cite as

Jarno N. Alanko, Ruben Becker, Davide Cenzato, Travis Gagie, Sung-Hwan Kim, Bojana Kodric, and Nicola Prezza. The Trie Measure, Revisited. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 19:1-19:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{alanko_et_al:LIPIcs.CPM.2025.19,
  author =	{Alanko, Jarno N. and Becker, Ruben and Cenzato, Davide and Gagie, Travis and Kim, Sung-Hwan and Kodric, Bojana and Prezza, Nicola},
  title =	{{The Trie Measure, Revisited}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{19:1--19:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.19},
  URN =		{urn:nbn:de:0030-drops-231135},
  doi =		{10.4230/LIPIcs.CPM.2025.19},
  annote =	{Keywords: Succinct data structures, degenerate strings, integer encoding}
}

Document

DOI: 10.4230/LIPIcs.SEA.2023.4

Subset Wavelet Trees

Authors: Jarno N. Alanko, Elena Biagi, Simon J. Puglisi, and Jaakko Vuohtoniemi

Published in: LIPIcs, Volume 265, 21st International Symposium on Experimental Algorithms (SEA 2023)

Abstract

Given an alphabet Σ of σ = |Σ| symbols, a degenerate (or indeterminate) string X is a sequence X = X[0],X[1]…, X[n-1] of n subsets of Σ. Since their introduction in the mid 70s, degenerate strings have been widely studied, with applications driven by their being a natural model for sequences in which there is a degree of uncertainty about the precise symbol at a given position, such as those arising in genomics and proteomics. In this paper we introduce a new data structural tool for degenerate strings, called the subset wavelet tree (SubsetWT). A SubsetWT supports two basic operations on degenerate strings: subset-rank(i,c), which returns the number of subsets up to the i-th subset in the degenerate string that contain the symbol c; and subset-select(i,c), which returns the index in the degenerate string of the i-th subset that contains symbol c. These queries are analogs of rank and select queries that have been widely studied for ordinary strings. Via experiments in a real genomics application in which degenerate strings are fundamental, we show that subset wavelet trees are practical data structures, and in particular offer an attractive space-time tradeoff. Along the way we investigate data structures for supporting (normal) rank queries on base-4 and base-3 sequences, which may be of independent interest. Our C++ implementations of the data structures are available at https://github.com/jnalanko/SubsetWT.

Cite as

Jarno N. Alanko, Elena Biagi, Simon J. Puglisi, and Jaakko Vuohtoniemi. Subset Wavelet Trees. In 21st International Symposium on Experimental Algorithms (SEA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 265, pp. 4:1-4:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{alanko_et_al:LIPIcs.SEA.2023.4,
  author =	{Alanko, Jarno N. and Biagi, Elena and Puglisi, Simon J. and Vuohtoniemi, Jaakko},
  title =	{{Subset Wavelet Trees}},
  booktitle =	{21st International Symposium on Experimental Algorithms (SEA 2023)},
  pages =	{4:1--4:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-279-2},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{265},
  editor =	{Georgiadis, Loukas},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2023.4},
  URN =		{urn:nbn:de:0030-drops-183549},
  doi =		{10.4230/LIPIcs.SEA.2023.4},
  annote =	{Keywords: degenerate strings, compressed data structures, succinct data structures, string processing, data structures, efficient algorithms}
}

6 Search Results for "Vuohtoniemi, Jaakko"

Fast Pseudoalignment Queries on Compressed Colored de Bruijn Graphs

Abstract

Cite as

Compact Data Structures for Collections of Sets

Abstract

Cite as

Graph Indexing Beyond Wheeler Graphs

Abstract

Cite as

Succinct Rank Dictionaries Revisited

Abstract

Cite as

The Trie Measure, Revisited

Abstract

Cite as

Subset Wavelet Trees

Abstract

Cite as

Thanks for your feedback!

Could not send message