Search Results

Documents authored by Pibiri, Giulio Ermanno


Artifact
Software
Fulgor

Authors: Giulio Ermanno Pibiri and Alessio Campanelli


Abstract

Cite as

Giulio Ermanno Pibiri, Alessio Campanelli. Fulgor (Software, Source Code). Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@misc{dagstuhl-artifact-24311,
   title = {{Fulgor}}, 
   author = {Pibiri, Giulio Ermanno and Campanelli, Alessio},
   note = {Software, version 4.0.0., swhId: \href{https://archive.softwareheritage.org/swh:1:dir:c9c71b24bc997397673f39f0cb7905660f09ff88;origin=https://github.com/jermp/fulgor;visit=swh:1:snp:f2aa3209d4aa6194f386a03c1d72b6397dc7f594;anchor=swh:1:rev:13cd36de656e27a8082a5fed127598548e79fd0c}{\texttt{swh:1:dir:c9c71b24bc997397673f39f0cb7905660f09ff88}} (visited on 2025-08-15)},
   url = {https://github.com/jermp/fulgor},
   doi = {10.4230/artifacts.24311},
}
Document
Fast Pseudoalignment Queries on Compressed Colored de Bruijn Graphs

Authors: Alessio Campanelli, Giulio Ermanno Pibiri, and Rob Patro

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)


Abstract
Motivation. Indexes for the colored de Bruijn graph (c-dBG) play a crucial role in computational biology by facilitating complex tasks such as read mapping and assembly. These indexes map k-mers (substrings of length k) appearing in a large collection of reference strings to the set of identifiers of the strings where they appear. These sets, colloquially referred to as color sets, tend to occupy large quantities of memory, especially for large pangenomes. Our previous work thus focused on leveraging the repetitiveness of the color sets to improve the space effectiveness of the resulting index. As a matter of fact, repetition-aware indexes can be up to one order of magnitude smaller on large pangenomes compared to indexes that do not exploit such repetitiveness. Such improved space effectiveness, on the other hand, imposes an overhead at query time when performing tasks such as pseudoalignment that require the collection and processing of multiple related color sets. Methods. In this paper, we show how to avoid this overhead. We devise novel query algorithms tailored for the specific repetition-aware representations adopted by the Fulgor index, a state-of-the-art c-dBG index, to significantly improve its pseudoalignment efficiency and without consuming additional space. Results. Our results indicate that with increasing redundancy in the pangenomes, the compression factor provided by the Fulgor index increases, while the relative query time actually reduces. For example, while the space of the Fulgor index improves by 2.5× with repetition-aware compression and its query time improves by 1.6× on a collection of 5,000 Salmonella Enterica genomes, these factors become (6.1×,2.8×) and (11.2×,3.2×) for 50,000 and 150,000 genomes respectively. For an even larger collection of 300,000 genomes, we obtained an index that is 22.3× smaller and 2.2× faster.

Cite as

Alessio Campanelli, Giulio Ermanno Pibiri, and Rob Patro. Fast Pseudoalignment Queries on Compressed Colored de Bruijn Graphs. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 6:1-6:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{campanelli_et_al:LIPIcs.WABI.2025.6,
  author =	{Campanelli, Alessio and Pibiri, Giulio Ermanno and Patro, Rob},
  title =	{{Fast Pseudoalignment Queries on Compressed Colored de Bruijn Graphs}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{6:1--6:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.6},
  URN =		{urn:nbn:de:0030-drops-239327},
  doi =		{10.4230/LIPIcs.WABI.2025.6},
  annote =	{Keywords: Colored de Bruijn graphs, Pseudoalignment, Repetition-aware compression}
}
Document
U-Index: A Universal Indexing Framework for Matching Long Patterns

Authors: Lorraine A. K. Ayad, Gabriele Fici, Ragnar Groot Koerkamp, Grigorios Loukides, Rob Patro, Giulio Ermanno Pibiri, and Solon P. Pissis

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)


Abstract
Motivation. Text indexing is a fundamental and well-studied problem. Classic solutions to this problem either replace the original text with a compressed representation, e.g., the FM-index and its variants, or keep it uncompressed but attach some redundancy - an index - to accelerate matching, e.g., the suffix array. The former solutions thus retain excellent compressed space, but are practically slow to construct and query. The latter approaches, instead, sacrifice space efficiency but are typically faster; for example, the suffix array takes much more space than the text itself for commonly used alphabets, like ASCII or DNA, but it is very fast to construct and query. Methods. In this paper, we show that efficient text indexing can be achieved using just a small extra space on top of the original text, provided that the query patterns are sufficiently long. More specifically, we develop a new indexing paradigm in which a sketch of a query pattern is first matched against a sketch of the text. Once candidate matches are retrieved, they are verified using the original text. This paradigm is thus universal in the sense that it allows us to use any solution to index the sketched text, like a suffix array, FM-index, or r-index. Results. We explore both the theory and the practice of this universal framework. With an extensive experimental analysis, we show that, surprisingly, universal indexes can be constructed much faster than their unsketched counterparts and take a fraction of the space, as a direct consequence of (i) having a lower bound on the length of patterns and (ii) working in sketch space. Furthermore, these data structures have the potential of retaining or even improving query time, because matching against the sketched text is faster and verifying candidates can be theoretically done in constant time per occurrence (or, in practice, by short and cache-friendly scans of the text). Finally, we discuss some important applications of this novel indexing paradigm to computational biology. We hypothesize that such indexes will be particularly effective when the queries are sufficiently long, and so we demonstrate applications in long-read mapping.

Cite as

Lorraine A. K. Ayad, Gabriele Fici, Ragnar Groot Koerkamp, Grigorios Loukides, Rob Patro, Giulio Ermanno Pibiri, and Solon P. Pissis. U-Index: A Universal Indexing Framework for Matching Long Patterns. In 23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 4:1-4:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


Copy BibTex To Clipboard

@InProceedings{ayad_et_al:LIPIcs.SEA.2025.4,
  author =	{Ayad, Lorraine A. K. and Fici, Gabriele and Groot Koerkamp, Ragnar and Loukides, Grigorios and Patro, Rob and Pibiri, Giulio Ermanno and Pissis, Solon P.},
  title =	{{U-Index: A Universal Indexing Framework for Matching Long Patterns}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{4:1--4:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.4},
  URN =		{urn:nbn:de:0030-drops-232420},
  doi =		{10.4230/LIPIcs.SEA.2025.4},
  annote =	{Keywords: Text Indexing, Sketching, Minimizers, Hashing}
}
Document
PHOBIC: Perfect Hashing With Optimized Bucket Sizes and Interleaved Coding

Authors: Stefan Hermann, Hans-Peter Lehmann, Giulio Ermanno Pibiri, Peter Sanders, and Stefan Walzer

Published in: LIPIcs, Volume 308, 32nd Annual European Symposium on Algorithms (ESA 2024)


Abstract
A minimal perfect hash function (or MPHF) maps a set of n keys to [n] : = {1, …, n} without collisions. Such functions find widespread application e.g. in bioinformatics and databases. In this paper we revisit PTHash - a construction technique particularly designed for fast queries. PTHash distributes the input keys into small buckets and, for each bucket, it searches for a hash function seed that places its keys in the output domain without collisions. The collection of all seeds is then stored in a compressed way. Since the first buckets are easier to place, buckets are considered in non-increasing order of size. Additionally, PTHash heuristically produces an imbalanced distribution of bucket sizes by distributing 60% of the keys into 30% of the buckets. Our main contribution is to characterize, up to lower order terms, an optimal choice for the expected bucket sizes, improving construction throughput for space efficient configurations both in theory and practice. Further contributions include a new encoding scheme for seeds that works across partitions of the data structure and a GPU parallelization. Compared to PTHash, PHOBIC is 0.17 bits/key more space efficient for same query time and construction throughput. For a configuration with fast queries, our GPU implementation can construct an MPHF at 2.17 bits/key in 28 ns/key, which can be queried in 37 ns/query on the CPU.

Cite as

Stefan Hermann, Hans-Peter Lehmann, Giulio Ermanno Pibiri, Peter Sanders, and Stefan Walzer. PHOBIC: Perfect Hashing With Optimized Bucket Sizes and Interleaved Coding. In 32nd Annual European Symposium on Algorithms (ESA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 308, pp. 69:1-69:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{hermann_et_al:LIPIcs.ESA.2024.69,
  author =	{Hermann, Stefan and Lehmann, Hans-Peter and Pibiri, Giulio Ermanno and Sanders, Peter and Walzer, Stefan},
  title =	{{PHOBIC: Perfect Hashing With Optimized Bucket Sizes and Interleaved Coding}},
  booktitle =	{32nd Annual European Symposium on Algorithms (ESA 2024)},
  pages =	{69:1--69:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-338-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{308},
  editor =	{Chan, Timothy and Fischer, Johannes and Iacono, John and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2024.69},
  URN =		{urn:nbn:de:0030-drops-211405},
  doi =		{10.4230/LIPIcs.ESA.2024.69},
  annote =	{Keywords: Compressed Data Structures, Minimal Perfect Hashing, GPU}
}
Document
The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers

Authors: Ragnar Groot Koerkamp and Giulio Ermanno Pibiri

Published in: LIPIcs, Volume 312, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024)


Abstract
Motivation. Given a string S, a minimizer scheme is an algorithm defined by a triple (k,w,𝒪) that samples a subset of k-mers (k-long substrings) from a string S. Specifically, it samples the smallest k-mer according to the order 𝒪 from each window of w consecutive k-mers in S. Because consecutive windows can sample the same k-mer, the set of the sampled k-mers is typically much smaller than S. More generally, we consider substring sampling algorithms that respect a window guarantee: at least one k-mer must be sampled from every window of w consecutive k-mers. As a sampled k-mer is uniquely identified by its absolute position in S, we can define the density of a sampling algorithm as the fraction of distinct sampled positions. Good methods have low density which, by respecting the window guarantee, is lower bounded by 1/w. It is however difficult to design a sequence-agnostic algorithm with provably optimal density. In practice, the order 𝒪 is usually implemented using a pseudo-random hash function to obtain the so-called random minimizer. This scheme is simple to implement, very fast to compute even in streaming fashion, and easy to analyze. However, its density is almost a factor of 2 away from the lower bound for large windows. Methods. In this work we introduce mod-sampling, a two-step sampling algorithm to obtain new minimizer schemes. Given a (small) parameter t, the mod-sampling algorithm finds the position p of the smallest t-mer in a window. It then samples the k-mer at position pod w. The lr-minimizer uses t = k-w and the mod-minimizer uses t≡ k (mod w). Results. These new schemes have provably lower density than random minimizers and other schemes when k is large compared to w, while being as fast to compute. Importantly, the mod-minimizer achieves optimal density when k → ∞. Although the mod-minimizer is not the first method to achieve optimal density for large k, its proof of optimality is simpler than previous work. We provide pseudocode for a number of other methods and compare to them. In practice, the mod-minimizer has considerably lower density than the random minimizer and other state-of-the-art methods, like closed syncmers and miniception, when k > w. We plugged the mod-minimizer into SSHash, a k-mer dictionary based on minimizers. For default parameters (w,k) = (11,21), space usage decreases by 15% when indexing the whole human genome (GRCh38), while maintaining its fast query time.

Cite as

Ragnar Groot Koerkamp and Giulio Ermanno Pibiri. The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 11:1-11:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{grootkoerkamp_et_al:LIPIcs.WABI.2024.11,
  author =	{Groot Koerkamp, Ragnar and Pibiri, Giulio Ermanno},
  title =	{{The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers}},
  booktitle =	{24th International Workshop on Algorithms in Bioinformatics (WABI 2024)},
  pages =	{11:1--11:23},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-340-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{312},
  editor =	{Pissis, Solon P. and Sung, Wing-Kin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.11},
  URN =		{urn:nbn:de:0030-drops-206552},
  doi =		{10.4230/LIPIcs.WABI.2024.11},
  annote =	{Keywords: Minimizers, Randomized algorithms, Sketching, Hashing}
}
Document
Fulgor: A Fast and Compact {k-mer} Index for Large-Scale Matching and Color Queries

Authors: Jason Fan, Noor Pratap Singh, Jamshed Khan, Giulio Ermanno Pibiri, and Rob Patro

Published in: LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)


Abstract
The problem of sequence identification or matching - determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence - is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1+o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2-6 × faster to construct.

Cite as

Jason Fan, Noor Pratap Singh, Jamshed Khan, Giulio Ermanno Pibiri, and Rob Patro. Fulgor: A Fast and Compact {k-mer} Index for Large-Scale Matching and Color Queries. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 18:1-18:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


Copy BibTex To Clipboard

@InProceedings{fan_et_al:LIPIcs.WABI.2023.18,
  author =	{Fan, Jason and Singh, Noor Pratap and Khan, Jamshed and Pibiri, Giulio Ermanno and Patro, Rob},
  title =	{{Fulgor: A Fast and Compact \{k-mer\} Index for Large-Scale Matching and Color Queries}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{18:1--18:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.18},
  URN =		{urn:nbn:de:0030-drops-186446},
  doi =		{10.4230/LIPIcs.WABI.2023.18},
  annote =	{Keywords: k-mers, Colored de Bruijn Graph, Compression, Read-mapping}
}
Document
On Weighted k-mer Dictionaries

Authors: Giulio Ermanno Pibiri

Published in: LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)


Abstract
We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri, Bioinformatics 2022) to also store compactly the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing (several times) better compression than the empirical entropy of the weights. We also study the problem of reducing the number of runs in the weights to improve compression even further and illustrate a lower bound for this problem. We propose an efficient, greedy, algorithm to reduce the number of runs and show empirically that it performs well, i.e., very similarly to the lower bound. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.

Cite as

Giulio Ermanno Pibiri. On Weighted k-mer Dictionaries. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 9:1-9:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)


Copy BibTex To Clipboard

@InProceedings{pibiri:LIPIcs.WABI.2022.9,
  author =	{Pibiri, Giulio Ermanno},
  title =	{{On Weighted k-mer Dictionaries}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{9:1--9:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.9},
  URN =		{urn:nbn:de:0030-drops-170439},
  doi =		{10.4230/LIPIcs.WABI.2022.9},
  annote =	{Keywords: K-Mers, Weights, Compression, Hashing}
}
Document
Dynamic Elias-Fano Representation

Authors: Giulio Ermanno Pibiri and Rossano Venturini

Published in: LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)


Abstract
We show that it is possible to store a dynamic ordered set S of n integers drawn from a bounded universe of size u in space close to the information-theoretic lower bound and preserve, at the same time, the asymptotic time optimality of the operations. Our results leverage on the Elias-Fano representation of monotone integer sequences, which can be shown to be less than half a bit per element away from the information-theoretic minimum. In particular, considering a RAM model with memory word size Theta(log u) bits, when integers are drawn from a polynomial universe of size u = n^gamma for any gamma = Theta(1), we add o(n) bits to the static Elias-Fano representation in order to: 1. support static predecessor/successor queries in O(min{1+log(u/n), loglog n}); 2. make S grow in an append-only fashion by spending O(1) per inserted element; 3. describe a dynamic data structure supporting random access in O(log n / loglog n) worst-case, insertions/deletions in O(log n / loglog n) amortized and predecessor/successor queries in O(min{1+log(u/n), loglog n}) worst-case time. These time bounds are optimal.

Cite as

Giulio Ermanno Pibiri and Rossano Venturini. Dynamic Elias-Fano Representation. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 30:1-30:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


Copy BibTex To Clipboard

@InProceedings{pibiri_et_al:LIPIcs.CPM.2017.30,
  author =	{Pibiri, Giulio Ermanno and Venturini, Rossano},
  title =	{{Dynamic Elias-Fano Representation}},
  booktitle =	{28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)},
  pages =	{30:1--30:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-039-2},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{78},
  editor =	{K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.30},
  URN =		{urn:nbn:de:0030-drops-73244},
  doi =		{10.4230/LIPIcs.CPM.2017.30},
  annote =	{Keywords: succinct data structures, integer sets, predecessor problem, Elias-Fano}
}
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail