4 Search Results for "Khan, Jamshed"

Document

DOI: 10.4230/LIPIcs.WABI.2025.16

Estimation of Substitution and Indel Rates via k-mer Statistics

Authors: Mahmudur Rahman Hera, Paul Medvedev, David Koslicki, and Antonio Blanca

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

Methods utilizing k-mers are widely used in bioinformatics, yet our understanding of their statistical properties under realistic mutation models remains incomplete. Previously, substitution-only mutation models have been considered to derive precise expectations and variances for mutated k-mers and intervals of mutated and non-mutated sequences. In this work, we consider a mutation model that incorporates insertions and deletions in addition to single-nucleotide substitutions. Within this framework, we derive closed-form k-mer-based estimators for the three fundamental mutation parameters: substitution, deletion, and insertion rates. We provide theoretical guarantees in the form of concentration inequalities, ensuring accuracy of our estimators under reasonable model assumptions. Empirical evaluations on simulated evolution of genomic sequences confirm our theoretical findings, demonstrating that accounting for insertion and deletion signals allows for accurate estimation of mutation rates and improves upon the results obtained by considering a substitution-only model. An implementation for estimating the mutation parameters from a pair of FASTA files is available here: https://github.com/KoslickiLab/estimate_rates_using_mutation_model.git. The results presented in this manuscript can be reproduced using the code available here: https://github.com/KoslickiLab/est_rates_experiments.git.

Cite as

Mahmudur Rahman Hera, Paul Medvedev, David Koslicki, and Antonio Blanca. Estimation of Substitution and Indel Rates via k-mer Statistics. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 16:1-16:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{rahmanhera_et_al:LIPIcs.WABI.2025.16,
  author =	{Rahman Hera, Mahmudur and Medvedev, Paul and Koslicki, David and Blanca, Antonio},
  title =	{{Estimation of Substitution and Indel Rates via k-mer Statistics}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{16:1--16:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.16},
  URN =		{urn:nbn:de:0030-drops-239422},
  doi =		{10.4230/LIPIcs.WABI.2025.16},
  annote =	{Keywords: k-mers, mutation rate, indel, alignment-free, estimation, substitution, insertion, deletion}
}

Document

DOI: 10.4230/LIPIcs.WABI.2025.6

Fast Pseudoalignment Queries on Compressed Colored de Bruijn Graphs

Authors: Alessio Campanelli, Giulio Ermanno Pibiri, and Rob Patro

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

Motivation. Indexes for the colored de Bruijn graph (c-dBG) play a crucial role in computational biology by facilitating complex tasks such as read mapping and assembly. These indexes map k-mers (substrings of length k) appearing in a large collection of reference strings to the set of identifiers of the strings where they appear. These sets, colloquially referred to as color sets, tend to occupy large quantities of memory, especially for large pangenomes. Our previous work thus focused on leveraging the repetitiveness of the color sets to improve the space effectiveness of the resulting index. As a matter of fact, repetition-aware indexes can be up to one order of magnitude smaller on large pangenomes compared to indexes that do not exploit such repetitiveness. Such improved space effectiveness, on the other hand, imposes an overhead at query time when performing tasks such as pseudoalignment that require the collection and processing of multiple related color sets.

Methods. In this paper, we show how to avoid this overhead. We devise novel query algorithms tailored for the specific repetition-aware representations adopted by the Fulgor index, a state-of-the-art c-dBG index, to significantly improve its pseudoalignment efficiency without consuming additional space.

Results. Our results indicate that with increasing redundancy in the pangenomes, the compression factor provided by the Fulgor index increases, while the relative query time actually decreases. For example, while the space of the Fulgor index improves by 2.5× with repetition-aware compression and its query time improves by 1.6× on a collection of 5,000 Salmonella enterica genomes, these factors become (6.1×, 2.8×) and (11.2×, 3.2×) for 50,000 and 150,000 genomes, respectively. For an even larger collection of 300,000 genomes, we obtained an index that is 22.3× smaller and 2.2× faster.
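As an illustration of the mapping described in the abstract above, the following minimal sketch builds an uncompressed dictionary from k-mers to color sets and answers a query by intersecting the color sets of its k-mers. This is an assumption-laden toy, not the Fulgor data structure or its query algorithms: the function names `build_color_map` and `pseudoalign` are hypothetical, reverse complements are ignored, and full intersection is only one common way to define pseudoalignment.

```python
from collections import defaultdict


def build_color_map(references, k):
    """Map each k-mer to the set of reference ids containing it (its color set)."""
    colors = defaultdict(set)
    for ref_id, seq in enumerate(references):
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(ref_id)
    return dict(colors)


def pseudoalign(query, colors, k):
    """Intersect the color sets of all query k-mers that occur in the map."""
    result = None
    for i in range(len(query) - k + 1):
        color_set = colors.get(query[i:i + k])
        if color_set is None:
            continue  # k-mers absent from the map are skipped in this sketch
        result = set(color_set) if result is None else result & color_set
    return result or set()


# Toy data (hypothetical): three short reference strings.
references = ["ACGTAC", "CGTACG", "TTTACG"]
colors = build_color_map(references, k=3)
hits = pseudoalign("CGTAC", colors, k=3)
print(sorted(hits))
```

In this toy, many k-mers share identical color sets, which hints at the redundancy that repetition-aware indexes exploit; the sketch itself stores every set separately.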

Cite as

Alessio Campanelli, Giulio Ermanno Pibiri, and Rob Patro. Fast Pseudoalignment Queries on Compressed Colored de Bruijn Graphs. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 6:1-6:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{campanelli_et_al:LIPIcs.WABI.2025.6,
  author =	{Campanelli, Alessio and Pibiri, Giulio Ermanno and Patro, Rob},
  title =	{{Fast Pseudoalignment Queries on Compressed Colored de Bruijn Graphs}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{6:1--6:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.6},
  URN =		{urn:nbn:de:0030-drops-239327},
  doi =		{10.4230/LIPIcs.WABI.2025.6},
  annote =	{Keywords: Colored de Bruijn graphs, Pseudoalignment, Repetition-aware compression}
}

Document

DOI: 10.4230/LIPIcs.WABI.2023.16

Fast, Parallel, and Cache-Friendly Suffix Array Construction

Authors: Jamshed Khan, Tobias Rubel, Laxman Dhulipala, Erin Molloy, and Rob Patro

Published in: LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)

Abstract

String indexes such as the suffix array (SA) and the closely related longest common prefix (LCP) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize. In this paper we present CaPS-SA, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort. Due to its design, CaPS-SA has excellent memory-locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. We show that despite its simple design, CaPS-SA outperforms existing state-of-the-art parallel SA and LCP-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context SA and show that CaPS-SA can easily be extended to exploit this structure to obtain further speedups.
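For readers unfamiliar with the two indexes named in the abstract above, the following naive sketch shows what a suffix array and its LCP array contain. It is a definition-level illustration only, assuming plain Python string comparison; it is not CaPS-SA and has none of its parallelism or cache behavior. The function names are hypothetical.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Naive construction: sorts full suffixes, which is quadratic or worse.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[r] is the longest common prefix length of suffixes sa[r-1] and sa[r]."""
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b = sa[r - 1], sa[r]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[r] = n
    return lcp


text = "banana"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
print(sa)   # [5, 3, 1, 0, 4, 2]
print(lcp)  # [0, 1, 3, 0, 0, 2]
```

Here `lcp[0]` is set to 0 by convention, since the first suffix in sorted order has no predecessor.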

Cite as

Jamshed Khan, Tobias Rubel, Laxman Dhulipala, Erin Molloy, and Rob Patro. Fast, Parallel, and Cache-Friendly Suffix Array Construction. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 16:1-16:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

@InProceedings{khan_et_al:LIPIcs.WABI.2023.16,
  author =	{Khan, Jamshed and Rubel, Tobias and Dhulipala, Laxman and Molloy, Erin and Patro, Rob},
  title =	{{Fast, Parallel, and Cache-Friendly Suffix Array Construction}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{16:1--16:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.16},
  URN =		{urn:nbn:de:0030-drops-186424},
  doi =		{10.4230/LIPIcs.WABI.2023.16},
  annote =	{Keywords: Suffix Array, Longest Common Prefix, Data Structures, Indexing, Parallel Algorithms}
}

Document

DOI: 10.4230/LIPIcs.WABI.2023.18

Fulgor: A Fast and Compact k-mer Index for Large-Scale Matching and Color Queries

Authors: Jason Fan, Noor Pratap Singh, Jamshed Khan, Giulio Ermanno Pibiri, and Rob Patro

Published in: LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)

Abstract

The problem of sequence identification or matching - determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence - is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1+o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2-6× faster to construct.

Cite as

Jason Fan, Noor Pratap Singh, Jamshed Khan, Giulio Ermanno Pibiri, and Rob Patro. Fulgor: A Fast and Compact k-mer Index for Large-Scale Matching and Color Queries. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 18:1-18:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

@InProceedings{fan_et_al:LIPIcs.WABI.2023.18,
  author =	{Fan, Jason and Singh, Noor Pratap and Khan, Jamshed and Pibiri, Giulio Ermanno and Patro, Rob},
  title =	{{Fulgor: A Fast and Compact {k-mer} Index for Large-Scale Matching and Color Queries}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{18:1--18:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.18},
  URN =		{urn:nbn:de:0030-drops-186446},
  doi =		{10.4230/LIPIcs.WABI.2023.18},
  annote =	{Keywords: k-mers, Colored de Bruijn Graph, Compression, Read-mapping}
}
