DROPS

Document

DOI: 10.4230/LIPIcs.WABI.2025.4

Haplotype-Aware Long-Read Error Correction

Authors: Parvesh Barak, Daniel Gibney, and Chirag Jain

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

Error correction of long reads is an important initial step in genome assembly workflows. For organisms with ploidy greater than one, it is important to preserve haplotype-specific variation during read correction. This challenge has driven the development of several haplotype-aware correction methods. However, existing methods are based on either ad-hoc heuristics or deep learning approaches. In this paper, we introduce a rigorous formulation for this problem. Our approach builds on the minimum error correction framework used in reference-based haplotype phasing. We prove that the proposed formulation for error correction of reads in a de novo context, i.e., without using a reference genome, is NP-hard. To make our exact algorithm scale to large datasets, we introduce practical heuristics. Experiments using PacBio HiFi sequencing datasets from human and plant genomes show that our approach achieves accuracy comparable to state-of-the-art methods. The software is freely available at https://github.com/at-cg/HALE.
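For readers unfamiliar with the minimum error correction (MEC) framework mentioned in the abstract, the following is a minimal sketch of the classical reference-based MEC objective from haplotype phasing, solved by brute force on a toy input. It is not the paper's de novo formulation or the HALE implementation; the function names `group_cost` and `mec_brute_force` and the fragment encoding (strings over `0`, `1`, and `-` for uncovered sites) are assumptions made for illustration.

```python
from itertools import product


def group_cost(rows):
    """Number of allele flips needed so that all rows in one group agree
    at every column. '-' marks an uncovered site and costs nothing."""
    if not rows:
        return 0
    cost = 0
    for col in zip(*rows):
        zeros = col.count("0")
        ones = col.count("1")
        cost += min(zeros, ones)  # flip the minority allele at this column
    return cost


def mec_brute_force(fragments):
    """Try every bipartition of the fragments and return (cost, assignment).

    Exponential in the number of fragments, so only suitable for toy inputs.
    All fragments are assumed to have the same length.
    """
    best_cost, best_assign = None, None
    for assign in product((0, 1), repeat=len(fragments)):
        # Swapping the two group labels gives the same partition,
        # so fix the first fragment in group 0.
        if assign and assign[0] == 1:
            continue
        groups = ([], [])
        for frag, g in zip(fragments, assign):
            groups[g].append(frag)
        cost = group_cost(groups[0]) + group_cost(groups[1])
        if best_cost is None or cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_cost, best_assign


if __name__ == "__main__":
    toy = ["0011", "0010", "1100", "1101", "00-1"]
    print(mec_brute_force(toy))
```

In this toy input, the fragments fall into two haplotype groups, and the remaining disagreements within each group are the entries that MEC would treat as sequencing errors.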

Cite as

Parvesh Barak, Daniel Gibney, and Chirag Jain. Haplotype-Aware Long-Read Error Correction. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 4:1-4:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)


@InProceedings{barak_et_al:LIPIcs.WABI.2025.4,
  author =	{Barak, Parvesh and Gibney, Daniel and Jain, Chirag},
  title =	{{Haplotype-Aware Long-Read Error Correction}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{4:1--4:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.4},
  URN =		{urn:nbn:de:0030-drops-239300},
  doi =		{10.4230/LIPIcs.WABI.2025.4},
  annote =	{Keywords: Genome assembly, phasing, clustering, overlap graph, consensus}
}

Document

DOI: 10.4230/LIPIcs.WABI.2023.12

Co-Linear Chaining on Pangenome Graphs

Authors: Jyotshna Rajput, Ghanshyam Chandra, and Chirag Jain

Published in: LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)

Abstract

Pangenome reference graphs are useful in genomics because they compactly represent the genetic diversity within a species, a capability that linear references lack. However, efficiently aligning sequences to these graphs with complex topology and cycles can be challenging. Seed-chain-extend alignment algorithms use co-linear chaining as a standard technique to identify a good cluster of exact seed matches that can be combined to form an alignment. Recent works show how the co-linear chaining problem can be efficiently solved for acyclic pangenome graphs by exploiting their small width [Mäkinen et al., TALG'19] and how incorporating gap cost in the scoring function improves alignment accuracy [Chandra and Jain, RECOMB'23]. However, it remains open how to effectively generalize these techniques to general pangenome graphs, which contain cycles. Here we present the first practical formulation and an exact algorithm for co-linear chaining on cyclic pangenome graphs. We rigorously prove the correctness and computational complexity of the proposed algorithm. We evaluate the empirical performance of our algorithm by aligning simulated long reads from the human genome to a cyclic pangenome graph constructed from 95 publicly available haplotype-resolved human genome assemblies. While the existing heuristic-based algorithms are faster, the proposed algorithm provides a significant advantage in terms of accuracy.
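As background for the co-linear chaining step described above, here is a minimal sketch of gap-aware chaining between two linear sequences using a quadratic-time dynamic program. It does not handle graphs or cycles and is not the algorithm from the paper; the anchor layout `(query_start, ref_start, length)`, the scoring rule (anchor length minus the absolute difference of the two gaps), and the name `chain_anchors` are all assumptions chosen for illustration.

```python
def chain_anchors(anchors):
    """Return (best_score, chain) for a list of exact-match anchors.

    Each anchor is (query_start, ref_start, length) with length > 0.
    Anchor j may precede anchor i only if j ends before i starts on both
    the query and the reference (no overlaps are allowed in this sketch).
    """
    if not anchors:
        return 0, []
    anchors = sorted(anchors)          # by query start; predecessors come first
    n = len(anchors)
    score = [a[2] for a in anchors]    # a chain consisting of anchor i alone
    prev = [-1] * n
    for i in range(n):
        qi, ri, li = anchors[i]
        for j in range(i):
            qj, rj, lj = anchors[j]
            gap_q = qi - (qj + lj)
            gap_r = ri - (rj + lj)
            if gap_q < 0 or gap_r < 0:
                continue               # j does not precede i
            cand = score[j] + li - abs(gap_q - gap_r)
            if cand > score[i]:
                score[i], prev[i] = cand, j
    # Trace back from the best-scoring chain end.
    end = max(range(n), key=lambda k: score[k])
    best = score[end]
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    chain.reverse()
    return best, chain


if __name__ == "__main__":
    seeds = [(0, 0, 5), (6, 6, 4), (3, 20, 6), (12, 13, 5)]
    print(chain_anchors(seeds))
```

The off-diagonal anchor `(3, 20, 6)` is left out of the chain because it cannot be combined with the other anchors without violating the ordering on the reference.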

Cite as

Jyotshna Rajput, Ghanshyam Chandra, and Chirag Jain. Co-Linear Chaining on Pangenome Graphs. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 12:1-12:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)


@InProceedings{rajput_et_al:LIPIcs.WABI.2023.12,
  author =	{Rajput, Jyotshna and Chandra, Ghanshyam and Jain, Chirag},
  title =	{{Co-Linear Chaining on Pangenome Graphs}},
  booktitle =	{23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)},
  pages =	{12:1--12:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-294-5},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{273},
  editor =	{Belazzougui, Djamal and Ouangraoua, A\"{i}da},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.12},
  URN =		{urn:nbn:de:0030-drops-186389},
  doi =		{10.4230/LIPIcs.WABI.2023.12},
  annote =	{Keywords: Sequence alignment, variation graph, genome sequencing, path cover}
}

Document

DOI: 10.4230/LIPIcs.WABI.2019.17

Validating Paired-End Read Alignments in Sequence Graphs

Authors: Chirag Jain, Haowen Zhang, Alexander Dilthey, and Srinivas Aluru

Published in: LIPIcs, Volume 143, 19th International Workshop on Algorithms in Bioinformatics (WABI 2019)

Abstract

Graph-based non-linear reference structures such as variation graphs and colored de Bruijn graphs enable incorporation of full genomic diversity within a population. However, transitioning from a simple string-based reference to graphs requires addressing many computational challenges, one of which concerns accurately mapping sequencing read sets to graphs. Paired-end Illumina sequencing is a commonly used sequencing platform in genomics, where the paired-end distance constraints allow disambiguation of repeats. Many recent works have explored provably good index-based and alignment-based strategies for mapping individual reads to graphs. However, validating distance constraints efficiently over graphs is not trivial, and existing sequence-to-graph mappers rely on heuristics. We introduce a mathematical formulation of the problem, and provide a new algorithm to solve it exactly. We take advantage of the high sparsity of reference graphs, and use sparse matrix-matrix multiplications (SpGEMM) to build an index which can be queried efficiently by a mapping algorithm for validating the distance constraints. Effectiveness of the algorithm is demonstrated using real reference graphs, including a human MHC variation graph, and a pan-genome de Bruijn graph built using genomes of 20 B. anthracis strains. While the one-time indexing time can vary from a few minutes to a few hours using our algorithm, answering a million distance queries takes less than a second.
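To make the idea of a multiplication-based distance index concrete, the sketch below uses Boolean sparse matrix products to precompute which vertex pairs are joined by a walk whose length lies in a given range. It is an illustration under simplifying assumptions (unit-length edges, a pure-Python dictionary-of-sets matrix rather than an optimized SpGEMM library), not the authors' implementation; the names `bool_spgemm`, `build_distance_index`, and `is_valid` are hypothetical. It relies on the identity that, over the Boolean semiring, A^d1 (A + I)^(d2 - d1) has a nonzero at (u, v) exactly when some walk from u to v has length between d1 and d2.

```python
def bool_spgemm(a, b):
    """Boolean sparse product of two matrices stored as
    {row: set_of_columns_with_a_nonzero}."""
    out = {}
    for i, cols in a.items():
        reach = set()
        for k in cols:
            reach.update(b.get(k, ()))
        if reach:
            out[i] = reach
    return out


def build_distance_index(adj, n, d1, d2):
    """Index whose entry (u, v) is nonzero iff some walk from u to v has a
    length in [d1, d2]. `adj` maps each vertex in range(n) to its
    out-neighbours; every edge is assumed to have length 1."""
    identity = {v: {v} for v in range(n)}
    a_plus_i = {v: set(adj.get(v, ())) | {v} for v in range(n)}
    index = identity
    for _ in range(d1):               # A^d1: walks of length exactly d1
        index = bool_spgemm(index, adj)
    for _ in range(d2 - d1):          # each factor allows 0 or 1 extra step
        index = bool_spgemm(index, a_plus_i)
    return index


def is_valid(index, u, v):
    """Constant-time lookup: can v be reached from u within the range?"""
    return v in index.get(u, ())


if __name__ == "__main__":
    chain = {0: {1}, 1: {2}, 2: {3}, 3: {4}}   # simple path 0 -> 1 -> 2 -> 3 -> 4
    idx = build_distance_index(chain, n=5, d1=2, d2=3)
    print(is_valid(idx, 0, 3), is_valid(idx, 0, 4))
```

Once the index is built, each paired-end distance check reduces to a set-membership lookup, which mirrors the split between a one-time indexing cost and fast queries described in the abstract.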

Cite as

Chirag Jain, Haowen Zhang, Alexander Dilthey, and Srinivas Aluru. Validating Paired-End Read Alignments in Sequence Graphs. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 17:1-17:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{jain_et_al:LIPIcs.WABI.2019.17,
  author =	{Jain, Chirag and Zhang, Haowen and Dilthey, Alexander and Aluru, Srinivas},
  title =	{{Validating Paired-End Read Alignments in Sequence Graphs}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{17:1--17:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.17},
  URN =		{urn:nbn:de:0030-drops-110470},
  doi =		{10.4230/LIPIcs.WABI.2019.17},
  annote =	{Keywords: Sequence graphs, read mapping, index, sparse matrix-matrix multiplication}
}
