DROPS

Document

DOI: 10.4230/LIPIcs.WABI.2024.12

Reconstructing Rearrangement Phylogenies of Natural Genomes

Authors: Leonard Bohnenkämper, Jens Stoye, and Daniel Dörr

Published in: LIPIcs, Volume 312, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024)

Abstract

We study the classical problem of inferring ancestral genomes from a set of extant genomes under a given phylogeny, known as the Small Parsimony Problem (SPP). Genomes are represented as sequences of oriented markers, organized in one or more linear or circular chromosomes. Any marker may appear in several copies, without restriction on orientation or genomic location, known as the natural genomes model. Evolutionary events along the branches of the phylogeny encompass large scale rearrangements, including segmental inversions, translocations, gain and loss (DCJ-indel model). Even under simpler rearrangement models, such as the classical breakpoint model without duplicates, the SPP is computationally intractable. Nevertheless, the SPP for natural genomes under the DCJ-indel model has been studied recently, with limited success. Here, we improve on that earlier work, giving a highly optimized ILP that is able to solve the SPP for sufficiently small phylogenies and gene families. A notable improvement w.r.t. the previous result is an optimized way of handling both circular and linear chromosomes. This is especially relevant to the SPP, since the chromosomal structure of ancestral genomes is unknown and the solution space for this chromosomal structure is typically large. We benchmark our method on simulated and real data. On simulated phylogenies we observe a considerable performance improvement on problems that include linear chromosomes. And even when the ground truth contains only one circular chromosome per genome, our method outperforms its predecessor due to its optimized handling of the solution space. The practical advantage becomes also visible in an analysis of seven Anopheles taxa.

Cite as

Leonard Bohnenkämper, Jens Stoye, and Daniel Dörr. Reconstructing Rearrangement Phylogenies of Natural Genomes. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 12:1-12:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@InProceedings{bohnenkamper_et_al:LIPIcs.WABI.2024.12,
  author =	{Bohnenk\"{a}mper, Leonard and Stoye, Jens and D\"{o}rr, Daniel},
  title =	{{Reconstructing Rearrangement Phylogenies of Natural Genomes}},
  booktitle =	{24th International Workshop on Algorithms in Bioinformatics (WABI 2024)},
  pages =	{12:1--12:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-340-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{312},
  editor =	{Pissis, Solon P. and Sung, Wing-Kin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.12},
  URN =		{urn:nbn:de:0030-drops-206564},
  doi =		{10.4230/LIPIcs.WABI.2024.12},
  annote =	{Keywords: genome rearrangement, ancestral reconstruction, small parsimony, integer linear programming, double-cut-and-join}
}

Document

DOI: 10.4230/LIPIcs.WABI.2024.22

Anchorage Accurately Assembles Anchor-Flanked Synthetic Long Reads

Authors: Xiaofei Carl Zang, Xiang Li, Kyle Metcalfe, Tuval Ben-Yehezkel, Ryan Kelley, and Mingfu Shao

Published in: LIPIcs, Volume 312, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024)

Abstract

Modern sequencing technologies allow for the addition of short-sequence tags, known as anchors, to both ends of a captured molecule. Anchors are useful in assembling the full-length sequence of a captured molecule as they can be used to accurately determine the endpoints. One representative of such anchor-enabled technology is LoopSeq Solo, a synthetic long read (SLR) sequencing protocol. LoopSeq Solo also achieves ultra-high sequencing depth and high purity of short reads covering the entire captured molecule. Despite the availability of many assembly methods, constructing full-length sequence from these anchor-enabled, ultra-high coverage sequencing data remains challenging due to the complexity of the underlying assembly graphs and the lack of specific algorithms leveraging anchors. We present Anchorage, a novel assembler that performs anchor-guided assembly for ultra-high-depth sequencing data. Anchorage starts with a kmer-based approach for precise estimation of molecule lengths. It then formulates the assembly problem as finding an optimal path that connects the two nodes determined by anchors in the underlying compact de Bruijn graph. The optimality is defined as maximizing the weight of the smallest node while matching the estimated sequence length. Anchorage uses a modified dynamic programming algorithm to efficiently find the optimal path. Through both simulations and real data, we show that Anchorage outperforms existing assembly methods, particularly in the presence of sequencing artifacts. Anchorage fills the gap in assembling anchor-enabled data. We anticipate its broad use as anchor-enabled sequencing technologies become prevalent. Anchorage is freely available at https://github.com/Shao-Group/anchorage; the scripts and documents that can reproduce all experiments in this manuscript are available at https://github.com/Shao-Group/anchorage-test.
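The path objective described in the abstract (maximize the weight of the smallest node while matching the estimated length) can be illustrated with a small sketch. This is not Anchorage's actual algorithm. It assumes the graph is acyclic, that each node contributes a fixed length to the path, and that the length must match exactly. The names `best_anchor_path`, `succ`, `weight`, and `length` are hypothetical.

```python
from functools import lru_cache

def best_anchor_path(succ, weight, length, source, target, total_len):
    """Return (bottleneck, path) maximizing the minimum node weight among
    source->target paths whose node lengths sum to total_len, else None.
    Assumes an acyclic graph (an illustrative simplification)."""
    @lru_cache(maxsize=None)
    def solve(node, remaining):
        # `remaining` is the length budget left, including `node` itself.
        remaining -= length[node]
        if remaining < 0:
            return None
        if node == target:
            # The path must end at the target with the budget used up.
            return (weight[node], (node,)) if remaining == 0 else None
        best = None
        for nxt in succ.get(node, ()):
            sub = solve(nxt, remaining)
            if sub is None:
                continue
            cand = (min(weight[node], sub[0]), (node,) + sub[1])
            if best is None or cand[0] > best[0]:
                best = cand
        return best
    return solve(source, total_len)

# Toy graph: two anchor nodes "s" and "t" joined by two alternative nodes.
succ = {"s": ["a", "b"], "a": ["t"], "b": ["t"]}
weight = {"s": 10, "a": 3, "b": 7, "t": 9}
length = {"s": 5, "a": 4, "b": 4, "t": 5}
result = best_anchor_path(succ, weight, length, "s", "t", 14)
```

On this toy input the path through "b" is preferred because its weakest node has weight 7, against 3 for the path through "a".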

Cite as

Xiaofei Carl Zang, Xiang Li, Kyle Metcalfe, Tuval Ben-Yehezkel, Ryan Kelley, and Mingfu Shao. Anchorage Accurately Assembles Anchor-Flanked Synthetic Long Reads. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 22:1-22:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@InProceedings{zang_et_al:LIPIcs.WABI.2024.22,
  author =	{Zang, Xiaofei Carl and Li, Xiang and Metcalfe, Kyle and Ben-Yehezkel, Tuval and Kelley, Ryan and Shao, Mingfu},
  title =	{{Anchorage Accurately Assembles Anchor-Flanked Synthetic Long Reads}},
  booktitle =	{24th International Workshop on Algorithms in Bioinformatics (WABI 2024)},
  pages =	{22:1--22:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-340-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{312},
  editor =	{Pissis, Solon P. and Sung, Wing-Kin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.22},
  URN =		{urn:nbn:de:0030-drops-206660},
  doi =		{10.4230/LIPIcs.WABI.2024.22},
  annote =	{Keywords: Genome assembly, de Bruijn graph, synthetic long reads, anchor-guided assembly, LoopSeq}
}

Document

DOI: 10.4230/LIPIcs.SEA.2024.14

Accelerating ILP Solvers for Minimum Flow Decompositions Through Search Space and Dimensionality Reductions

Authors: Andreas Grigorjew, Fernando H. C. Dias, Andrea Cracco, Romeo Rizzi, and Alexandru I. Tomescu

Published in: LIPIcs, Volume 301, 22nd International Symposium on Experimental Algorithms (SEA 2024)

Abstract

Given a flow network, the Minimum Flow Decomposition (MFD) problem is finding the smallest possible set of weighted paths whose superposition equals the flow. It is a classical, strongly NP-hard problem that is proven to be useful in RNA transcript assembly and applications outside of Bioinformatics. We improve an existing ILP (Integer Linear Programming) model by Dias et al. [RECOMB 2022] for DAGs by decreasing the solver’s search space using solution safety and several other optimizations. This results in a significant speedup compared to the original ILP, of up to 34× on average on the hardest instances. Moreover, we show that our optimizations apply also to MFD problem variants, resulting in speedups that go up to 219× on the hardest instances. We also developed an ILP model of reduced dimensionality for an MFD variant in which the solution path weights are restricted to a given set. This model can find an optimal MFD solution for most instances, and overall, its accuracy significantly outperforms that of previous greedy algorithms while being up to an order of magnitude faster than our optimized ILP.
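The definition in the abstract, a set of weighted paths whose superposition equals the flow, can be checked directly. The sketch below only verifies a candidate decomposition. It does not minimize the number of paths and is unrelated to the ILP model of the paper. The function name `is_flow_decomposition` and the data layout are assumptions made for illustration.

```python
from collections import defaultdict

def is_flow_decomposition(flow, weighted_paths):
    """flow: dict mapping an edge (u, v) to its flow value.
    weighted_paths: list of (weight, [v0, v1, ..., vk]) pairs.
    Returns True if the path weights add up to the flow on every edge."""
    total = defaultdict(int)
    for w, path in weighted_paths:
        for edge in zip(path, path[1:]):
            total[edge] += w
    edges = set(flow) | set(total)
    return all(total.get(e, 0) == flow.get(e, 0) for e in edges)

# Toy flow network with source "s" and sink "t".
flow = {("s", "a"): 5, ("a", "t"): 3, ("a", "b"): 2, ("s", "b"): 1, ("b", "t"): 3}
good = [(3, ["s", "a", "t"]), (2, ["s", "a", "b", "t"]), (1, ["s", "b", "t"])]
bad = [(5, ["s", "a", "t"]), (1, ["s", "b", "t"])]
```

Here `good` reproduces the flow on every edge, while `bad` puts 5 units on the edge from "a" to "t", which carries only 3.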

Cite as

Andreas Grigorjew, Fernando H. C. Dias, Andrea Cracco, Romeo Rizzi, and Alexandru I. Tomescu. Accelerating ILP Solvers for Minimum Flow Decompositions Through Search Space and Dimensionality Reductions. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 14:1-14:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@InProceedings{grigorjew_et_al:LIPIcs.SEA.2024.14,
  author =	{Grigorjew, Andreas and Dias, Fernando H. C. and Cracco, Andrea and Rizzi, Romeo and Tomescu, Alexandru I.},
  title =	{{Accelerating ILP Solvers for Minimum Flow Decompositions Through Search Space and Dimensionality Reductions}},
  booktitle =	{22nd International Symposium on Experimental Algorithms (SEA 2024)},
  pages =	{14:1--14:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-325-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{301},
  editor =	{Liberti, Leo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2024.14},
  URN =		{urn:nbn:de:0030-drops-203792},
  doi =		{10.4230/LIPIcs.SEA.2024.14},
  annote =	{Keywords: Flow decomposition, Integer Linear Programming, Safety, RNA-seq, RNA transcript assembly, isoform}
}

Document

DOI: 10.4230/LIPIcs.WABI.2022.22

Locality-Sensitive Bucketing Functions for the Edit Distance

Authors: Ke Chen and Mingfu Shao

Published in: LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)

Abstract

Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rate, but encounter much reduced sensitivity on data with high error rate. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Here we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be (d₁, d₂)-sensitive if any two sequences within an edit distance of d₁ are mapped into at least one shared bucket, and any two sequences with distance at least d₂ are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of (d₁,d₂) and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. These results provide theoretical foundations for their practical use in analyzing sequences with high error rate while also providing insights for the hardness of designing ungapped LSH functions.
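The (d₁, d₂)-sensitivity definition in the abstract can be tested by brute force on short sequences. The sketch below does this for a toy bucketing function that maps a sequence to the set of strings obtained by deleting one character. This toy function and the names `deletion_buckets` and `is_sensitive` are illustrative assumptions and are not claimed to be constructions from the paper.

```python
from itertools import product

def edit_distance(a, b):
    """Standard dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def deletion_buckets(s):
    """Toy bucketing function: one bucket per single-character deletion."""
    return {s[:i] + s[i + 1:] for i in range(len(s))}

def is_sensitive(bucketing, d1, d2, n, alphabet="01"):
    """Check (d1, d2)-sensitivity over all pairs of length-n sequences."""
    seqs = ["".join(p) for p in product(alphabet, repeat=n)]
    for s in seqs:
        for t in seqs:
            d = edit_distance(s, t)
            shared = bucketing(s) & bucketing(t)
            if d <= d1 and not shared:
                return False  # close sequences must share a bucket
            if d >= d2 and shared:
                return False  # distant sequences must not share a bucket
    return True
```

On binary sequences of length 4, the toy function passes the check for (1, 3) but fails it for (1, 2), because sequences such as "0111" and "1110" are at edit distance 2 yet share the bucket "111".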

Cite as

Ke Chen and Mingfu Shao. Locality-Sensitive Bucketing Functions for the Edit Distance. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 22:1-22:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@InProceedings{chen_et_al:LIPIcs.WABI.2022.22,
  author =	{Chen, Ke and Shao, Mingfu},
  title =	{{Locality-Sensitive Bucketing Functions for the Edit Distance}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{22:1--22:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.22},
  URN =		{urn:nbn:de:0030-drops-170563},
  doi =		{10.4230/LIPIcs.WABI.2022.22},
  annote =	{Keywords: Locality-sensitive hashing, locality-sensitive bucketing, long reads, embedding}
}

Document

DOI: 10.4230/LIPIcs.WABI.2019.15

Context-Aware Seeds for Read Mapping

Authors: Hongyi Xin, Mingfu Shao, and Carl Kingsford

Published in: LIPIcs, Volume 143, 19th International Workshop on Algorithms in Bioinformatics (WABI 2019)

Abstract

Motivation: Most modern seed-and-extend NGS read mappers employ a seeding scheme that requires extracting t non-overlapping seeds in each read in order to find all valid mappings under an edit distance threshold of t. As t grows (such as in long reads with high error rate), this seeding scheme forces mappers to use more and shorter seeds, which increases the seed hits (seed frequencies) and therefore reduces the efficiency of mappers. Results: We propose a novel seeding framework, context-aware seeds (CAS). CAS guarantees finding all valid mapping but uses fewer (and longer) seeds, which reduces seed frequencies and increases efficiency of mappers. CAS achieves this improvement by attaching a confidence radius to each seed in the reference. We prove that all valid mappings can be found if the sum of confidence radii of seeds are greater than t. CAS generalizes the existing pigeonhole-principle-based seeding scheme in which this confidence radius is implicitly always 1. Moreover, we design an efficient algorithm that constructs the confidence radius database in linear time. We experiment CAS with E. coli genome and show that CAS reduces seed frequencies by up to 20.3% when compared with the state-of-the-art pigeonhole-principle-based seeding algorithm, the Optimal Seed Solver. Availability: https://github.com/Kingsford-Group/CAS_code
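The pigeonhole baseline that CAS generalizes can be sketched as follows. When every confidence radius is 1, the condition that the radii sum to more than t amounts to using at least t + 1 non-overlapping seeds, so that at most t edits leave at least one seed untouched. The code is a minimal illustration and not the CAS implementation. The helper names are hypothetical, the read is assumed to have at least t + 1 characters, and with insertions or deletions the reported start positions are only approximate.

```python
def split_into_seeds(read, k):
    """Split read into k non-overlapping seeds of near-equal length.
    Returns a list of (offset_in_read, seed) pairs."""
    base, extra = divmod(len(read), k)
    seeds, pos = [], 0
    for i in range(k):
        size = base + (i < extra)
        seeds.append((pos, read[pos:pos + size]))
        pos += size
    return seeds

def candidate_positions(reference, read, t):
    """Pigeonhole filter: collect read start positions implied by exact
    matches of any of the t + 1 seeds in the reference."""
    cands = set()
    for offset, seed in split_into_seeds(read, t + 1):
        start = reference.find(seed)
        while start != -1:
            cands.add(start - offset)
            start = reference.find(seed, start + 1)
    return cands

reference = "ACGTTGCAAGGCTTAACCGT"
read = "TGTAAGGCTTAA"  # reference[4:16] with one substitution (C -> T)
positions = candidate_positions(reference, read, t=1)
```

The first seed contains the substitution and finds no exact match, but the second seed still matches exactly and recovers the true start position 4.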

Cite as

Hongyi Xin, Mingfu Shao, and Carl Kingsford. Context-Aware Seeds for Read Mapping. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 15:1-15:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@InProceedings{xin_et_al:LIPIcs.WABI.2019.15,
  author =	{Xin, Hongyi and Shao, Mingfu and Kingsford, Carl},
  title =	{{Context-Aware Seeds for Read Mapping}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{15:1--15:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.15},
  URN =		{urn:nbn:de:0030-drops-110452},
  doi =		{10.4230/LIPIcs.WABI.2019.15},
  annote =	{Keywords: Read Mapping, Seed and Extend, Edit Distance, Suffix Trie}
}

5 Search Results for "Shao, Mingfu"

