Volume

LIPIcs, Volume 143

19th International Workshop on Algorithms in Bioinformatics (WABI 2019)



Event

WABI 2019, September 8-10, 2019, Niagara Falls, NY, USA

Editors

Katharina T. Huber
  • University of East Anglia, Norwich, UK
Dan Gusfield
  • University of California, Davis, California, USA

Publication Details

  • published at: 2019-09-03
  • Publisher: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
  • ISBN: 978-3-95977-123-8
  • DBLP: db/conf/wabi/wabi2019

Documents

Document
Complete Volume
LIPIcs, Volume 143, WABI'19, Complete Volume

Authors: Katharina T. Huber and Dan Gusfield


Abstract
LIPIcs, Volume 143, WABI'19, Complete Volume

Cite as

Katharina T. Huber and Dan Gusfield. LIPIcs, Volume 143, WABI'19, Complete Volume. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)


@Proceedings{huber_et_al:LIPIcs.WABI.2019,
  title =	{{LIPIcs, Volume 143, WABI'19, Complete Volume}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019},
  URN =		{urn:nbn:de:0030-drops-112994},
  doi =		{10.4230/LIPIcs.WABI.2019},
  annote =	{Keywords: Applied computing, Bioinformatics; Theory of computation, Design and analysis of algorithms; Mathematics of computing, Probabilistic inference problem}
}
Document
Front Matter
Front Matter, Table of Contents, Preface, Conference Organization

Authors: Katharina T. Huber and Dan Gusfield


Abstract
Front Matter, Table of Contents, Preface, Conference Organization

Cite as

Katharina T. Huber and Dan Gusfield. Front Matter, Table of Contents, Preface, Conference Organization. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 0:i-0:xii, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)


@InProceedings{huber_et_al:LIPIcs.WABI.2019.0,
  author =	{Huber, Katharina T. and Gusfield, Dan},
  title =	{{Front Matter, Table of Contents, Preface, Conference Organization}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{0:i--0:xii},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.0},
  URN =		{urn:nbn:de:0030-drops-110307},
  doi =		{10.4230/LIPIcs.WABI.2019.0},
  annote =	{Keywords: Front Matter, Table of Contents, Preface, Conference Organization}
}
Document
Building a Small and Informative Phylogenetic Supertree

Authors: Jesper Jansson, Konstantinos Mampentzidis, and Sandhya T. P.


Abstract
We combine two fundamental, previously studied optimization problems related to the construction of phylogenetic trees called maximum rooted triplets consistency (MAXRTC) and minimally resolved supertree (MINRS) into a new problem, which we call q-maximum rooted triplets consistency (q-MAXRTC). The input to our new problem is a set R of resolved triplets (rooted, binary phylogenetic trees with three leaves each) and the objective is to find a phylogenetic tree with exactly q internal nodes that contains the largest possible number of triplets from R. We first prove that q-MAXRTC is NP-hard even to approximate within a constant ratio for every fixed q >= 2, and then develop various polynomial-time approximation algorithms for different values of q. Next, we show experimentally that representing a phylogenetic tree by one having far fewer nodes typically does not destroy too much triplet branching information. As an extreme example, we show that allowing only nine internal nodes is still sufficient to capture on average 80% of the rooted triplets from some recently published trees, each having between 760 and 3081 internal nodes. Finally, to demonstrate the algorithmic advantage of using trees with few internal nodes, we propose a new algorithm for computing the rooted triplet distance between two phylogenetic trees over a leaf label set of size n that runs in O(q n) time, where q is the number of internal nodes in the smaller tree, and is therefore faster than the currently best algorithms for the problem (with O(n log n) time complexity [SODA 2013, ESA 2017]) whenever q = o(log n).
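
As a rough illustration of the objective described in this abstract (not the authors' algorithm), the sketch below counts how many resolved triplets from a set R are consistent with a given rooted tree. It assumes a hypothetical representation in which a tree is a nested tuple of leaf labels and a triplet ab|c is written as the tuple (a, b, c); the names clusters and count_consistent are illustrative only. It relies on the standard fact that ab|c is consistent with a rooted tree exactly when some node has a and b below it but not c.

```python
from typing import Iterable, List, Set, Tuple, Union

# Hypothetical representation: a leaf is a string, an internal node is a tuple of subtrees.
Tree = Union[str, tuple]
Triplet = Tuple[str, str, str]  # (a, b, c) stands for the resolved triplet ab|c


def clusters(tree: Tree) -> List[Set[str]]:
    """Return the set of leaf labels below every node of the tree."""
    found: List[Set[str]] = []

    def visit(node: Tree) -> Set[str]:
        if isinstance(node, str):
            leaves = {node}
        else:
            leaves = set()
            for child in node:
                leaves |= visit(child)
        found.append(leaves)
        return leaves

    visit(tree)
    return found


def count_consistent(tree: Tree, triplets: Iterable[Triplet]) -> int:
    """Count triplets ab|c for which some node has a and b below it but not c."""
    node_clusters = clusters(tree)
    return sum(
        any(a in c and b in c and x not in c for c in node_clusters)
        for a, b, x in triplets
    )


# A caterpillar tree with three internal nodes, and a star tree with one.
resolved = ((("a", "b"), "c"), "d")
star = ("a", "b", "c", "d")
R = [("a", "b", "c"), ("a", "c", "b"), ("a", "c", "d"), ("b", "d", "a")]
print(count_consistent(resolved, R))  # 2: ab|c and ac|d are consistent
print(count_consistent(star, R))      # 0: a star tree resolves no triplet
```

The brute-force check is quadratic in the worst case and is meant only to make the objective concrete; it says nothing about how a tree with exactly q internal nodes should be chosen.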

Cite as

Jesper Jansson, Konstantinos Mampentzidis, and Sandhya T. P. Building a Small and Informative Phylogenetic Supertree. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 1:1-1:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{jansson_et_al:LIPIcs.WABI.2019.1,
  author =	{Jansson, Jesper and Mampentzidis, Konstantinos and T. P., Sandhya},
  title =	{{Building a Small and Informative Phylogenetic Supertree}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{1:1--1:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.1},
  URN =		{urn:nbn:de:0030-drops-110316},
  doi =		{10.4230/LIPIcs.WABI.2019.1},
  annote =	{Keywords: phylogenetic tree, supertree, rooted triplet, approximation algorithm}
}
Document
Alignment- and Reference-Free Phylogenomics with Colored de Bruijn Graphs

Authors: Roland Wittler


Abstract
We present a new whole-genome based approach to infer large-scale phylogenies that is alignment- and reference-free. In contrast to other methods, it does not rely on pairwise comparisons to determine distances to infer edges in a tree. Instead, a colored de Bruijn graph is constructed, and information on common subsequences is extracted to infer phylogenetic splits. Application to different datasets confirms robustness of the approach. A comparison to other state-of-the-art whole-genome based methods indicates comparable or higher accuracy and efficiency.

Cite as

Roland Wittler. Alignment- and Reference-Free Phylogenomics with Colored de Bruijn Graphs. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 2:1-2:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{wittler:LIPIcs.WABI.2019.2,
  author =	{Wittler, Roland},
  title =	{{Alignment- and Reference-Free Phylogenomics with Colored de Bruijn Graphs}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{2:1--2:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.2},
  URN =		{urn:nbn:de:0030-drops-110325},
  doi =		{10.4230/LIPIcs.WABI.2019.2},
  annote =	{Keywords: Phylogenomics, phylogenetics, phylogenetic splits, colored de Bruijn graphs}
}
Document
Quantified Uncertainty of Flexible Protein-Protein Docking Algorithms

Authors: Nathan L. Clement


Abstract
The strength or weakness of an algorithm is ultimately governed by the confidence of its result. When the domain of the problem is large (e.g. traversal of a high-dimensional space), an exact solution often cannot be obtained, so approximations must be made. These approximations often lead to a reported quantity of interest (QOI) which varies between runs, decreasing the confidence of any single run. When the algorithm further computes this QOI based on uncertain or noisy data, the variability (or lack of confidence) of the QOI increases. Unbounded, these two sources of uncertainty (algorithmic approximations and uncertainty in input data) can result in a reported statistic that has low correlation with ground truth. In molecular biology applications, this is especially applicable, as the search space is generally large and observations are often noisy. This research applies uncertainty quantification techniques to the difficult protein-protein docking problem, where uncertainties arise from the explicit conversion from continuous to discrete space for protein representation (introducing some uncertainty in the input data), as well as discrete sampling of the conformations. It describes the variability that exists in existing software, and then provides a method for computing probabilistic certificates in the form of Chernoff-like bounds. Finally, this paper leverages these probabilistic certificates to accurately bound the uncertainty in docking from two docking algorithms, providing a QOI that is both robust and statistically meaningful.

Cite as

Nathan L. Clement. Quantified Uncertainty of Flexible Protein-Protein Docking Algorithms. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 3:1-3:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{clement:LIPIcs.WABI.2019.3,
  author =	{Clement, Nathan L.},
  title =	{{Quantified Uncertainty of Flexible Protein-Protein Docking Algorithms}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{3:1--3:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.3},
  URN =		{urn:nbn:de:0030-drops-110335},
  doi =		{10.4230/LIPIcs.WABI.2019.3},
  annote =	{Keywords: protein-protein docking, uncertainty quantification, protein flexibility, low-discrepancy sampling, high-dimensional sampling}
}
Document
TRACTION: Fast Non-Parametric Improvement of Estimated Gene Trees

Authors: Sarah Christensen, Erin K. Molloy, Pranjal Vachaspati, and Tandy Warnow


Abstract
Gene tree correction aims to improve the accuracy of a gene tree by using computational techniques along with a reference tree (and in some cases available sequence data). It is an active area of research when dealing with gene tree heterogeneity due to duplication and loss (GDL). Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to incomplete lineage sorting (ILS, a common problem in eukaryotic phylogenetics) and horizontal gene transfer (HGT, a common problem in bacterial phylogenetics). We introduce TRACTION, a simple polynomial time method that provably finds an optimal solution to the RF-Optimal Tree Refinement and Completion Problem, which seeks a refinement and completion of an input tree t with respect to a given binary tree T so as to minimize the Robinson-Foulds (RF) distance. We present the results of an extensive simulation study evaluating TRACTION within gene tree correction pipelines on 68,000 estimated gene trees, using estimated species trees as reference trees. We explore accuracy under conditions with varying levels of gene tree heterogeneity due to ILS and HGT. We show that TRACTION matches or improves the accuracy of well-established methods from the GDL literature under conditions with HGT and ILS, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. TRACTION is available at https://github.com/pranjalv123/TRACTION-RF and the study datasets are available at https://doi.org/10.13012/B2IDB-1747658_V1.
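
The Robinson-Foulds (RF) distance that this abstract minimizes can be sketched as follows. This is only an illustration of the distance itself, not of TRACTION's refinement and completion step, and it assumes a hypothetical representation in which a tree is a nested tuple of leaf labels and both trees share the same leaf set. The function names are illustrative. Each internal edge is treated as a bipartition of the leaves, and the RF distance is the number of bipartitions present in exactly one of the two trees.

```python
from typing import FrozenSet, Set, Union

# Hypothetical representation: a leaf is a string, an internal node is a tuple of subtrees.
Tree = Union[str, tuple]


def leaf_set(tree: Tree) -> FrozenSet[str]:
    """Collect all leaf labels of the tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree: Tree, taxa: FrozenSet[str]) -> Set[FrozenSet[str]]:
    """Return the non-trivial bipartitions, each stored as the side without a reference leaf."""
    reference = min(taxa)
    splits: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        side = taxa - leaves if reference in leaves else leaves
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(taxa) - 1:
            splits.add(side)
        return leaves

    visit(tree)
    return splits


def rf_distance(t1: Tree, t2: Tree) -> int:
    """Number of bipartitions found in exactly one of the two trees."""
    taxa = leaf_set(t1)
    if taxa != leaf_set(t2):
        raise ValueError("this sketch assumes both trees have the same leaf set")
    return len(bipartitions(t1, taxa) ^ bipartitions(t2, taxa))
```

Under these assumptions, rf_distance((("a", "b"), ("c", "d")), (("a", "c"), ("b", "d"))) evaluates to 2, and comparing the first tree with the unresolved star tree ("a", "b", "c", "d") gives 1, which is the kind of gap a refinement of the star tree could close.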

Cite as

Sarah Christensen, Erin K. Molloy, Pranjal Vachaspati, and Tandy Warnow. TRACTION: Fast Non-Parametric Improvement of Estimated Gene Trees. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 4:1-4:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{christensen_et_al:LIPIcs.WABI.2019.4,
  author =	{Christensen, Sarah and Molloy, Erin K. and Vachaspati, Pranjal and Warnow, Tandy},
  title =	{{TRACTION: Fast Non-Parametric Improvement of Estimated Gene Trees}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{4:1--4:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.4},
  URN =		{urn:nbn:de:0030-drops-110347},
  doi =		{10.4230/LIPIcs.WABI.2019.4},
  annote =	{Keywords: Gene tree correction, horizontal gene transfer, incomplete lineage sorting}
}
Document
Better Practical Algorithms for rSPR Distance and Hybridization Number

Authors: Kohei Yamada, Zhi-Zhong Chen, and Lusheng Wang


Abstract
The problem of computing the rSPR distance of two phylogenetic trees (denoted by RDC) is NP-hard and so is the problem of computing the hybridization number of two phylogenetic trees (denoted by HNC). Since they are important problems in phylogenetics, they have been studied extensively in the literature. Indeed, quite a number of exact or approximation algorithms have been designed and implemented for them. In this paper, we design and implement one exact algorithm for HNC and several approximation algorithms for RDC and HNC. Our experimental results show that the resulting exact program is much faster (namely, more than 80 times faster for the easiest dataset used in the experiments) than the previous best and its superiority in speed becomes even more significant for more difficult instances. Moreover, the resulting approximation programs output much better results than the previous bests; indeed, the outputs are always nearly optimal and often optimal. Of particular interest is the usage of the Monte Carlo tree search (MCTS) method in the design of our approximation algorithms. Our experimental results show that with MCTS, we can often solve HNC exactly within a short time.

Cite as

Kohei Yamada, Zhi-Zhong Chen, and Lusheng Wang. Better Practical Algorithms for rSPR Distance and Hybridization Number. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 5:1-5:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{yamada_et_al:LIPIcs.WABI.2019.5,
  author =	{Yamada, Kohei and Chen, Zhi-Zhong and Wang, Lusheng},
  title =	{{Better Practical Algorithms for rSPR Distance and Hybridization Number}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{5:1--5:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.5},
  URN =		{urn:nbn:de:0030-drops-110355},
  doi =		{10.4230/LIPIcs.WABI.2019.5},
  annote =	{Keywords: phylogenetic tree, fixed-parameter algorithms, approximation algorithms, Monte Carlo tree search}
}
Document
pClay: A Precise Parallel Algorithm for Comparing Molecular Surfaces

Authors: Georgi D. Georgiev, Kevin F. Dodd, and Brian Y. Chen


Abstract
Comparing binding sites as geometric solids can reveal conserved features of protein structure that bind similar molecular fragments and varying features that select different partners. Due to the subtlety of these features, algorithmic efficiency and geometric precision are essential for comparison accuracy. For these reasons, this paper presents pClay, the first structure comparison algorithm to employ fine-grained parallelism to enhance both throughput and efficiency. We evaluated the parallel performance of pClay on both multicore workstation CPUs and a 61-core Xeon Phi, observing scalable speedup in many thread configurations. Parallelism unlocked levels of precision that were not practical with existing methods. This precision has important applications, which we demonstrate: A statistical model of steric variations in binding cavities, trained with data at the level of precision typical of existing work, can overlook 46% of authentic steric influences on specificity (p <= .02). The same model, trained with more precise data from pClay, overlooked 0% using the same standard of statistical significance. These results demonstrate how enhanced efficiency and precision can advance the detection of binding mechanisms that influence specificity.

Cite as

Georgi D. Georgiev, Kevin F. Dodd, and Brian Y. Chen. pClay: A Precise Parallel Algorithm for Comparing Molecular Surfaces. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 6:1-6:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{georgiev_et_al:LIPIcs.WABI.2019.6,
  author =	{Georgiev, Georgi D. and Dodd, Kevin F. and Chen, Brian Y.},
  title =	{{pClay: A Precise Parallel Algorithm for Comparing Molecular Surfaces}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{6:1--6:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.6},
  URN =		{urn:nbn:de:0030-drops-110365},
  doi =		{10.4230/LIPIcs.WABI.2019.6},
  annote =	{Keywords: Specificity Annotation, Structure Comparison, Cavity Analysis}
}
Document
Read Mapping on Genome Variation Graphs

Authors: Kavya Vaddadi, Rajgopal Srinivasan, and Naveen Sivadasan


Abstract
Genome variation graphs are natural candidates to represent a pangenome collection. In such graphs, common subsequences are encoded as vertices and the genomic variations are captured by introducing additional labeled vertices and directed edges. Unlike a linear reference, a reference graph allows a rich representation of the genomic diversities and avoids reference bias. We address the fundamental problem of mapping reads to genome variation graphs. We give a novel mapping algorithm, V-MAP, for efficient identification of a small subgraph of the genome graph for optimal gapped alignment of the read. V-MAP creates a space-efficient index using locality-sensitive minimizer signatures computed using a novel graph winnowing and graph embedding onto a metric space for fast and accurate mapping. Experiments involving a graph constructed from the 1000 Genomes data and using both real and simulated reads show that V-MAP is fast, memory efficient, and can map short reads as well as PacBio/Nanopore long reads with high accuracy. V-MAP performance was significantly better than the state-of-the-art, especially for long reads.

Cite as

Kavya Vaddadi, Rajgopal Srinivasan, and Naveen Sivadasan. Read Mapping on Genome Variation Graphs. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 7:1-7:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{vaddadi_et_al:LIPIcs.WABI.2019.7,
  author =	{Vaddadi, Kavya and Srinivasan, Rajgopal and Sivadasan, Naveen},
  title =	{{Read Mapping on Genome Variation Graphs}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{7:1--7:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.7},
  URN =		{urn:nbn:de:0030-drops-110375},
  doi =		{10.4230/LIPIcs.WABI.2019.7},
  annote =	{Keywords: read mapping, pangenome, genome variation graphs, locality sensitive hashing}
}
Document
Finding All Maximal Perfect Haplotype Blocks in Linear Time

Authors: Jarno Alanko, Hideo Bannai, Bastien Cazaux, Pierre Peterlongo, and Jens Stoye


Abstract
Recent large-scale community sequencing efforts allow the identification, at an unprecedented level of detail, of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals' haplotype data, however, require excessive computing times and therefore are not applicable to current datasets. In 2019, Cunha et al. (Proceedings of BSB 2019) suggested the maximal perfect haplotype block as a very simple combinatorial pattern, forming the basis of a new method to perform rapid genome-wide selection scans. The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length. It was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound: a conceptually very simple one using suffix trees, and a second one using the positional Burrows-Wheeler Transform that is also very efficient in practice.

Cite as

Jarno Alanko, Hideo Bannai, Bastien Cazaux, Pierre Peterlongo, and Jens Stoye. Finding All Maximal Perfect Haplotype Blocks in Linear Time. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 8:1-8:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{alanko_et_al:LIPIcs.WABI.2019.8,
  author =	{Alanko, Jarno and Bannai, Hideo and Cazaux, Bastien and Peterlongo, Pierre and Stoye, Jens},
  title =	{{Finding All Maximal Perfect Haplotype Blocks in Linear Time}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{8:1--8:9},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.8},
  URN =		{urn:nbn:de:0030-drops-110388},
  doi =		{10.4230/LIPIcs.WABI.2019.8},
  annote =	{Keywords: Population genomics, selection coefficient, haplotype block, positional Burrows-Wheeler Transform}
}
Document
A New Paradigm for Identifying Reconciliation-Scenario Altering Mutations Conferring Environmental Adaptation

Authors: Roni Zoller, Meirav Zehavi, and Michal Ziv-Ukelson


Abstract
An important goal in microbial computational genomics is to identify crucial events in the evolution of a gene that severely alter the duplication, loss and mobilization patterns of the gene within the genomes in which it disseminates. In this paper, we formalize this microbiological goal as a new pattern-matching problem in the domain of Gene tree and Species tree reconciliation, denoted "Reconciliation-Scenario Altering Mutation (RSAM) Discovery". We propose an O(m * n * k) time algorithm to solve this new problem, where m and n are the number of vertices of the input Gene tree and Species tree, respectively, and k is a user-specified parameter that bounds from above the number of optimal solutions of interest. The algorithm first constructs a hypergraph representing the k highest scoring reconciliation scenarios between the given Gene tree and Species tree, and then interrogates this hypergraph for subtrees matching a pre-specified RSAM Pattern. Our algorithm is optimal in the sense that the number of hypernodes in the hypergraph can be lower bounded by Omega(m * n * k). We implement the new algorithm as a tool, denoted RSAM-finder, and demonstrate its application to the identification of RSAMs in toxins and drug resistance elements across a dataset spanning hundreds of species.

Cite as

Roni Zoller, Meirav Zehavi, and Michal Ziv-Ukelson. A New Paradigm for Identifying Reconciliation-Scenario Altering Mutations Conferring Environmental Adaptation. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 9:1-9:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{zoller_et_al:LIPIcs.WABI.2019.9,
  author =	{Zoller, Roni and Zehavi, Meirav and Ziv-Ukelson, Michal},
  title =	{{A New Paradigm for Identifying Reconciliation-Scenario Altering Mutations Conferring Environmental Adaptation}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{9:1--9:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.9},
  URN =		{urn:nbn:de:0030-drops-110398},
  doi =		{10.4230/LIPIcs.WABI.2019.9},
  annote =	{Keywords: Gene tree, Species tree, Reconciliation}
}
Document
Jointly Embedding Multiple Single-Cell Omics Measurements

Authors: Jie Liu, Yuanhao Huang, Ritambhara Singh, Jean-Philippe Vert, and William Stafford Noble


Abstract
Many single-cell sequencing technologies are now available, but it is still difficult to apply multiple sequencing technologies to the same single cell. In this paper, we propose an unsupervised manifold alignment algorithm, MMD-MA, for integrating multiple measurements carried out on disjoint aliquots of a given population of cells. Effectively, MMD-MA performs an in silico co-assay by embedding cells measured in different ways into a learned latent space. In the MMD-MA algorithm, single-cell data points from multiple domains are aligned by optimizing an objective function with three components: (1) a maximum mean discrepancy (MMD) term to encourage the differently measured points to have similar distributions in the latent space, (2) a distortion term to preserve the structure of the data between the input space and the latent space, and (3) a penalty term to avoid collapse to a trivial solution. Notably, MMD-MA does not require any correspondence information across data modalities, either between the cells or between the features. Furthermore, MMD-MA’s weak distributional requirements for the domains to be aligned allow the algorithm to integrate heterogeneous types of single cell measures, such as gene expression, DNA accessibility, chromatin organization, methylation, and imaging data. We demonstrate the utility of MMD-MA in simulation experiments and using a real data set involving single-cell gene expression and methylation data.
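
To make the first component of the objective in this abstract more concrete, here is a minimal sketch of the standard biased estimate of squared maximum mean discrepancy (MMD) between two point sets. It assumes a Gaussian kernel with a hypothetical bandwidth parameter; the abstract does not state which kernel or estimator MMD-MA uses, and the distortion and penalty terms are not shown. All function names are illustrative.

```python
import math
from typing import Sequence

Point = Sequence[float]


def gaussian_kernel(x: Point, y: Point, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two points; the bandwidth is an assumed parameter."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[Point], ys: Sequence[Point], bandwidth: float) -> float:
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs: Sequence[Point], ys: Sequence[Point], bandwidth: float = 1.0) -> float:
    """Biased estimate of squared MMD: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )
```

With this estimate, two identical point sets give a value of zero, and the value grows as the two embedded point clouds move apart, which is why a term of this kind encourages differently measured cells to share a distribution in the latent space.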

Cite as

Jie Liu, Yuanhao Huang, Ritambhara Singh, Jean-Philippe Vert, and William Stafford Noble. Jointly Embedding Multiple Single-Cell Omics Measurements. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 10:1-10:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{liu_et_al:LIPIcs.WABI.2019.10,
  author =	{Liu, Jie and Huang, Yuanhao and Singh, Ritambhara and Vert, Jean-Philippe and Noble, William Stafford},
  title =	{{Jointly Embedding Multiple Single-Cell Omics Measurements}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{10:1--10:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.10},
  URN =		{urn:nbn:de:0030-drops-110401},
  doi =		{10.4230/LIPIcs.WABI.2019.10},
  annote =	{Keywords: Manifold alignment, single-cell sequencing}
}
Document
Inferring Diploid 3D Chromatin Structures from Hi-C Data

Authors: Alexandra Gesine Cauer, Gürkan Yardımcı, Jean-Philippe Vert, Nelle Varoquaux, and William Stafford Noble


Abstract
The 3D organization of the genome plays a key role in many cellular processes, such as gene regulation, differentiation, and replication. Assays like Hi-C measure DNA-DNA contacts in a high-throughput fashion, and inferring accurate 3D models of chromosomes can yield insights hidden in the raw data. For example, structural inference can account for noise in the data, disambiguate the distinct structures of homologous chromosomes, orient genomic regions relative to nuclear landmarks, and serve as a framework for integrating other data types. Although many methods exist to infer the 3D structure of haploid genomes, inferring a diploid structure from Hi-C data is still an open problem. Indeed, the diploid case is very challenging, because Hi-C data typically does not distinguish between homologous chromosomes. We propose a method to infer 3D diploid genomes from Hi-C data. We demonstrate the accuracy of the method on simulated data, and we also use the method to infer 3D structures for mouse chromosome X, confirming that the inactive homolog exhibits a bipartite structure, whereas the active homolog does not.

Cite as

Alexandra Gesine Cauer, Gürkan Yardımcı, Jean-Philippe Vert, Nelle Varoquaux, and William Stafford Noble. Inferring Diploid 3D Chromatin Structures from Hi-C Data. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 11:1-11:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{cauer_et_al:LIPIcs.WABI.2019.11,
  author =	{Cauer, Alexandra Gesine and Yard{\i}mc{\i}, G\"{u}rkan and Vert, Jean-Philippe and Varoquaux, Nelle and Noble, William Stafford},
  title =	{{Inferring Diploid 3D Chromatin Structures from Hi-C Data}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{11:1--11:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.11},
  URN =		{urn:nbn:de:0030-drops-110418},
  doi =		{10.4230/LIPIcs.WABI.2019.11},
  annote =	{Keywords: Genome 3D architecture, chromatin structure, Hi-C, 3D modeling}
}
Document
Consensus Clusters in Robinson-Foulds Reticulation Networks

Authors: Alexey Markin and Oliver Eulenstein


Abstract
Inference of phylogenetic networks - the evolutionary histories of species involving speciation as well as reticulation events - has proved to be an extremely challenging problem even for smaller datasets easily tackled by supertree inference methods. An effective way to boost the scalability of distance-based supertree methods originates from the Pareto (for clusters) property, which is a highly desirable property for phylogenetic consensus methods. In particular, one can employ strict consensus merger algorithms to boost the scalability and accuracy of supertree methods satisfying Pareto; cf. SuperFine. In this work, we establish a Pareto-like property for phylogenetic networks. Then we consider the recently introduced RF-Net method that heuristically solves the so-called RF-Network problem and which was demonstrated to be an efficient and effective tool for the inference of hybridization and reassortment networks. As our main result, we provide a constructive proof (entailing an explicit refinement algorithm) that the Pareto property applies to the RF-Network problem when the solution space is restricted to the popular class of tree-child networks. This result implies that strict consensus merger strategies, similar to SuperFine, can be directly applied to boost both accuracy and scalability of RF-Net significantly. Finally, we further investigate the optimum solutions to the RF-Network problem; in particular, we describe structural properties of all optimum (tree-child) RF-networks in relation to strict consensus clusters of the input trees.

Cite as

Alexey Markin and Oliver Eulenstein. Consensus Clusters in Robinson-Foulds Reticulation Networks. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 12:1-12:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{markin_et_al:LIPIcs.WABI.2019.12,
  author =	{Markin, Alexey and Eulenstein, Oliver},
  title =	{{Consensus Clusters in Robinson-Foulds Reticulation Networks}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{12:1--12:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.12},
  URN =		{urn:nbn:de:0030-drops-110420},
  doi =		{10.4230/LIPIcs.WABI.2019.12},
  annote =	{Keywords: Phylogenetics, phylogenetic tree, phylogenetic network, reticulation network, Robinson-Foulds, Pareto, RF-Net}
}
Document
Weighted Minimum-Length Rearrangement Scenarios

Authors: Pijus Simonaitis, Annie Chateau, and Krister M. Swenson


Abstract
We present the first known model of genome rearrangement with an arbitrary real-valued weight function on the rearrangements. It is based on the dominant model for the mathematical and algorithmic study of genome rearrangement, Double Cut and Join (DCJ). Our objective function is the sum or product of the weights of the DCJs in an evolutionary scenario, and the function can be minimized or maximized. If the likelihood of observing an independent DCJ was estimated based on biological conditions, for example, then this objective function could be the likelihood of observing the independent DCJs together in a scenario. We present an O(n⁴)-time dynamic programming algorithm solving the Minimum Cost Parsimonious Scenario (MCPS) problem for co-tailed genomes with n genes (or syntenic blocks). Combining this with our previous work on MCPS yields a polynomial-time algorithm for general genomes. The key theoretical contribution is a novel link between the parsimonious DCJ (or 2-break) scenarios and quadrangulations of a regular polygon. To demonstrate that our algorithm is fast enough to treat biological data, we run it on syntenic blocks constructed for Human paired with Chimpanzee, Gibbon, Mouse, and Chicken. We argue that the Human and Gibbon pair is a particularly interesting model for the study of weighted genome rearrangements.

Cite as

Pijus Simonaitis, Annie Chateau, and Krister M. Swenson. Weighted Minimum-Length Rearrangement Scenarios. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 13:1-13:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{simonaitis_et_al:LIPIcs.WABI.2019.13,
  author =	{Simonaitis, Pijus and Chateau, Annie and Swenson, Krister M.},
  title =	{{Weighted Minimum-Length Rearrangement Scenarios}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{13:1--13:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.13},
  URN =		{urn:nbn:de:0030-drops-110436},
  doi =		{10.4230/LIPIcs.WABI.2019.13},
  annote =	{Keywords: Weighted genome rearrangement, Double cut and join (DCJ), Edge switch, Minimum-weight quadrangulation}
}
Document
Fast and Accurate Structure Probability Estimation for Simultaneous Alignment and Folding of RNAs

Authors: Milad Miladi, Martin Raden, Sebastian Will, and Rolf Backofen


Abstract
Motivation: Simultaneous alignment and folding (SA&F) of RNAs is the indispensable gold standard for inferring the structure of non-coding RNAs and their general analysis. The original algorithm, proposed by Sankoff, solves the theoretical problem exactly with a complexity of O(n^6) in the full energy model. Over the last two decades, several variants and improvements of the Sankoff algorithm have been proposed to reduce its extreme complexity by proposing simplified energy models or imposing restrictions on the predicted alignments. Results: Here we introduce a novel variant of Sankoff’s algorithm that reconciles the simplifications of PMcomp, namely moving from the full energy model to a simpler base pair-based model, with the accuracy of the loop-based full energy model. Instead of estimating pseudo-energies from unconditional base pair probabilities, our model calculates energies from conditional base pair probabilities that accurately capture structure probabilities, which obey a conditional dependency. Supporting modifications with surgical precision, this model gives rise to the fast and highly accurate novel algorithm Pankov (Probabilistic Sankoff-like simultaneous alignment and folding of RNAs inspired by Markov chains). Pankov benefits from the speed-up of excluding unreliable base-pairing without compromising the loop-based free energy model of Sankoff’s algorithm. We show that Pankov outperforms its predecessors LocARNA and SPARSE in folding quality and is faster than LocARNA. Pankov is developed as a branch of the LocARNA package and available at https://github.com/mmiladi/Pankov.

Cite as

Milad Miladi, Martin Raden, Sebastian Will, and Rolf Backofen. Fast and Accurate Structure Probability Estimation for Simultaneous Alignment and Folding of RNAs. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 14:1-14:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{miladi_et_al:LIPIcs.WABI.2019.14,
  author =	{Miladi, Milad and Raden, Martin and Will, Sebastian and Backofen, Rolf},
  title =	{{Fast and Accurate Structure Probability Estimation for Simultaneous Alignment and Folding of RNAs}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{14:1--14:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.14},
  URN =		{urn:nbn:de:0030-drops-110446},
  doi =		{10.4230/LIPIcs.WABI.2019.14},
  annote =	{Keywords: RNA secondary structure, Structural bioinformatics, Alignment, Algorithms}
}
Document
Context-Aware Seeds for Read Mapping

Authors: Hongyi Xin, Mingfu Shao, and Carl Kingsford


Abstract
Motivation: Most modern seed-and-extend NGS read mappers employ a seeding scheme that requires extracting t non-overlapping seeds in each read in order to find all valid mappings under an edit distance threshold of t. As t grows (such as in long reads with high error rate), this seeding scheme forces mappers to use more and shorter seeds, which increases the seed hits (seed frequencies) and therefore reduces the efficiency of mappers. Results: We propose a novel seeding framework, context-aware seeds (CAS). CAS guarantees finding all valid mappings but uses fewer (and longer) seeds, which reduces seed frequencies and increases efficiency of mappers. CAS achieves this improvement by attaching a confidence radius to each seed in the reference. We prove that all valid mappings can be found if the sum of confidence radii of seeds is greater than t. CAS generalizes the existing pigeonhole-principle-based seeding scheme in which this confidence radius is implicitly always 1. Moreover, we design an efficient algorithm that constructs the confidence radius database in linear time. We evaluate CAS on the E. coli genome and show that CAS reduces seed frequencies by up to 20.3% when compared with the state-of-the-art pigeonhole-principle-based seeding algorithm, the Optimal Seed Solver. Availability: https://github.com/Kingsford-Group/CAS_code
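
As a rough illustration of the pigeonhole-principle-based baseline that the abstract refers to (not the CAS method itself), the sketch below assumes the common convention of cutting a read into t + 1 non-overlapping seeds, so that at most t edits must leave at least one seed intact. The function names `split_into_seeds` and `candidate_positions` are hypothetical and are not taken from the CAS code base.

```python
def split_into_seeds(read, num_seeds):
    """Cut `read` into `num_seeds` non-overlapping, near-equal pieces.

    Returns a list of (offset_in_read, seed_string) pairs.
    """
    base, extra = divmod(len(read), num_seeds)
    seeds = []
    start = 0
    for i in range(num_seeds):
        length = base + (1 if i < extra else 0)
        seeds.append((start, read[start:start + length]))
        start += length
    return seeds


def candidate_positions(reference, read, t):
    """Collect candidate mapping positions from exact seed hits.

    Assumption for this sketch: t + 1 seeds are used for an error
    threshold of t, and len(read) >= t + 1 so no seed is empty.
    """
    candidates = set()
    for offset, seed in split_into_seeds(read, t + 1):
        if not seed:
            continue
        hit = reference.find(seed)
        while hit != -1:
            # An exact hit of a seed implies a read start at hit - offset
            # (exactly so when the errors are substitutions only).
            candidates.add(hit - offset)
            hit = reference.find(seed, hit + 1)
    return candidates
```

With substitutions only, the true mapping position is always among the candidates. With insertions or deletions an intact seed can be shifted by up to t positions, so a real mapper would verify a window around each candidate.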

Cite as

Hongyi Xin, Mingfu Shao, and Carl Kingsford. Context-Aware Seeds for Read Mapping. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 15:1-15:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{xin_et_al:LIPIcs.WABI.2019.15,
  author =	{Xin, Hongyi and Shao, Mingfu and Kingsford, Carl},
  title =	{{Context-Aware Seeds for Read Mapping}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{15:1--15:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.15},
  URN =		{urn:nbn:de:0030-drops-110452},
  doi =		{10.4230/LIPIcs.WABI.2019.15},
  annote =	{Keywords: Read Mapping, Seed and Extend, Edit Distance, Suffix Trie}
}
Document
Bounded-Length Smith-Waterman Alignment

Authors: Alexander Tiskin


Abstract
Given a fixed alignment scoring scheme, the bounded length (respectively, bounded total length) Smith-Waterman alignment problem on a pair of strings of lengths m, n, asks for the maximum alignment score across all substring pairs, such that the first substring’s length (respectively, the sum of the two substrings' lengths) is above the given threshold w. The latter problem was introduced by Arslan and Eğecioğlu under the name "local alignment with length threshold". They proposed a dynamic programming algorithm solving the problem in time O(mn^2), and also an approximation algorithm running in time O(rmn), where r is a parameter controlling the accuracy of approximation. We show that both these problems can be solved exactly in time O(mn), assuming a rational scoring scheme; furthermore, this solution can be used to obtain an exact algorithm for the normalised bounded total length Smith-Waterman alignment problem, running in time O(mn log n). Our algorithms rely on the techniques of fast window-substring alignment and implicit unit-Monge matrix searching, developed previously by the author and others.

Cite as

Alexander Tiskin. Bounded-Length Smith-Waterman Alignment. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 16:1-16:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{tiskin:LIPIcs.WABI.2019.16,
  author =	{Tiskin, Alexander},
  title =	{{Bounded-Length Smith-Waterman Alignment}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{16:1--16:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.16},
  URN =		{urn:nbn:de:0030-drops-110461},
  doi =		{10.4230/LIPIcs.WABI.2019.16},
  annote =	{Keywords: sequence alignment, local alignment, Smith-Waterman alignment, matrix searching}
}
Document
Validating Paired-End Read Alignments in Sequence Graphs

Authors: Chirag Jain, Haowen Zhang, Alexander Dilthey, and Srinivas Aluru


Abstract
Graph-based non-linear reference structures such as variation graphs and colored de Bruijn graphs enable incorporation of full genomic diversity within a population. However, transitioning from a simple string-based reference to graphs requires addressing many computational challenges, one of which concerns accurately mapping sequencing read sets to graphs. Paired-end Illumina sequencing is a commonly used sequencing platform in genomics, where the paired-end distance constraints allow disambiguation of repeats. Many recent works have explored provably good index-based and alignment-based strategies for mapping individual reads to graphs. However, validating distance constraints efficiently over graphs is not trivial, and existing sequence-to-graph mappers rely on heuristics. We introduce a mathematical formulation of the problem, and provide a new algorithm to solve it exactly. We take advantage of the high sparsity of reference graphs, and use sparse matrix-matrix multiplications (SpGEMM) to build an index which can be queried efficiently by a mapping algorithm for validating the distance constraints. Effectiveness of the algorithm is demonstrated using real reference graphs, including a human MHC variation graph, and a pan-genome de Bruijn graph built using genomes of 20 B. anthracis strains. While the one-time indexing time can vary from a few minutes to a few hours using our algorithm, answering a million distance queries takes less than a second.

Cite as

Chirag Jain, Haowen Zhang, Alexander Dilthey, and Srinivas Aluru. Validating Paired-End Read Alignments in Sequence Graphs. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 17:1-17:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{jain_et_al:LIPIcs.WABI.2019.17,
  author =	{Jain, Chirag and Zhang, Haowen and Dilthey, Alexander and Aluru, Srinivas},
  title =	{{Validating Paired-End Read Alignments in Sequence Graphs}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{17:1--17:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.17},
  URN =		{urn:nbn:de:0030-drops-110470},
  doi =		{10.4230/LIPIcs.WABI.2019.17},
  annote =	{Keywords: Sequence graphs, read mapping, index, sparse matrix-matrix multiplication}
}
Document
Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem

Authors: Yutong Qiu, Cong Ma, Han Xie, and Carl Kingsford


Abstract
Transcriptomic structural variants (TSVs) - large-scale transcriptome sequence changes due to structural variation - are common, especially in cancer. Detecting TSVs is a challenging computational problem. Sample heterogeneity (including differences between alleles in diploid organisms) is a critical confounding factor when identifying TSVs. To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangements Problem (MCAP), which seeks k genome rearrangements to maximize the number of reads that are concordant with at least one rearrangement. This directly models the situation of a heterogeneous or diploid sample. We prove that MCAP is NP-hard and provide a 1/4-approximation algorithm for k=1 and a 3/4-approximation algorithm for the diploid case (k=2) assuming an oracle for k=1. Combining these, we obtain a 3/16-approximation algorithm for MCAP when k=2 (without an oracle). We also present an integer linear programming formulation for general k. We characterize the graph structures that require k>1 to satisfy all edges and show such structures are prevalent in cancer samples. We evaluate our algorithms on 381 TCGA samples and 2 cancer cell lines and show improved performance compared to the state-of-the-art TSV-calling tool, SQUID.

Cite as

Yutong Qiu, Cong Ma, Han Xie, and Carl Kingsford. Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 18:1-18:5, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{qiu_et_al:LIPIcs.WABI.2019.18,
  author =	{Qiu, Yutong and Ma, Cong and Xie, Han and Kingsford, Carl},
  title =	{{Detecting Transcriptomic Structural Variants in Heterogeneous Contexts via the Multiple Compatible Arrangements Problem}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{18:1--18:5},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.18},
  URN =		{urn:nbn:de:0030-drops-110483},
  doi =		{10.4230/LIPIcs.WABI.2019.18},
  annote =	{Keywords: transcriptomic structural variation, integer linear programming, heterogeneity}
}
Document
Faster Pan-Genome Construction for Efficient Differentiation of Naturally Occurring and Engineered Plasmids with Plaster

Authors: Qi Wang, R. A. Leo Elworth, Tian Rui Liu, and Todd J. Treangen


Abstract
As sequence databases grow, characterizing diversity across extremely large collections of genomes requires the development of efficient methods that avoid costly all-vs-all comparisons [Marschall et al., 2018]. In addition to exponential increases in the number of natural genomes being sequenced, improved techniques for the creation of human-engineered sequences are ushering in a new wave of synthetic genome sequence databases that grow alongside naturally occurring genome databases. In this paper, we analyze the full diversity of available sequenced natural and synthetic plasmid genome sequences. This diversity can be represented by a data structure that captures all presently available nucleotide sequences, known as a pan-genome. In our case, we construct a single linear pan-genome nucleotide sequence that captures this diversity. To process such a large number of sequences, we introduce the plaster algorithmic pipeline. Using plaster we are able to construct the full synthetic plasmid pan-genome from 51,047 synthetic plasmid sequences as well as a natural pan-genome from 6,642 natural plasmid sequences. We demonstrate the efficacy of plaster by comparing its speed against another pan-genome construction method as well as demonstrating that nearly all plasmids align well to their corresponding pan-genome. Finally, we explore the use of pan-genome sequence alignment to distinguish between naturally occurring and synthetic plasmids. We believe this approach will lead to new techniques for rapid characterization of engineered plasmids. Applications for this work include detection of genome editing, tracking an unknown plasmid back to its lab of origin, and identifying naturally occurring sequences that may be of use to the synthetic biology community. The source code for fully reconstructing the natural and synthetic plasmid pan-genomes as well as for plaster is publicly available and can be downloaded at https://gitlab.com/qiwangrice/plaster.git.

Cite as

Qi Wang, R. A. Leo Elworth, Tian Rui Liu, and Todd J. Treangen. Faster Pan-Genome Construction for Efficient Differentiation of Naturally Occurring and Engineered Plasmids with Plaster. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 19:1-19:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{wang_et_al:LIPIcs.WABI.2019.19,
  author =	{Wang, Qi and Elworth, R. A. Leo and Liu, Tian Rui and Treangen, Todd J.},
  title =	{{Faster Pan-Genome Construction for Efficient Differentiation of Naturally Occurring and Engineered Plasmids with Plaster}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{19:1--19:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.19},
  URN =		{urn:nbn:de:0030-drops-110492},
  doi =		{10.4230/LIPIcs.WABI.2019.19},
  annote =	{Keywords: comparative genomics, sequence alignment, pan-genome, engineered plasmids}
}
Document
Rapidly Computing the Phylogenetic Transfer Index

Authors: Jakub Truszkowski, Olivier Gascuel, and Krister M. Swenson


Abstract
Given trees T and T_o on the same taxon set, the transfer index phi(b,T_o) is the number of taxa that need to be ignored so that the bipartition induced by branch b in T is equal to some bipartition in T_o. Recently, Lemoine et al. [Lemoine et al., 2018] used the transfer index to design a novel bootstrap analysis technique that improves on Felsenstein’s bootstrap on large, noisy data sets. In this work, we propose an algorithm that computes the transfer index for all branches b in T in O(n log^3 n) time, which improves upon the current O(n^2)-time algorithm by Lin, Rajan and Moret [Lin et al., 2012]. Our implementation is able to process pairs of trees with hundreds of thousands of taxa in minutes and considerably speeds up the method of Lemoine et al. on large data sets. We believe our algorithm can be useful for comparing large phylogenies, especially when some taxa are misplaced (e.g. due to horizontal gene transfer, recombination, or reconstruction errors).
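
To make the definition above concrete, here is a naive sketch of the transfer index. It assumes each bipartition is supplied as the set of taxa on one side of its branch, and it uses the observation that two bipartitions A|B and A'|B' become equal after ignoring either the taxa in the symmetric difference of A and A', or the remaining taxa, whichever set is smaller. The names `transfer_distance` and `transfer_index` are hypothetical, and this brute-force comparison is only an illustration, not the O(n log^3 n) algorithm proposed in the paper.

```python
def transfer_distance(side_a, side_b, taxa):
    """Number of taxa to ignore so that two bipartitions coincide.

    `side_a` and `side_b` are each one side of a bipartition of `taxa`.
    """
    diff = len(side_a ^ side_b)        # match side_a with side_b
    return min(diff, len(taxa) - diff)  # or with the complement of side_b


def transfer_index(branch_side, other_tree_sides, taxa):
    """Minimum transfer distance from one branch of T to any branch of T_o.

    `other_tree_sides` lists one side of every bipartition of T_o
    (assumed to be precomputed by the caller).
    """
    return min(
        transfer_distance(branch_side, side, taxa)
        for side in other_tree_sides
    )
```

For example, with taxa {a, b, c, d, e, f}, the branch abc|def is at distance 1 from ab|cdef, since ignoring taxon c makes the two bipartitions identical.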

Cite as

Jakub Truszkowski, Olivier Gascuel, and Krister M. Swenson. Rapidly Computing the Phylogenetic Transfer Index. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 20:1-20:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{truszkowski_et_al:LIPIcs.WABI.2019.20,
  author =	{Truszkowski, Jakub and Gascuel, Olivier and Swenson, Krister M.},
  title =	{{Rapidly Computing the Phylogenetic Transfer Index}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{20:1--20:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.20},
  URN =		{urn:nbn:de:0030-drops-110505},
  doi =		{10.4230/LIPIcs.WABI.2019.20},
  annote =	{Keywords: large phylogenies, bootstrap analysis, tree comparison, data structures on trees}
}
Document
Empirical Performance of Tree-Based Inference of Phylogenetic Networks

Authors: Zhen Cao, Jiafan Zhu, and Luay Nakhleh


Abstract
Phylogenetic networks extend the phylogenetic tree structure and allow for modeling vertical and horizontal evolution in a single framework. Statistical inference of phylogenetic networks is prohibitive and currently limited to small networks. An approach that could significantly improve phylogenetic network space exploration is based on first inferring an evolutionary tree of the species under consideration, and then augmenting the tree into a network by adding a set of "horizontal" edges to better fit the data. In this paper, we study the performance of such an approach on networks generated under a birth-hybridization model and explore its feasibility as an alternative to approaches that search the phylogenetic network space directly (without relying on a fixed underlying tree). We find that the concatenation method does poorly at obtaining a "backbone" tree that could be augmented into the correct network, whereas the popular species tree inference method ASTRAL does significantly better at such a task. We then evaluated the tree-to-network augmentation phase under the minimizing deep coalescence and pseudo-likelihood criteria. We find that even though this is a much faster approach than the direct search of the network space, the accuracy is much poorer, even when the backbone tree is a good starting tree. Our results show that tree-based inference of phylogenetic networks could yield very poor results. As exploration of the network space directly in search of maximum likelihood estimates or a representative sample of the posterior is very expensive, significant improvements to the computational complexity of phylogenetic network inference are imperative if analyses of large data sets are to be performed. We show that a recently developed divide-and-conquer approach significantly outperforms tree-based inference in terms of accuracy, albeit still at a higher computational cost.

Cite as

Zhen Cao, Jiafan Zhu, and Luay Nakhleh. Empirical Performance of Tree-Based Inference of Phylogenetic Networks. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 21:1-21:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{cao_et_al:LIPIcs.WABI.2019.21,
  author =	{Cao, Zhen and Zhu, Jiafan and Nakhleh, Luay},
  title =	{{Empirical Performance of Tree-Based Inference of Phylogenetic Networks}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{21:1--21:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.21},
  URN =		{urn:nbn:de:0030-drops-110510},
  doi =		{10.4230/LIPIcs.WABI.2019.21},
  annote =	{Keywords: Phylogenetic networks, species tree, tree-based networks, multi-locus phylogeny}
}
Document
A Combinatorial Approach for Single-cell Variant Detection via Phylogenetic Inference

Authors: Mohammadamin Edrisi, Hamim Zafar, and Luay Nakhleh


Abstract
Single-cell sequencing provides a powerful approach for elucidating intratumor heterogeneity by resolving cell-to-cell variability. However, it also poses additional challenges including elevated error rates, allelic dropout and non-uniform coverage. A recently introduced single-cell-specific mutation detection algorithm leverages the evolutionary relationship between cells for denoising the data. However, due to its probabilistic nature, this method does not scale well with the number of cells. Here, we develop a novel combinatorial approach for utilizing the genealogical relationship of cells in detecting mutations from noisy single-cell sequencing data. Our method, called scVILP, jointly detects mutations in individual cells and reconstructs a perfect phylogeny among these cells. We employ a novel Integer Linear Program algorithm for deterministically and efficiently solving the joint inference problem. We show that scVILP achieves similar or better accuracy but significantly better runtime over existing methods on simulated data. We also applied scVILP to an empirical human cancer dataset from a high grade serous ovarian cancer patient.

Cite as

Mohammadamin Edrisi, Hamim Zafar, and Luay Nakhleh. A Combinatorial Approach for Single-cell Variant Detection via Phylogenetic Inference. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 22:1-22:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


@InProceedings{edrisi_et_al:LIPIcs.WABI.2019.22,
  author =	{Edrisi, Mohammadamin and Zafar, Hamim and Nakhleh, Luay},
  title =	{{A Combinatorial Approach for Single-cell Variant Detection via Phylogenetic Inference}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{22:1--22:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.22},
  URN =		{urn:nbn:de:0030-drops-110525},
  doi =		{10.4230/LIPIcs.WABI.2019.22},
  annote =	{Keywords: Mutation calling, Single-cell sequencing, Integer linear programming, Perfect phylogeny}
}
Document
Topological Data Analysis Reveals Principles of Chromosome Structure in Cellular Differentiation

Authors: Natalie Sauerwald, Yihang Shen, and Carl Kingsford


Abstract
Topological data analysis (TDA) is a mathematically well-founded set of methods to derive robust information about the structure and topology of data. It has been applied successfully in several biological contexts. Derived primarily from algebraic topology, TDA rigorously identifies persistent features in complex data, making it well-suited to better understand the key features of three-dimensional chromosome structure. Chromosome structure has a significant influence on many diverse genomic processes and has recently been shown to relate to cellular differentiation. While there exist many methods to study specific substructures of chromosomes, we are still missing a global view of all geometric features of chromosomes. By applying TDA to the study of chromosome structure through differentiation across three cell lines, we provide insight into principles of chromosome folding and looping. We identify persistent connected components and one-dimensional topological features of chromosomes and characterize them across cell types and stages of differentiation. Availability: Scripts to reproduce the results from this study can be found at https://github.com/Kingsford-Group/hictda

Cite as

Natalie Sauerwald, Yihang Shen, and Carl Kingsford. Topological Data Analysis Reveals Principles of Chromosome Structure in Cellular Differentiation. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 23:1-23:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)



@InProceedings{sauerwald_et_al:LIPIcs.WABI.2019.23,
  author =	{Sauerwald, Natalie and Shen, Yihang and Kingsford, Carl},
  title =	{{Topological Data Analysis Reveals Principles of Chromosome Structure in Cellular Differentiation}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{23:1--23:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.23},
  URN =		{urn:nbn:de:0030-drops-110537},
  doi =		{10.4230/LIPIcs.WABI.2019.23},
  annote =	{Keywords: topological data analysis, chromosome structure, Hi-C, topologically associating domains}
}
Document
Synteny Paths for Assembly Graphs Comparison

Authors: Evgeny Polevikov and Mikhail Kolmogorov


Abstract
Despite the recent developments of long-read sequencing technologies, it is still difficult to produce complete assemblies of eukaryotic genomes in an automated fashion. Genome assembly software typically outputs assembled fragments (contigs) along with assembly graphs that encode all possible layouts of these contigs. Graph representation of the assembled genome can be useful for gene discovery, haplotyping, structural variation analysis, and other applications. To facilitate the development of new graph-based approaches, it is important to develop algorithms for comparison and evaluation of assembly graphs produced by different software. In this work, we introduce synteny paths: maximal paths of homologous sequence between the compared assembly graphs. We describe Asgan, an algorithm for efficient synteny paths decomposition, and use it to evaluate assembly graphs of various bacterial assemblies produced by different approaches. We then apply Asgan to discover structural variations between the assemblies of 15 Drosophila genomes, and show that synteny paths are robust to contig fragmentation. The Asgan tool is freely available at: https://github.com/epolevikov/Asgan.

Cite as

Evgeny Polevikov and Mikhail Kolmogorov. Synteny Paths for Assembly Graphs Comparison. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 24:1-24:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)



@InProceedings{polevikov_et_al:LIPIcs.WABI.2019.24,
  author =	{Polevikov, Evgeny and Kolmogorov, Mikhail},
  title =	{{Synteny Paths for Assembly Graphs Comparison}},
  booktitle =	{19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
  pages =	{24:1--24:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-123-8},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{143},
  editor =	{Huber, Katharina T. and Gusfield, Dan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.24},
  URN =		{urn:nbn:de:0030-drops-110545},
  doi =		{10.4230/LIPIcs.WABI.2019.24},
  annote =	{Keywords: Assembly graphs, Genome assembly, Synteny blocks, Comparative Genomics}
}
