LIPIcs, Volume 88

17th International Workshop on Algorithms in Bioinformatics (WABI 2017)




Event

WABI 2017, August 21-23, 2017, Boston, MA, USA

Editors

Russell Schwartz
Knut Reinert

Publication Details

  • published at: 2017-08-11
  • Publisher: Schloss Dagstuhl – Leibniz-Zentrum für Informatik
  • ISBN: 978-3-95977-050-7
  • DBLP: db/conf/wabi/wabi2017

Documents

Document
Complete Volume
LIPIcs, Volume 88, WABI'17, Complete Volume

Authors: Russell Schwartz and Knut Reinert


Cite as

17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@Proceedings{schwartz_et_al:LIPIcs.WABI.2017,
  title =	{{LIPIcs, Volume 88, WABI'17, Complete Volume}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017},
  URN =		{urn:nbn:de:0030-drops-78120},
  doi =		{10.4230/LIPIcs.WABI.2017},
  annote =	{Keywords: Nonnumerical Algorithms and Problems, Pattern Matching, Algorithms, Life and Medical Sciences}
}
Document
Front Matter
Front Matter, Table of Contents, Preface, List of Authors

Authors: Russell Schwartz and Knut Reinert


Cite as

17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 0:i-0:xiv, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{schwartz_et_al:LIPIcs.WABI.2017.0,
  author =	{Schwartz, Russell and Reinert, Knut},
  title =	{{Front Matter, Table of Contents, Preface, List of Authors}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{0:i--0:xiv},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.0},
  URN =		{urn:nbn:de:0030-drops-76348},
  doi =		{10.4230/LIPIcs.WABI.2017.0},
  annote =	{Keywords: Front Matter, Table of Contents, Preface, List of Authors}
}
Document
Disentangled Long-Read De Bruijn Graphs via Optical Maps

Authors: Bahar Alipanahi, Leena Salmela, Simon J. Puglisi, Martin Muggli, and Christina Boucher


Abstract
While long reads produced by third-generation sequencing technology from, e.g., Pacific Biosciences have been shown to increase the quality of draft genomes in repetitive regions, fundamental computational challenges remain in overcoming their high error rate and assembling them efficiently. In this paper we show that the de Bruijn graph built on the long reads can be efficiently and substantially disentangled using optical mapping data as auxiliary information. Fundamental to our approach is the use of the positional de Bruijn graph and a succinct data structure for constructing and traversing this graph. Our experimental results show that over 97.7% of directed cycles have been removed from the resulting positional de Bruijn graph as compared to its non-positional counterpart. Our results thus indicate that disentangling the de Bruijn graph using positional information is a promising direction for developing a simple and efficient assembly algorithm for long reads.

Cite as

Bahar Alipanahi, Leena Salmela, Simon J. Puglisi, Martin Muggli, and Christina Boucher. Disentangled Long-Read De Bruijn Graphs via Optical Maps. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 1:1-1:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{alipanahi_et_al:LIPIcs.WABI.2017.1,
  author =	{Alipanahi, Bahar and Salmela, Leena and Puglisi, Simon J. and Muggli, Martin and Boucher, Christina},
  title =	{{Disentangled Long-Read De Bruijn Graphs via Optical Maps}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{1:1--1:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.1},
  URN =		{urn:nbn:de:0030-drops-76614},
  doi =		{10.4230/LIPIcs.WABI.2017.1},
  annote =	{Keywords: Positional de Bruijn graph, Genome Assembly, Long Read Data, Optical maps}
}
Document
Gene Tree Parsimony for Incomplete Gene Trees

Authors: Md. Shamsuzzoha Bayzid and Tandy Warnow


Abstract
Species tree estimation from gene trees can be complicated by gene duplication and loss, and "gene tree parsimony" (GTP) is one approach for estimating species trees from multiple gene trees. In its standard formulation, the objective is to find a species tree that minimizes the total number of gene duplications and losses with respect to the input set of gene trees. Although much is known about GTP, little is known about how to treat inputs containing some incomplete gene trees (i.e., gene trees lacking one or more of the species). We present new theory for GTP considering whether the incompleteness is due to gene birth and death (i.e., true biological loss) or taxon sampling, and present dynamic programming algorithms that can be used for an exact but exponential time solution for small numbers of taxa, or as a heuristic for larger numbers of taxa. We also prove that the "standard" calculations for duplications and losses exactly solve GTP when incompleteness results from taxon sampling, although they can be incorrect when incompleteness results from true biological loss. The software for the DP algorithm is freely available as open source code at https://github.com/shamsbayzid/DynaDup.

Cite as

Md. Shamsuzzoha Bayzid and Tandy Warnow. Gene Tree Parsimony for Incomplete Gene Trees. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 2:1-2:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{bayzid_et_al:LIPIcs.WABI.2017.2,
  author =	{Bayzid, Md. Shamsuzzoha and Warnow, Tandy},
  title =	{{Gene Tree Parsimony for Incomplete Gene Trees}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{2:1--2:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.2},
  URN =		{urn:nbn:de:0030-drops-76495},
  doi =		{10.4230/LIPIcs.WABI.2017.2},
  annote =	{Keywords: Gene duplication and loss, gene tree parsimony, deep coalescence}
}
Document
Better Greedy Sequence Clustering with Fast Banded Alignment

Authors: Brian Brubach, Jay Ghurye, Mihai Pop, and Aravind Srinivasan


Abstract
Comparing a string to a large set of sequences is a key subroutine in greedy heuristics for clustering genomic data. Clustering 16S rRNA gene sequences into operational taxonomic units (OTUs) is a common method used in studying microbial communities. We present a new approach to greedy clustering using a trie-like data structure and Four Russians speedup. We evaluate the running time of our method in terms of the number of comparisons it makes during clustering and show in experimental results that the number of comparisons grows linearly with the size of the dataset as opposed to the quadratic running time of other methods. We compare the clusters output by our method to the popular greedy clustering tool UCLUST. We show that the clusters we generate can be both tighter and larger.

Cite as

Brian Brubach, Jay Ghurye, Mihai Pop, and Aravind Srinivasan. Better Greedy Sequence Clustering with Fast Banded Alignment. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 3:1-3:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{brubach_et_al:LIPIcs.WABI.2017.3,
  author =	{Brubach, Brian and Ghurye, Jay and Pop, Mihai and Srinivasan, Aravind},
  title =	{{Better Greedy Sequence Clustering with Fast Banded Alignment}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{3:1--3:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.3},
  URN =		{urn:nbn:de:0030-drops-76425},
  doi =		{10.4230/LIPIcs.WABI.2017.3},
  annote =	{Keywords: Sequence Clustering, Metagenomics, String Algorithms}
}
Document
Optimal Computation of Overabundant Words

Authors: Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, and Dimitris Polychronopoulos


Abstract
The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n-4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.

Cite as

Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, and Dimitris Polychronopoulos. Optimal Computation of Overabundant Words. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 4:1-4:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{almirantis_et_al:LIPIcs.WABI.2017.4,
  author =	{Almirantis, Yannis and Charalampopoulos, Panagiotis and Gao, Jia and Iliopoulos, Costas S. and Mohamed, Manal and Pissis, Solon P. and Polychronopoulos, Dimitris},
  title =	{{Optimal Computation of Overabundant Words}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{4:1--4:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.4},
  URN =		{urn:nbn:de:0030-drops-76468},
  doi =		{10.4230/LIPIcs.WABI.2017.4},
  annote =	{Keywords: overabundant words, avoided words, suffix tree, DNA sequence analysis}
}
Document
Detecting Locus Acquisition Events in Gene Trees

Authors: Michal Aleksander Ciach, Anna Muszewska, and Pawel Górecki


Abstract
Horizontal Gene Transfer (HGT), a process of acquisition and fixation of foreign genetic material, is an important biological phenomenon. Several approaches to HGT inference have been proposed. However, most of them rely either on approximate, non-phylogenetic methods or on tree reconciliation, which is computationally intensive and sensitive to parameter values. In this work, we investigate the Locus Tree Inference problem as a possible alternative that combines the advantages of both approaches. We show several algorithms to solve the problem in the parsimony framework. We introduce a novel tree mapping, which allows us to obtain a heuristic solution to the problems of locus tree inference and duplication classification. Our approach not only allows for faster comparisons of gene and species trees but also improves known algorithms for duplication inference in the presence of polytomies in the species trees.

Cite as

Michal Aleksander Ciach, Anna Muszewska, and Pawel Górecki. Detecting Locus Acquisition Events in Gene Trees. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 5:1-5:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{ciach_et_al:LIPIcs.WABI.2017.5,
  author =	{Ciach, Michal Aleksander and Muszewska, Anna and G\'{o}recki, Pawel},
  title =	{{Detecting Locus Acquisition Events in Gene Trees}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{5:1--5:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.5},
  URN =		{urn:nbn:de:0030-drops-76545},
  doi =		{10.4230/LIPIcs.WABI.2017.5},
  annote =	{Keywords: rank, taxon, ranked species tree, speciation, gene duplication, gene loss, horizontal gene transfer}
}
Document
An IP Algorithm for RNA Folding Trajectories

Authors: Amir H. Bayegan and Peter Clote


Abstract
The Vienna RNA Package software Kinfold implements the Gillespie algorithm for RNA secondary structure folding kinetics, for the move sets MS1 [resp. MS2], consisting of base pair additions and removals [resp. base pair additions, removals and shifts]. In this paper, for arbitrary secondary structures s, t of a given RNA sequence, we present the first optimal algorithm to compute the shortest MS2 folding trajectory s = s_0, s_1, ..., s_m = t, where each intermediate structure s_{i+1} is obtained from its predecessor by the addition, removal or shift of a single base pair. The length of the shortest MS1 trajectory between s and t is trivially equal to the number of base pairs belonging to s but not t, plus the number of base pairs belonging to t but not s. Our optimal algorithm applies integer programming (IP) to solve (essentially) the minimum feedback vertex set (FVS) problem for the "conflict digraph" associated with input secondary structures s, t, and then applies topological sort, in order to generate an optimal MS2 folding pathway from s to t that maximizes the use of shift moves. Since the optimal algorithm may require excessive run time, we also sketch a fast, near-optimal algorithm (details to appear elsewhere). Software for our algorithm will be publicly available at http://bioinformatics.bc.edu/clotelab/MS2distance/.
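
The MS1 distance described in the abstract (base pairs in s but not t, plus base pairs in t but not s) can be illustrated with a minimal sketch. It assumes that structures are given in pseudoknot-free dot-bracket notation, which the abstract does not specify, and the function names `parse_base_pairs` and `ms1_distance` are hypothetical. It covers only the trivial MS1 case and is not the IP-based MS2 algorithm of the paper.

```python
from __future__ import annotations


def parse_base_pairs(dot_bracket: str) -> set[tuple[int, int]]:
    """Return the set of base pairs (i, j) of a dot-bracket string.

    Assumption: pseudoknot-free notation using only '(', ')' and '.'.
    """
    stack: list[int] = []
    pairs: set[tuple[int, int]] = set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            pairs.add((stack.pop(), i))
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return pairs


def ms1_distance(s: str, t: str) -> int:
    """Number of base pairs in s but not t, plus those in t but not s."""
    if len(s) != len(t):
        raise ValueError("structures must belong to the same sequence")
    # The symmetric difference contains exactly the pairs that must be
    # removed from s or added to reach t.
    return len(parse_base_pairs(s) ^ parse_base_pairs(t))
```

For example, `ms1_distance("((..))", "(....)")` counts the single inner base pair that has to be removed.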

Cite as

Amir H. Bayegan and Peter Clote. An IP Algorithm for RNA Folding Trajectories. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 6:1-6:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{bayegan_et_al:LIPIcs.WABI.2017.6,
  author =	{Bayegan, Amir H. and Clote, Peter},
  title =	{{An IP Algorithm for RNA Folding Trajectories}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{6:1--6:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.6},
  URN =		{urn:nbn:de:0030-drops-76437},
  doi =		{10.4230/LIPIcs.WABI.2017.6},
  annote =	{Keywords: Integer programming, RNA secondary structure, folding trajectory, feedback vertex problem, conflict digraph}
}
Document
Fast Spaced Seed Hashing

Authors: Samuele Girotto, Matteo Comin, and Cinzia Pizzi


Abstract
Hashing k-mers is a common function across many bioinformatics applications and it is widely used for indexing, querying and rapid similarity search. Recently, spaced seeds, a special type of pattern that accounts for errors or mutations, have come into routine use in place of k-mers. Spaced seeds improve sensitivity with respect to k-mers in many applications; however, hashing spaced seeds substantially increases the computational time. Hence, the ability to speed up hashing operations of spaced seeds would have a major impact in the field, making spaced seed applications not only accurate, but also faster and more efficient. In this paper we address the problem of efficient spaced seed hashing. The proposed algorithm exploits the similarity of adjacent spaced seed hash values in an input sequence in order to efficiently compute the next hash. We report a series of experiments on NGS reads hashing using several spaced seeds. In the experiments, our algorithm can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, of between 1.6x and 5.3x, depending on the structure of the spaced seed.
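
To make the problem concrete, the following minimal sketch hashes every position of a sequence with a spaced seed from scratch, which is the kind of baseline the abstract calls the traditional approach. The 2-bit nucleotide encoding, the `'1'`/`'0'` mask format, and the name `spaced_seed_hashes` are assumptions made for illustration. The sketch does not reuse information between adjacent positions, so it is not the paper's fast algorithm.

```python
# Assumed 2-bit encoding of the DNA alphabet (not taken from the paper).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def spaced_seed_hashes(seq: str, mask: str) -> list:
    """Hash every window of seq under a spaced seed.

    mask is a string over {'1', '0'}: '1' marks a position that
    contributes to the hash, '0' a "don't care" position.
    """
    if not mask or set(mask) - {"0", "1"}:
        raise ValueError("mask must be a non-empty string of '0' and '1'")
    care = [i for i, c in enumerate(mask) if c == "1"]
    hashes = []
    for start in range(len(seq) - len(mask) + 1):
        h = 0
        # Recompute the hash from scratch: one shift-and-or per '1' position.
        for offset in care:
            h = (h << 2) | ENCODE[seq[start + offset]]
        hashes.append(h)
    return hashes
```

Each window costs one operation per `'1'` in the mask. The abstract's idea of exploiting the similarity of adjacent hash values targets exactly this repeated work.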

Cite as

Samuele Girotto, Matteo Comin, and Cinzia Pizzi. Fast Spaced Seed Hashing. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{girotto_et_al:LIPIcs.WABI.2017.7,
  author =	{Girotto, Samuele and Comin, Matteo and Pizzi, Cinzia},
  title =	{{Fast Spaced Seed Hashing}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{7:1--7:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.7},
  URN =		{urn:nbn:de:0030-drops-76501},
  doi =		{10.4230/LIPIcs.WABI.2017.7},
  annote =	{Keywords: k-mers, spaced seeds, efficient hashing}
}
Document
A General Framework for Gene Tree Correction Based on Duplication-Loss Reconciliation

Authors: Nadia El-Mabrouk and Aïda Ouangraoua


Abstract
Due to the key role played by gene trees and species phylogenies in biological studies, it is essential to have as much confidence as possible in the available trees. As phylogenetic tools are error-prone, it is a common task to use a correction method for improving an initial tree. Various correction methods exist. In this paper we focus on those based on the Duplication-Loss reconciliation model. The polytomy resolution approach consists in contracting weakly supported branches and then refining the obtained non-binary tree in a way that minimizes a reconciliation distance with the given species tree. On the other hand, the supertree approach takes as input a set of separated subtrees, obtained either for separate orthology groups or by removing the upper branches of an initial tree to a certain level, and amalgamates them in an optimal way that preserves the topology of the initial trees. The two classes of problems have always been considered as two separate fields, based on apparently different models. In this paper we give a unifying view showing that these two classes of problems are in fact special cases of a more general problem that we call LabelGTC, whose input includes a 0-1 edge-labelled gene tree to be corrected. Considering a tree as a set of triplets, we also formulate the TripletGTC Problem whose input includes a set of gene triplets that should be preserved in the corrected tree. These two general models make it possible to unify, understand and compare the principles of the duplication-loss reconciliation-based tree correction approaches. We show that LabelGTC is a special case of TripletGTC. We then develop appropriate algorithms for handling these two general correction problems.

Cite as

Nadia El-Mabrouk and Aïda Ouangraoua. A General Framework for Gene Tree Correction Based on Duplication-Loss Reconciliation. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 8:1-8:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{elmabrouk_et_al:LIPIcs.WABI.2017.8,
  author =	{El-Mabrouk, Nadia and Ouangraoua, A\"{i}da},
  title =	{{A General Framework for Gene Tree Correction Based on Duplication-Loss Reconciliation}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{8:1--8:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.8},
  URN =		{urn:nbn:de:0030-drops-76565},
  doi =		{10.4230/LIPIcs.WABI.2017.8},
  annote =	{Keywords: Gene tree correction, Supertree, Polytomy, Reconciliation, Phylogeny}
}
Document
Towards Distance-Based Phylogenetic Inference in Average-Case Linear-Time

Authors: Maxime Crochemore, Alexandre P. Francisco, Solon P. Pissis, and Cátia Vaz


Abstract
Computing genetic evolution distances among a set of taxa dominates the running time of many phylogenetic inference methods. Most genetic evolution distance definitions rely, even if indirectly, on computing the pairwise Hamming distance among sequences or profiles. We propose here an average-case linear-time algorithm to compute pairwise Hamming distances among a set of taxa under a given Hamming distance threshold. This article includes both a theoretical analysis and extensive experimental results concerning the proposed algorithm. We further show how this algorithm can be successfully integrated into a well-known phylogenetic inference method.
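
The task stated in the abstract, reporting pairwise Hamming distances that fall under a threshold, can be sketched naively as follows. This is a quadratic all-pairs baseline with early termination, written only to pin down the input and output. It is not the average-case linear-time algorithm of the paper, and the names `bounded_hamming` and `pairs_within_threshold` are hypothetical.

```python
from itertools import combinations


def bounded_hamming(a: str, b: str, k: int):
    """Return the Hamming distance of a and b, or None if it exceeds k."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                # Stop early: this pair is already over the threshold.
                return None
    return mismatches


def pairs_within_threshold(seqs, k: int) -> dict:
    """Map each index pair (i, j), i < j, with distance <= k to that distance."""
    result = {}
    for (i, a), (j, b) in combinations(enumerate(seqs), 2):
        d = bounded_hamming(a, b, k)
        if d is not None:
            result[(i, j)] = d
    return result
```

The baseline inspects every pair of taxa, which is the cost the abstract identifies as dominating many phylogenetic inference methods.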

Cite as

Maxime Crochemore, Alexandre P. Francisco, Solon P. Pissis, and Cátia Vaz. Towards Distance-Based Phylogenetic Inference in Average-Case Linear-Time. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 9:1-9:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{crochemore_et_al:LIPIcs.WABI.2017.9,
  author =	{Crochemore, Maxime and Francisco, Alexandre P. and Pissis, Solon P. and Vaz, C\'{a}tia},
  title =	{{Towards Distance-Based Phylogenetic Inference in Average-Case Linear-Time}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{9:1--9:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.9},
  URN =		{urn:nbn:de:0030-drops-76529},
  doi =		{10.4230/LIPIcs.WABI.2017.9},
  annote =	{Keywords: computational biology, phylogenetic inference, Hamming distance}
}
Document
Yanagi: Transcript Segment Library Construction for RNA-Seq Quantification

Authors: Mohamed K. Gunady, Steffen Cornwell, Stephen M. Mount, and Héctor Corrada Bravo


Abstract
Analysis of differential alternative splicing from RNA-seq data is complicated by the fact that many RNA-seq reads map to multiple transcripts, and that annotated transcripts from a given gene are often a small subset of many possible complete transcripts for that gene. Here we describe Yanagi, a tool which segments a transcriptome into disjoint regions to create a segments library from a complete transcriptome annotation that preserves all of its consecutive regions of a given length L while distinguishing annotated alternative splicing events in the transcriptome. In this paper, we formalize this concept of transcriptome segmentation and propose an efficient algorithm for generating segment libraries based on a length parameter dependent on specific RNA-Seq library construction. The resulting segment sequences can be used with pseudo-alignment tools to quantify expression at the segment level. We characterize the segment libraries for the reference transcriptomes of Drosophila melanogaster and Homo sapiens. Finally, we demonstrate the utility of quantification using a segment library based on an analysis of differential exon skipping in Drosophila melanogaster and Homo sapiens. The notion of transcript segmentation as introduced here and implemented in Yanagi will open the door for the application of lightweight, ultra-fast pseudo-alignment algorithms in a wide variety of analyses of transcription variation.

Cite as

Mohamed K. Gunady, Steffen Cornwell, Stephen M. Mount, and Héctor Corrada Bravo. Yanagi: Transcript Segment Library Construction for RNA-Seq Quantification. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 10:1-10:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{gunady_et_al:LIPIcs.WABI.2017.10,
  author =	{Gunady, Mohamed K. and Cornwell, Steffen and Mount, Stephen M. and Corrada Bravo, H\'{e}ctor},
  title =	{{Yanagi: Transcript Segment Library Construction for RNA-Seq Quantification}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{10:1--10:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.10},
  URN =		{urn:nbn:de:0030-drops-76487},
  doi =		{10.4230/LIPIcs.WABI.2017.10},
  annote =	{Keywords: RNA-Seq, Genome Sequencing, Kmer-based alignment, Transcriptome Quantification, Differential Alternative Splicing}
}
Document
Shrinkage Clustering: A Fast and Size-Constrained Algorithm for Biomedical Applications

Authors: Chenyue W. Hu, Hanyang Li, and Amina A. Qutub


Abstract
Motivation: Many common clustering algorithms require a two-step process that limits their efficiency. The algorithms need to be performed repetitively and need to be implemented together with a model selection criterion, in order to determine both the number of clusters present in the data and the corresponding cluster memberships. As biomedical datasets increase in size and prevalence, there is a growing need for new methods that are more convenient to implement and are more computationally efficient. In addition, it is often essential to obtain clusters of sufficient sample size to make the clustering result meaningful and interpretable for subsequent analysis. Results: We introduce Shrinkage Clustering, a novel clustering algorithm based on matrix factorization that simultaneously finds the optimal number of clusters while partitioning the data. We report its performance across multiple simulated and actual datasets, and demonstrate its strength in accuracy and speed in application to subtyping cancer and brain tissues. In addition, the algorithm offers a straightforward solution to clustering with cluster size constraints. Given its ease of implementation, computing efficiency and extensible structure, we believe Shrinkage Clustering can be applied broadly to solve biomedical clustering tasks, especially when dealing with large datasets.

Cite as

Chenyue W. Hu, Hanyang Li, and Amina A. Qutub. Shrinkage Clustering: A Fast and Size-Constrained Algorithm for Biomedical Applications. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 11:1-11:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{hu_et_al:LIPIcs.WABI.2017.11,
  author =	{Hu, Chenyue W. and Li, Hanyang and Qutub, Amina A.},
  title =	{{Shrinkage Clustering: A Fast and Size-Constrained Algorithm for Biomedical Applications}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{11:1--11:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.11},
  URN =		{urn:nbn:de:0030-drops-76556},
  doi =		{10.4230/LIPIcs.WABI.2017.11},
  annote =	{Keywords: Clustering, Matrix Factorization, Cancer Subtyping, Gene Expression}
}
Document
Sparsification Enables Predicting Kissing Hairpin Pseudoknot Structures of Long RNAs in Practice

Authors: Hosna Jabbari, Ian Wark, Carlo Montemagno, and Sebastian Will


Abstract
While computational RNA secondary structure prediction is an important tool in RNA research, it is still fundamentally limited to pseudoknot-free structures (or at best very simple pseudoknots) in practice. Here, we make the prediction of complex pseudoknots - including kissing hairpin structures - practically applicable by reducing the originally high space consumption. For this aim, we apply the technique of sparsification and other space-saving modifications to the recurrences of the pseudoknot prediction algorithm by Chen, Condon and Jabbari (CCJ algorithm). Thus, the theoretical space complexity of free energy minimization is reduced to Theta(n^3+Z), in the sequence length n and the number of non-optimally decomposable fragments ("candidates") Z. The sparsified CCJ algorithm, sparseCCJ, is presented in detail. Moreover, we provide and compare three generations of CCJ implementations, which continuously improve the space requirements: the original CCJ implementation, our first modified implementation, and our final sparsified implementation. The two latest implementations implement the established HotKnots DP09 energy model. In our experiments, using 244GB of RAM, the original CCJ implementation failed to handle sequences longer than 195 bases; sparseCCJ handles our pseudoknot data set (up to about length 400 bases) in this space limit. All three CCJ implementations are available at https://github.com/HosnaJabbari/CCJ.

Cite as

Hosna Jabbari, Ian Wark, Carlo Montemagno, and Sebastian Will. Sparsification Enables Predicting Kissing Hairpin Pseudoknot Structures of Long RNAs in Practice. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 12:1-12:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)



@InProceedings{jabbari_et_al:LIPIcs.WABI.2017.12,
  author =	{Jabbari, Hosna and Wark, Ian and Montemagno, Carlo and Will, Sebastian},
  title =	{{Sparsification Enables Predicting Kissing Hairpin Pseudoknot Structures of Long RNAs in Practice}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{12:1--12:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.12},
  URN =		{urn:nbn:de:0030-drops-76408},
  doi =		{10.4230/LIPIcs.WABI.2017.12},
  annote =	{Keywords: RNA, secondary structure prediction, pseudoknots, space efficiency, sparsification}
}
Document
Vaquita: Fast and Accurate Identification of Structural Variation Using Combined Evidence

Authors: Jongkyu Kim and Knut Reinert


Abstract
Motivation: Comprehensive identification of structural variations (SVs) is a crucial task for studying genetic diversity and diseases. However, it remains challenging. There is only a marginal consensus between different methods, and our understanding of SVs is substantially limited. In general, integration of multiple pieces of evidence including split-read, read-pair, soft-clip, and read-depth yields the best result regarding accuracy. However, doing this step by step is usually cumbersome and computationally expensive. Result: We present Vaquita, an accurate and fast tool for the identification of structural variations, which leverages all four types of evidence in a single program. After merging SVs from split-reads and discordant read-pairs, Vaquita realigns the soft-clipped reads to the selected regions using a fast bit-vector algorithm. Furthermore, it also considers the discrepancy of depth distribution around breakpoints using Kullback-Leibler divergence. Finally, Vaquita provides an additional metric for candidate selection based on voting, and also provides robust prioritization based on rank aggregation. We show that Vaquita is robust in terms of sequencing coverage, insertion size of the library, and read length, and is comparable or even better for the identification of deletions, inversions, duplications, and translocations than state-of-the-art tools, using both simulated and real datasets. In addition, Vaquita is more than eight times faster than any other tool in the comparison. Availability: Vaquita is implemented in C++ using the SeqAn library. The source code is distributed under the BSD license and can be downloaded at http://github.com/seqan/vaquita

Cite as

Jongkyu Kim and Knut Reinert. Vaquita: Fast and Accurate Identification of Structural Variation Using Combined Evidence. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 13:1-13:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)



@InProceedings{kim_et_al:LIPIcs.WABI.2017.13,
  author =	{Kim, Jongkyu and Reinert, Knut},
  title =	{{Vaquita: Fast and Accurate Identification of Structural Variation Using Combined Evidence}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{13:1--13:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.13},
  URN =		{urn:nbn:de:0030-drops-76352},
  doi =		{10.4230/LIPIcs.WABI.2017.13},
  annote =	{Keywords: Structural variation}
}
Document
Assessing the Significance of Peptide Spectrum Match Scores

Authors: Anastasiia Abramova and Anton Korobeynikov


Abstract
Peptidic Natural Products (PNPs) are highly sought after bioactive compounds that include many antibiotic, antiviral and antitumor agents, immunosuppressors and toxins. Even though recent advancements in mass-spectrometry have led to the development of accurate sequencing methods for nonlinear (cyclic and branch-cyclic) peptides, requiring only picograms of input material, the identification of PNPs via a database search of mass spectra remains problematic. This holds particularly true when trying to evaluate the statistical significance of Peptide Spectrum Matches (PSMs), especially when working with non-linear peptides that often contain non-standard amino acids and modifications and have an overall complex structure. In this paper we describe a new way of estimating the statistical significance of a PSM, defined by any peptide (linear or non-linear), by using state-of-the-art Markov Chain Monte Carlo methods. In addition to the estimate itself, our method also provides an uncertainty estimate in the form of confidence bounds, as well as an automatic simulation stopping rule that ensures that the sample size is sufficient to achieve the desired level of result accuracy.

Cite as

Anastasiia Abramova and Anton Korobeynikov. Assessing the Significance of Peptide Spectrum Match Scores. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 14:1-14:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)



@InProceedings{abramova_et_al:LIPIcs.WABI.2017.14,
  author =	{Abramova, Anastasiia and Korobeynikov, Anton},
  title =	{{Assessing the Significance of Peptide Spectrum Match Scores}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{14:1--14:11},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.14},
  URN =		{urn:nbn:de:0030-drops-76411},
  doi =		{10.4230/LIPIcs.WABI.2017.14},
  annote =	{Keywords: mass spectrometry, natural products, peptide spectrum matches, statistical significance}
}
Document
abSNP: RNA-Seq SNP Calling in Repetitive Regions via Abundance Estimation

Authors: Shunfu Mao, Soheil Mohajer, Kannan Ramachandran, David Tse, and Sreeram Kannan


Abstract
Variant calling, in particular calling SNPs (Single Nucleotide Polymorphisms), is a fundamental task in genomics. While existing packages offer excellent performance on calling SNPs which have uniquely mapped reads, they suffer in loci where the reads are multiply mapped, and are unable to make any reliable calls. Variants in multiply mapped loci can arise, for example, in long segmental duplications, and can play an important role in evolution and disease. In this paper, we develop a new SNP caller named abSNP, which offers three innovations. (a) abSNP calls SNPs from RNA-Seq data. Since RNA-Seq data is primarily sampled from gene regions, this method is inexpensive. (b) abSNP is able to successfully make calls on repetitive gene regions by carefully exploiting the quality scores of multiply mapped reads in order to make variant calls. (c) abSNP exploits a specific feature of RNA-Seq data, namely the varying abundance of different genes, in order to identify which repetitive copy a particular read is sampled from. We demonstrate that the proposed method offers significant performance gains on repetitive regions in simulated data. In particular, the algorithm is able to achieve near-perfect sensitivity on high-coverage SNPs, even when multiply mapped.

Cite as

Shunfu Mao, Soheil Mohajer, Kannan Ramachandran, David Tse, and Sreeram Kannan. abSNP: RNA-Seq SNP Calling in Repetitive Regions via Abundance Estimation. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 15:1-15:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)



@InProceedings{mao_et_al:LIPIcs.WABI.2017.15,
  author =	{Mao, Shunfu and Mohajer, Soheil and Ramachandran, Kannan and Tse, David and Kannan, Sreeram},
  title =	{{abSNP: RNA-Seq SNP Calling in Repetitive Regions via Abundance Estimation}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{15:1--15:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.15},
  URN =		{urn:nbn:de:0030-drops-76582},
  doi =		{10.4230/LIPIcs.WABI.2017.15},
  annote =	{Keywords: RNA-Seq, SNP Calling, Repetitive Region, Multiply Mapped Reads, Abundance Estimation}
}
Document
All Fingers Are Not the Same: Handling Variable-Length Sequences in a Discriminative Setting Using Conformal Multi-Instance Kernels

Authors: Sarvesh Nikumbh, Peter Ebert, and Nico Pfeifer


Abstract
Most string kernels for comparison of genomic sequences are generally tied to using (absolute) positional information of the features in the individual sequences. This poses limitations when comparing variable-length sequences using such string kernels. For example, profiling chromatin interactions by 3C-based experiments results in variable-length genomic sequences (restriction fragments). Here, exact position-wise occurrence of signals in sequences may not be as important as in the scenario of analysis of the promoter sequences, which typically have a transcription start site as reference. Existing position-aware string kernels have been shown to be useful for the latter scenario. In this work, we propose a novel approach for sequence comparison that enables larger positional freedom than most of the existing approaches, can identify a possibly dispersed set of features in comparing variable-length sequences, and can handle both the aforementioned scenarios. Our approach, CoMIK, identifies not just the features useful towards classification but also their locations in the variable-length sequences, as evidenced by the results of three binary classification experiments, aided by recently introduced visualization techniques. Furthermore, we show that we are able to efficiently retrieve and interpret the weight vector for the complex setting of multiple multi-instance kernels.

Cite as

Sarvesh Nikumbh, Peter Ebert, and Nico Pfeifer. All Fingers Are Not the Same: Handling Variable-Length Sequences in a Discriminative Setting Using Conformal Multi-Instance Kernels. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 16:1-16:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)



@InProceedings{nikumbh_et_al:LIPIcs.WABI.2017.16,
  author =	{Nikumbh, Sarvesh and Ebert, Peter and Pfeifer, Nico},
  title =	{{All Fingers Are Not the Same: Handling Variable-Length Sequences in a Discriminative Setting Using Conformal Multi-Instance Kernels}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{16:1--16:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.16},
  URN =		{urn:nbn:de:0030-drops-76459},
  doi =		{10.4230/LIPIcs.WABI.2017.16},
  annote =	{Keywords: Multiple instance learning, conformal MI kernels, 5C, Hi-C}
}
Document
Forbidden Time Travel: Characterization of Time-Consistent Tree Reconciliation Maps

Authors: Nikolai Nojgaard, Manuela Geiß, Daniel Merkle, Peter F. Stadler, Nicolas Wieseke, and Marc Hellmuth


Abstract
Motivation: In the absence of horizontal gene transfer it is possible to reconstruct the history of gene families from empirically determined orthology relations, which are equivalent to event-labeled gene trees. Knowledge of the event labels considerably simplifies the problem of reconciling a gene tree T with a species tree S, relative to the reconciliation problem without prior knowledge of the event types. It is well-known that optimal reconciliations in the unlabeled case may violate time-consistency and thus are not biologically feasible. Here we investigate the mathematical structure of the event-labeled reconciliation problem with horizontal transfer. Results: We investigate the issue of time-consistency for the event-labeled version of the reconciliation problem, provide a convenient axiomatic framework, and derive a complete characterization of time-consistent reconciliations. This characterization depends on certain weak conditions on the event-labeled gene trees that reflect conditions under which evolutionary events are observable at least in principle. We give an O(|V(T)|log(|V(S)|))-time algorithm to decide whether a time-consistent reconciliation map exists. It does not require the construction of explicit timing maps, but relies entirely on the comparably easy task of checking whether a small auxiliary graph is acyclic. The algorithms are implemented in C++ using the boost graph library and are freely available at https://github.com/Nojgaard/tc-recon. Significance: The combinatorial characterization of time consistency and thus biologically feasible reconciliation is an important step towards the inference of gene family histories with horizontal transfer from orthology data, i.e., without presupposed gene and species trees. The fast algorithm to decide time consistency is useful in a broader context because it constitutes an attractive component for all tools that address tree reconciliation problems.

Cite as

Nikolai Nojgaard, Manuela Geiß, Daniel Merkle, Peter F. Stadler, Nicolas Wieseke, and Marc Hellmuth. Forbidden Time Travel: Characterization of Time-Consistent Tree Reconciliation Maps. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 17:1-17:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)



@InProceedings{nojgaard_et_al:LIPIcs.WABI.2017.17,
  author =	{Nojgaard, Nikolai and Gei{\ss}, Manuela and Merkle, Daniel and Stadler, Peter F. and Wieseke, Nicolas and Hellmuth, Marc},
  title =	{{Forbidden Time Travel: Characterization of Time-Consistent Tree Reconciliation Maps}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{17:1--17:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.17},
  URN =		{urn:nbn:de:0030-drops-76362},
  doi =		{10.4230/LIPIcs.WABI.2017.17},
  annote =	{Keywords: Tree Reconciliation, Horizontal Gene Transfer, Reconciliation Map, Time-Consistency, History of gene families}
}
Document
Rainbowfish: A Succinct Colored de Bruijn Graph Representation

Authors: Fatemeh Almodaresi, Prashant Pandey, and Rob Patro


Abstract
The colored de Bruijn graph, a variant of the de Bruijn graph which associates each edge (i.e., k-mer) with some set of colors, is an increasingly important combinatorial structure in computational biology. Iqbal et al. demonstrated the utility of this structure for representing and assembling a collection (population) of genomes, and showed how it can be used to accurately detect genetic variants. Muggli et al. introduced VARI, a representation of the colored de Bruijn graph that adopts the BOSS representation for the de Bruijn graph topology and achieves considerable savings in space over Cortex, albeit with some sacrifice in speed. The memory-efficient representation of VARI allows the colored de Bruijn graph to be constructed and analyzed for large datasets, beyond what is possible with Cortex. In this paper, we introduce Rainbowfish, a succinct representation of the color information of the colored de Bruijn graph that reduces the space usage even further. Our representation also uses BOSS to represent the de Bruijn graph, but decomposes the color sets based on an equivalence relation and exploits the inherent skewness in the distribution of these color sets. The Rainbowfish representation is compressed based on the 0th-order entropy of the color sets, which can lead to a significant reduction in the space required to store the relevant information for each edge. In practice, Rainbowfish achieves up to a 20x improvement in space over VARI. Rainbowfish is written in C++11 and is available at https://github.com/COMBINE-lab/rainbowfish.

Cite as

Fatemeh Almodaresi, Prashant Pandey, and Rob Patro. Rainbowfish: A Succinct Colored de Bruijn Graph Representation. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 18:1-18:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)



@InProceedings{almodaresi_et_al:LIPIcs.WABI.2017.18,
  author =	{Almodaresi, Fatemeh and Pandey, Prashant and Patro, Rob},
  title =	{{Rainbowfish: A Succinct Colored de Bruijn Graph Representation}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{18:1--18:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.18},
  URN =		{urn:nbn:de:0030-drops-76576},
  doi =		{10.4230/LIPIcs.WABI.2017.18},
  annote =	{Keywords: de Bruijn graph, succinct data structures, rank and select operation, colored de Bruijn graph}
}
Document
ThIEF: Finding Genome-wide Trajectories of Epigenetics Marks

Authors: Anton Polishko, Md. Abid Hasan, Weihua Pan, Evelien M. Bunnik, Karine Le Roch, and Stefano Lonardi


Abstract
We address the problem of comparing multiple genome-wide maps representing nucleosome positions or specific histone marks. These maps can originate from the comparative analysis of ChIP-Seq/MNase-Seq/FAIRE-Seq data for different cell types/tissues or multiple time points. The input to the problem is a set of maps, each of which is a list of genomic locations for nucleosomes or histone marks. The output is an alignment of nucleosomes/histone marks across time points (that we call trajectories), allowing small movements and gaps in some of the maps. We present a tool called ThIEF (TrackIng of Epigenetic Features) that can efficiently compute these trajectories. ThIEF comes in two "flavors": ThIEF:Iterative finds the trajectories progressively using bipartite matching, while ThIEF:LP solves a k-partite matching problem on a hypergraph using linear programming. ThIEF:LP is guaranteed to find the optimal solution, but it is slower than ThIEF:Iterative. We demonstrate the utility of ThIEF by providing an example application to the analysis of temporal nucleosome maps for the human malaria parasite. As a surprisingly remarkable result, we show that the output of ThIEF can be used to produce a supervised classifier that can accurately predict the position of stable nucleosomes (i.e., nucleosomes present in all time points) and unstable nucleosomes (i.e., present in at most half of the time points) from the primary DNA sequence. To the best of our knowledge, this is the first result on the prediction of the dynamics of nucleosomes solely based on their DNA binding preference. Software is available at https://github.com/ucrbioinfo/ThIEF.

Cite as

Anton Polishko, Md. Abid Hasan, Weihua Pan, Evelien M. Bunnik, Karine Le Roch, and Stefano Lonardi. ThIEF: Finding Genome-wide Trajectories of Epigenetics Marks. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 19:1-19:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)



@InProceedings{polishko_et_al:LIPIcs.WABI.2017.19,
  author =	{Polishko, Anton and Hasan, Md. Abid and Pan, Weihua and Bunnik, Evelien M. and Le Roch, Karine and Lonardi, Stefano},
  title =	{{ThIEF: Finding Genome-wide Trajectories of Epigenetics Marks}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{19:1--19:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.19},
  URN =		{urn:nbn:de:0030-drops-76375},
  doi =		{10.4230/LIPIcs.WABI.2017.19},
  annote =	{Keywords: Nucleosomes, Histone Marks, Histone Tail Modifications, Epigenetics, Genomics}
}
Document
Byte-Aligned Pattern Matching in Encoded Genomic Sequences

Authors: Petr Procházka and Jan Holub


Abstract
In this article, we propose a novel pattern matching algorithm, called BAPM, that performs searching in encoded genomic sequences. The algorithm works at the level of single bytes and achieves sublinear performance on average. The preprocessing phase of the algorithm is linear with respect to the size m of the searched pattern. A simple O(m)-space data structure is used to store all factors (with a defined length) of the searched pattern. These factors are later searched during the searching phase, which ensures sublinear time on average. Our algorithm significantly outperforms the state-of-the-art pattern matching algorithms in locate time on medium and long patterns. Furthermore, it is able to cooperate very easily with the block q-gram inverted index. The block q-gram inverted index together with our pattern matching algorithm achieves superior results in terms of locate time compared to the current index data structures for less frequent patterns. We present experimental results using real genomic data. These results demonstrate the efficiency of our algorithm.

Cite as

Petr Procházka and Jan Holub. Byte-Aligned Pattern Matching in Encoded Genomic Sequences. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 20:1-20:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)



@InProceedings{prochazka_et_al:LIPIcs.WABI.2017.20,
  author =	{Proch\'{a}zka, Petr and Holub, Jan},
  title =	{{Byte-Aligned Pattern Matching in Encoded Genomic Sequences}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{20:1--20:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.20},
  URN =		{urn:nbn:de:0030-drops-76538},
  doi =		{10.4230/LIPIcs.WABI.2017.20},
  annote =	{Keywords: genomic sequences, pattern matching, q-gram inverted index}
}
Document
Analysis of Min-Hashing for Variant Tolerant DNA Read Mapping

Authors: Jens Quedenfeld and Sven Rahmann


Abstract
DNA read mapping has become a ubiquitous task in bioinformatics. New technologies provide ever longer DNA reads (several thousand basepairs), although at comparatively high error rates (up to 15%), and the reference genome is increasingly not considered as a simple string over ACGT anymore, but as a complex object containing known genetic variants in the population. Conventional indexes based on exact seed matches, in particular the suffix array based FM index, struggle with these changing conditions, so other methods are being considered, and one such alternative is locality sensitive hashing. Here we examine the question whether including single nucleotide polymorphisms (SNPs) in a min-hashing index is beneficial. The answer depends on the population frequency of the SNP, and we analyze several models (from simple to complex) that provide precise answers to this question under various assumptions. Our results also provide sensitivity and specificity values for min-hashing based read mappers and may be used to understand dependencies between the parameters of such methods. We hope that this article will provide a theoretical foundation for a new generation of read mappers.

Cite as

Jens Quedenfeld and Sven Rahmann. Analysis of Min-Hashing for Variant Tolerant DNA Read Mapping. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 21:1-21:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)



@InProceedings{quedenfeld_et_al:LIPIcs.WABI.2017.21,
  author =	{Quedenfeld, Jens and Rahmann, Sven},
  title =	{{Analysis of Min-Hashing for Variant Tolerant DNA Read Mapping}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{21:1--21:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.21},
  URN =		{urn:nbn:de:0030-drops-76598},
  doi =		{10.4230/LIPIcs.WABI.2017.21},
  annote =	{Keywords: read mapping, min-Hashing, variant, SNP, analysis of algorithms}
}
Document
Efficient and Accurate Detection of Topologically Associating Domains from Contact Maps

Authors: Abbas Roayaei Ardakany and Stefano Lonardi


Abstract
Continuous improvements to high-throughput conformation capture (Hi-C) are revealing richer information about the spatial organization of the chromatin and its role in cellular functions. Several studies have confirmed the existence of structural features of the genome 3D organization that are stable across cell types and conserved across species, called topological associating domains (TADs). The detection of TADs has become a critical step in the analysis of Hi-C data, e.g., to identify enhancer-promoter associations. Here we present East, a novel TAD identification algorithm based on fast 2D convolution of Haar-like features, that is as accurate as the state-of-the-art method based on the directionality index, but 75-80x faster. East is available in the public domain at https://github.com/ucrbioinfo/EAST.

Cite as

Abbas Roayaei Ardakany and Stefano Lonardi. Efficient and Accurate Detection of Topologically Associating Domains from Contact Maps. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 22:1-22:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)



@InProceedings{roayaeiardakany_et_al:LIPIcs.WABI.2017.22,
  author =	{Roayaei Ardakany, Abbas and Lonardi, Stefano},
  title =	{{Efficient and Accurate Detection of Topologically Associating Domains from Contact Maps}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{22:1--22:11},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.22},
  URN =		{urn:nbn:de:0030-drops-76446},
  doi =		{10.4230/LIPIcs.WABI.2017.22},
  annote =	{Keywords: Chromatin, TADs, 3D genome, Hi-C, contact maps}
}
Document
Outlier Detection in BLAST Hits

Authors: Nidhi Shah, Stephen F. Altschul, and Mihai Pop


Abstract
An important task in a metagenomic analysis is the assignment of taxonomic labels to sequences in a sample. Most widely used methods for taxonomy assignment compare a sequence in the sample to a database of known sequences. Many approaches use the best BLAST hit(s) to assign the taxonomic label. However, it is known that the best BLAST hit may not always correspond to the best taxonomic match. An alternative approach involves phylogenetic methods which take into account alignments and a model of evolution in order to more accurately define the taxonomic origin of sequences. The similarity-search based methods typically run faster than phylogenetic methods and work well when the organisms in the sample are well represented in the database. On the other hand, phylogenetic methods have the capability to identify new organisms in a sample but are computationally quite expensive. We propose a two-step approach for metagenomic taxon identification; i.e., use a rapid method that accurately classifies sequences using a reference database (this is a filtering step) and then use a more complex phylogenetic method for the sequences that were unclassified in the previous step. In this work, we explore whether and when using top BLAST hit(s) yields a correct taxonomic label. We develop a method to detect outliers among BLAST hits in order to separate the phylogenetically most closely related matches from matches to sequences from more distantly related organisms. We used modified BILD (Bayesian Integral Log Odds) scores, a multiple-alignment scoring function, to define the outliers within a subset of top BLAST hits and assign taxonomic labels. We compared the accuracy of our method to the RDP classifier and show that our method yields fewer misclassifications while properly classifying organisms that are not present in the database. 
Finally, we evaluated the use of our method as a pre-processing step before more expensive phylogenetic analyses (in our case TIPP) in the context of real 16S rRNA datasets. Our experiments demonstrate the potential of our method to be a filtering step before using phylogenetic methods.

Cite as

Nidhi Shah, Stephen F. Altschul, and Mihai Pop. Outlier Detection in BLAST Hits. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 23:1-23:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{shah_et_al:LIPIcs.WABI.2017.23,
  author =	{Shah, Nidhi and Altschul, Stephen F. and Pop, Mihai},
  title =	{{Outlier Detection in BLAST Hits}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{23:1--23:11},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.23},
  URN =		{urn:nbn:de:0030-drops-76512},
  doi =		{10.4230/LIPIcs.WABI.2017.23},
  annote =	{Keywords: Taxonomy classification, Metagenomics, Sequence alignment, Outlier detection}
}
Document
Finding Local Genome Rearrangements

Authors: Pijus Simonaitis and Krister M. Swenson


Abstract
The Double Cut and Join (DCJ) model of genome rearrangement is well studied due to its mathematical simplicity and power to account for the many events that transform genome architecture. These studies have mostly been devoted to the understanding of minimum-length scenarios transforming one genome into another. In this paper we search instead for DCJ rearrangement scenarios that minimize the number of rearrangements whose breakpoints are unlikely according to some biological criteria. We establish a link between this Minimum Local Scenario (MLS) problem and the problem of finding a Maximum Edge-disjoint Cycle Packing (MECP) on an undirected graph. This link leads us to a 3/2-approximation for MLS, as well as an exact integer linear program. From a practical perspective, we briefly report on the applicability of our methods and the potential for computation of distances using a more general DCJ cost function.

Cite as

Pijus Simonaitis and Krister M. Swenson. Finding Local Genome Rearrangements. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 24:1-24:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{simonaitis_et_al:LIPIcs.WABI.2017.24,
  author =	{Simonaitis, Pijus and Swenson, Krister M.},
  title =	{{Finding Local Genome Rearrangements}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{24:1--24:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.24},
  URN =		{urn:nbn:de:0030-drops-76604},
  doi =		{10.4230/LIPIcs.WABI.2017.24},
  annote =	{Keywords: genome rearrangement, double cut and join, maximum edge-disjoint cycle packing, Hi-C, NP-complete, approximation algorithm}
}
Document
Seed-driven Learning of Position Probability Matrices from Large Sequence Sets

Authors: Jarkko Toivonen, Jussi Taipale, and Esko Ukkonen


Abstract
We formulate and analyze a novel seed-driven algorithm, SeedHam, for position probability matrix (PPM) learning. To learn a PPM of length l, the algorithm uses the most frequent l-mer of the training data as a seed, and then restricts the learning to a small Hamming neighbourhood of the seed. The SeedHam method is intended for PPM learning from large sequence sets (up to hundreds of Mbases) containing enriched motif instances. A robust variant of the method is introduced that decreases contamination from artefact instances of the motif and thereby allows using larger Hamming neighbourhoods. To solve the motif orientation problem in two-stranded DNA we introduce a novel seed finding rule, based on analysis of the palindromic structure of sequences. Test experiments are reported that illustrate the relative strengths of different variants of our methods and show that our algorithms are fast and give stable and accurate results. Availability and implementation: A C++ implementation of the method is available from https://github.com/jttoivon/seedham/ Contact: jarkko.toivonen@cs.helsinki.fi
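
The seed-and-neighbourhood idea in this abstract can be illustrated with a minimal Python sketch. This is not the authors' C++ implementation: the function names, the single-strand scan, the plain counting of every site within Hamming distance d of the seed, and the optional pseudocount are all assumptions made for illustration. The robust variant and the palindrome-based seed rule mentioned in the abstract are not shown.

```python
from collections import Counter

ALPHABET = "ACGT"


def lmers(sequences, l):
    """Yield every length-l substring that contains only A, C, G, T."""
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            site = seq[i:i + l]
            if set(site) <= set(ALPHABET):
                yield site


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def learn_ppm(sequences, l, d, pseudocount=0.0):
    """Hypothetical helper: return (seed, ppm) for motif length l.

    The seed is the most frequent l-mer (ties broken arbitrarily).
    The PPM is estimated from all sites within Hamming distance d of it.
    """
    counts = Counter(lmers(sequences, l))
    if not counts:
        raise ValueError("no valid l-mers of the requested length")
    seed = counts.most_common(1)[0][0]

    # One column of base counts per motif position.
    columns = [dict.fromkeys(ALPHABET, pseudocount) for _ in range(l)]
    for site, n in counts.items():
        if hamming(site, seed) <= d:
            for pos, base in enumerate(site):
                columns[pos][base] += n

    # Normalise each column so that its entries sum to 1.
    ppm = []
    for col in columns:
        total = sum(col.values())
        ppm.append({base: col[base] / total for base in ALPHABET})
    return seed, ppm
```

For example, `learn_ppm(["ACGTACGT", "TTACGTTT", "ACGAAAAA"], 4, 1)` picks `ACGT` as the seed and builds the matrix from the three `ACGT` sites plus the single `ACGA` site, which is the only other 4-mer within distance 1.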

Cite as

Jarkko Toivonen, Jussi Taipale, and Esko Ukkonen. Seed-driven Learning of Position Probability Matrices from Large Sequence Sets. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 25:1-25:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{toivonen_et_al:LIPIcs.WABI.2017.25,
  author =	{Toivonen, Jarkko and Taipale, Jussi and Ukkonen, Esko},
  title =	{{Seed-driven Learning of Position Probability Matrices from Large Sequence Sets}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{25:1--25:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.25},
  URN =		{urn:nbn:de:0030-drops-76470},
  doi =		{10.4230/LIPIcs.WABI.2017.25},
  annote =	{Keywords: motif finding, transcription factor binding site, sequence analysis, Hamming distance, seed}
}
Document
Improved De Novo Peptide Sequencing using LC Retention Time Information

Authors: Yves Frank, Tomas Hruz, Thomas Tschager, and Valentin Venzin


Abstract
Liquid chromatography combined with tandem mass spectrometry (LC-MS/MS) is an important tool in proteomics for identifying the peptides in a sample. Liquid chromatography temporally separates the peptides, and tandem mass spectrometry analyzes the peptides, which elute one after another, by measuring their mass-to-charge ratios and the mass-to-charge ratios of their prefix and suffix fragments. De novo peptide sequencing is the problem of reconstructing the amino acid sequence of the analyzed peptide from this measurement data. While previous approaches solely consider the mass spectrum of the fragments for reconstructing a sequence, we propose to also exploit the information obtained from liquid chromatography. We study the problem of computing a sequence that is not only in accordance with the experimental mass spectrum, but also with the retention time of the separation by liquid chromatography. We consider three models for predicting the retention time of a peptide and develop algorithms for de novo sequencing for each model. An evaluation on experimental data from synthesized peptides for two of these models shows an improved performance compared to not using the chromatographic information.

Cite as

Yves Frank, Tomas Hruz, Thomas Tschager, and Valentin Venzin. Improved De Novo Peptide Sequencing using LC Retention Time Information. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 26:1-26:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{frank_et_al:LIPIcs.WABI.2017.26,
  author =	{Frank, Yves and Hruz, Tomas and Tschager, Thomas and Venzin, Valentin},
  title =	{{Improved De Novo Peptide Sequencing using LC Retention Time Information}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{26:1--26:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.26},
  URN =		{urn:nbn:de:0030-drops-76383},
  doi =		{10.4230/LIPIcs.WABI.2017.26},
  annote =	{Keywords: Computational proteomics, Peptide identification, Mass spectrometry, De novo peptide sequencing, Retention time prediction}
}
Document
Optimal Completion of Incomplete Gene Trees in Polynomial Time Using OCTAL

Authors: Sarah Christensen, Erin K. Molloy, Pranjal Vachaspati, and Tandy Warnow


Abstract
Here we introduce the Optimal Tree Completion Problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. More formally, given a pair of unrooted binary trees (T,t) where T has leaf set S and t has leaf set R, a subset of S, we wish to add all the leaves from S \ R to t so as to produce a new tree t' on leaf set S that has the minimum distance to T. We show that when the distance is defined by the Robinson-Foulds (RF) distance, an optimal solution can be found in polynomial time. We also present OCTAL, an algorithm that solves this RF Optimal Tree Completion Problem exactly in quadratic time. We report on a simulation study where we complete estimated gene trees using a reference tree that is based on a species tree estimated from a multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach, but the accuracy of the completed gene trees computed by OCTAL depends on how topologically similar the estimated species tree is to the true gene tree. Hence, under conditions with relatively low gene tree heterogeneity, OCTAL can be used to provide highly accurate completions of estimated gene trees. We close with a discussion of future research.
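
The objective that the abstract optimizes, the Robinson-Foulds (RF) distance between two trees on the same leaf set, can be sketched in a few lines of Python. This is not OCTAL itself, only the distance it minimizes. The nested-tuple tree encoding, the function names, and the unnormalized convention (counting bipartitions found in exactly one tree) are assumptions chosen for illustration.

```python
def leaf_set(tree):
    """Leaves of a tree given as nested tuples with string leaf labels."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that does not contain a fixed
    anchor leaf, so the result does not depend on where the tuple
    encoding happens to be rooted.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_distance(t1, t2):
    """Number of bipartitions present in exactly one of the two trees."""
    if leaf_set(t1) != leaf_set(t2):
        raise ValueError("trees must have the same leaf set")
    return len(bipartitions(t1) ^ bipartitions(t2))
```

With `T = ((("a", "b"), "c"), ("d", "e"))` and `t = ((("a", "c"), "b"), ("d", "e"))`, the two trees share the split separating `d, e` from the rest and differ in one split each, so `rf_distance(T, t)` is 2. In the setting of the abstract, a completed tree t' would be compared against the reference tree T in this way.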

Cite as

Sarah Christensen, Erin K. Molloy, Pranjal Vachaspati, and Tandy Warnow. Optimal Completion of Incomplete Gene Trees in Polynomial Time Using OCTAL. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 27:1-27:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


@InProceedings{christensen_et_al:LIPIcs.WABI.2017.27,
  author =	{Christensen, Sarah and Molloy, Erin K. and Vachaspati, Pranjal and Warnow, Tandy},
  title =	{{Optimal Completion of Incomplete Gene Trees in Polynomial Time Using OCTAL}},
  booktitle =	{17th International Workshop on Algorithms in Bioinformatics (WABI 2017)},
  pages =	{27:1--27:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-050-7},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{88},
  editor =	{Schwartz, Russell and Reinert, Knut},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2017.27},
  URN =		{urn:nbn:de:0030-drops-76392},
  doi =		{10.4230/LIPIcs.WABI.2017.27},
  annote =	{Keywords: phylogenomics, missing data, coalescent-based species tree estimation, gene trees}
}
