LIPIcs, Volume 113

18th International Workshop on Algorithms in Bioinformatics (WABI 2018)



Event

WABI 2018, August 20-22, 2018, Helsinki, Finland

Editors

Laxmi Parida
Esko Ukkonen

Publication Details

  • published at: 2018-08-02
  • Publisher: Schloss Dagstuhl – Leibniz-Zentrum für Informatik
  • ISBN: 978-3-95977-082-8
  • DBLP: db/conf/wabi/wabi2018

Documents

Document
Complete Volume
LIPIcs, Volume 113, WABI'18, Complete Volume

Authors: Laxmi Parida and Esko Ukkonen


Abstract
LIPIcs, Volume 113, WABI'18, Complete Volume

Cite as

18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@Proceedings{parida_et_al:LIPIcs.WABI.2018,
  title =	{{LIPIcs, Volume 113, WABI'18, Complete Volume}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018},
  URN =		{urn:nbn:de:0030-drops-97246},
  doi =		{10.4230/LIPIcs.WABI.2018},
  annote =	{Keywords: Applied computing, Bioinformatics, Theory of computation, Design and analysis of algorithms, Mathematics of computing, Probabilistic inference problem}
}
Document
Front Matter
Front Matter, Table of Contents, Preface, Conference Organization

Authors: Laxmi Parida and Esko Ukkonen


Abstract
Front Matter, Table of Contents, Preface, Conference Organization

Cite as

18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 0:i-0:xvi, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{parida_et_al:LIPIcs.WABI.2018.0,
  author =	{Parida, Laxmi and Ukkonen, Esko},
  title =	{{Front Matter, Table of Contents, Preface, Conference Organization}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{0:i--0:xvi},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.0},
  URN =		{urn:nbn:de:0030-drops-93028},
  doi =		{10.4230/LIPIcs.WABI.2018.0},
  annote =	{Keywords: Front Matter, Table of Contents, Preface, Conference Organization}
}
Document
A Duality-Based Method for Identifying Elemental Balance Violations in Metabolic Network Models

Authors: Hooman Zabeti, Tamon Stephen, Bonnie Berger, and Leonid Chindelevitch


Abstract
Elemental balance, the property of having the same number of each type of atom on both sides of the equation, is a fundamental feature of chemical reactions. In metabolic network models, this property is typically verified on a reaction-by-reaction basis. In this paper we show how violations of elemental balance can be efficiently detected in an entire network, without the need for specifying the chemical formula of each of the metabolites, which enhances a modeler's ability to automatically verify that their model satisfies elemental balance. Our method makes use of duality theory, linear programming, and mixed integer linear programming, and runs efficiently on genome-scale metabolic networks (GSMNs). We detect elemental balance violations in 40 out of 84 metabolic network models in the BiGG database. We also identify a short list of reactions that are candidates for being elementally imbalanced. Out of these candidates, nearly half turn out to be truly imbalanced reactions, and the rest can be seen as witnesses of elemental balance violations elsewhere in the network. The majority of these violations involve a proton imbalance, a known challenge of metabolic network reconstruction. Our approach is efficient, easy to use and powerful. It can be helpful to metabolic network modelers during model verification. Our methods are fully integrated into the MONGOOSE software suite and are available at https://github.com/WGS-TB/MongooseGUI3.
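
A rough way to see the network-level check in code (this is not the paper's duality-based procedure, only a feasibility sketch): a network in which every internal reaction can be elementally balanced admits a strictly positive vector of metabolite weights y with S^T y = 0, so the absence of such a vector witnesses a violation somewhere. The sketch assumes a dense stoichiometric matrix S over the internal reactions and uses scipy; the function name check_balanceable is hypothetical.

import numpy as np
from scipy.optimize import linprog

def check_balanceable(S):
    """Return True if some strictly positive weight vector y satisfies S^T y = 0.

    S: stoichiometric matrix (metabolites x internal reactions). Since the
    feasible set is a cone, y > 0 may be replaced by y >= 1 after scaling.
    A False answer signals an elemental balance violation somewhere in the
    network, without pinpointing the offending reactions.
    """
    n_met = S.shape[0]
    res = linprog(c=np.zeros(n_met),                 # pure feasibility problem
                  A_eq=S.T, b_eq=np.zeros(S.shape[1]),
                  bounds=[(1, None)] * n_met)
    return res.status == 0                           # status 0: feasible optimum found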

Cite as

Hooman Zabeti, Tamon Stephen, Bonnie Berger, and Leonid Chindelevitch. A Duality-Based Method for Identifying Elemental Balance Violations in Metabolic Network Models. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 1:1-1:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{zabeti_et_al:LIPIcs.WABI.2018.1,
  author =	{Zabeti, Hooman and Stephen, Tamon and Berger, Bonnie and Chindelevitch, Leonid},
  title =	{{A Duality-Based Method for Identifying Elemental Balance Violations in Metabolic Network Models}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{1:1--1:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.1},
  URN =		{urn:nbn:de:0030-drops-93034},
  doi =		{10.4230/LIPIcs.WABI.2018.1},
  annote =	{Keywords: Metabolic network analysis, elemental imbalance, linear programming, model verification}
}
Document
Prefix-Free Parsing for Building Big BWTs

Authors: Christina Boucher, Travis Gagie, Alan Kuhnle, and Giovanni Manzini


Abstract
High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to enable the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice and thus can fit in a reasonable internal memory even when T is very large. Therefore, prefix-free parsing eases BWT construction, which is pertinent to many bioinformatics applications.
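
To make the parsing step concrete, here is a minimal, unoptimised sketch: a window of length w slides over T, positions whose window hashes to 0 modulo p become phrase boundaries, and consecutive phrases overlap by w characters. The hash, the parameter values and the function name prefix_free_parse are illustrative assumptions of this sketch, not the authors' implementation.

def kr_hash(s, mod=2_147_483_647):
    # Simple polynomial hash standing in for a Karp-Rabin rolling hash.
    h = 0
    for ch in s:
        h = (h * 256 + ord(ch)) % mod
    return h

def prefix_free_parse(text, w=10, p=100):
    t = text + "$" * w                        # sentinel suffix so the last phrase is well defined
    triggers = [0]
    for i in range(1, len(t) - w + 1):
        if kr_hash(t[i:i + w]) % p == 0:      # trigger window
            triggers.append(i)
    if triggers[-1] != len(t) - w:
        triggers.append(len(t) - w)           # the final sentinel window also triggers
    dictionary, parse = {}, []
    for a, b in zip(triggers, triggers[1:]):
        phrase = t[a:b + w]                   # consecutive phrases share w characters
        parse.append(dictionary.setdefault(phrase, len(dictionary)))
    return dictionary, parse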

Cite as

Christina Boucher, Travis Gagie, Alan Kuhnle, and Giovanni Manzini. Prefix-Free Parsing for Building Big BWTs. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 2:1-2:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{boucher_et_al:LIPIcs.WABI.2018.2,
  author =	{Boucher, Christina and Gagie, Travis and Kuhnle, Alan and Manzini, Giovanni},
  title =	{{Prefix-Free Parsing for Building Big BWTs}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{2:1--2:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.2},
  URN =		{urn:nbn:de:0030-drops-93044},
  doi =		{10.4230/LIPIcs.WABI.2018.2},
  annote =	{Keywords: Burrows-Wheeler Transform, prefix-free parsing, compression-aware algorithms, genomic databases}
}
Document
Detecting Mutations by eBWT

Authors: Nicola Prezza, Nadia Pisanti, Marinella Sciortino, and Giovanna Rosone


Abstract
In this paper we develop a theory describing how the extended Burrows-Wheeler Transform (eBWT) of a collection of DNA fragments tends to cluster together the copies of nucleotides sequenced from a genome G. Our theory accurately predicts how many copies of any nucleotide are expected inside each such cluster, and how an elegant and precise LCP-array-based procedure can locate these clusters in the eBWT. Our findings are very general and can be applied to a wide range of different problems. In this paper, we consider the case of alignment-free and reference-free SNP discovery in multiple collections of reads. We note that, in accordance with our theoretical results, SNPs are clustered in the eBWT of the reads collection, and we develop a tool that finds SNPs with a simple scan of the eBWT and LCP arrays. Preliminary results show that our method requires much less coverage than state-of-the-art tools while drastically improving precision and sensitivity.

Cite as

Nicola Prezza, Nadia Pisanti, Marinella Sciortino, and Giovanna Rosone. Detecting Mutations by eBWT. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 3:1-3:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{prezza_et_al:LIPIcs.WABI.2018.3,
  author =	{Prezza, Nicola and Pisanti, Nadia and Sciortino, Marinella and Rosone, Giovanna},
  title =	{{Detecting Mutations by eBWT}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{3:1--3:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.3},
  URN =		{urn:nbn:de:0030-drops-93051},
  doi =		{10.4230/LIPIcs.WABI.2018.3},
  annote =	{Keywords: BWT, LCP Array, SNPs, Reference-free, Assembly-free}
}
Document
Haplotype-aware graph indexes

Authors: Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict J. Paten, and Richard Durbin


Abstract
The variation graph toolkit (VG) represents genetic variation as a graph. Each path in the graph is a potential haplotype, though most paths are unlikely recombinations of true haplotypes. We augment the VG model with haplotype information to identify which paths are more likely to be correct. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by indexing the 1000 Genomes Project haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.

Cite as

Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict J. Paten, and Richard Durbin. Haplotype-aware graph indexes. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 4:1-4:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{siren_et_al:LIPIcs.WABI.2018.4,
  author =	{Sir\'{e}n, Jouni and Garrison, Erik and Novak, Adam M. and Paten, Benedict J. and Durbin, Richard},
  title =	{{Haplotype-aware graph indexes}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{4:1--4:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.4},
  URN =		{urn:nbn:de:0030-drops-93060},
  doi =		{10.4230/LIPIcs.WABI.2018.4},
  annote =	{Keywords: FM-indexes, variation graphs, haplotypes}
}
Document
Reconciling Multiple Genes Trees via Segmental Duplications and Losses

Authors: Riccardo Dondi, Manuel Lafond, and Celine Scornavacca


Abstract
Reconciling gene trees with a species tree is a fundamental problem for understanding the evolution of gene families. Many existing approaches reconcile each gene tree independently. However, it is well known that the evolution of gene families is interconnected. In this paper, we extend a previous approach to reconcile a set of gene trees with a species tree based on segmental macro-evolutionary events, where segmental duplication events and losses are associated with cost delta and lambda, respectively. We show that the problem is polynomial-time solvable when delta <= lambda (via LCA-mapping), while if delta > lambda the problem is NP-hard, even when lambda = 0 and a single gene tree is given, solving a long-standing open problem on the complexity of the reconciliation problem. On the positive side, we give a fixed-parameter algorithm for the problem, parameterised by delta/lambda and the number d of segmental duplications, with time complexity O(ceil(delta/lambda)^d * n * delta/lambda). Finally, we demonstrate the usefulness of this algorithm on two previously studied real datasets: we first show that our method can be used to confirm or refute hypothetical segmental duplications on a set of 16 eukaryotes, and then show how we can detect whole genome duplications in yeast genomes.
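
For the polynomial-time case delta <= lambda, the abstract points to the classical LCA mapping. The sketch below computes that mapping for a single binary gene tree (species tree supplied as parent and depth maps) and counts the nodes it marks as duplications; it only illustrates the mapping, not the segmental model or the FPT algorithm of the paper, and all names are hypothetical.

def lca(u, v, parent, depth):
    # Walk two species-tree nodes up to their lowest common ancestor.
    while u != v:
        if depth[u] >= depth[v]:
            u = parent[u]
        else:
            v = parent[v]
    return u

def lca_mapping(gene_tree, leaf_species, parent, depth):
    """gene_tree: nested 2-tuples with leaf names at the tips.
    Returns (species node the root maps to, number of duplication nodes)."""
    dups = 0

    def walk(node):
        nonlocal dups
        if not isinstance(node, tuple):
            return leaf_species[node]         # a leaf maps to its own species
        left, right = map(walk, node)
        here = lca(left, right, parent, depth)
        if here in (left, right):             # same mapping as a child: duplication
            dups += 1
        return here

    return walk(gene_tree), dups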

Cite as

Riccardo Dondi, Manuel Lafond, and Celine Scornavacca. Reconciling Multiple Genes Trees via Segmental Duplications and Losses. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 5:1-5:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{dondi_et_al:LIPIcs.WABI.2018.5,
  author =	{Dondi, Riccardo and Lafond, Manuel and Scornavacca, Celine},
  title =	{{Reconciling Multiple Genes Trees via Segmental Duplications and Losses}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{5:1--5:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.5},
  URN =		{urn:nbn:de:0030-drops-93079},
  doi =		{10.4230/LIPIcs.WABI.2018.5},
  annote =	{Keywords: Gene trees/species tree reconciliation, phylogenetics, computational complexity, fixed-parameter algorithms}
}
Document
Protein Classification with Improved Topological Data Analysis

Authors: Tamal K. Dey and Sayan Mandal


Abstract
Automated annotation and analysis of protein molecules have long been a topic of interest due to immediate applications in medicine and drug design. In this work, we propose a topology-based, fast, scalable, and parameter-free technique to generate protein signatures. We build an initial simplicial complex using information about the protein's constituent atoms, including their radii and existing chemical bonds, to model the hierarchical structure of the molecule. Simplicial collapse is used to construct a filtration, which we use to compute persistent homology. This information constitutes our signature for the protein. In addition, we demonstrate that this technique scales well to large proteins. Our method shows sizable time and memory improvements compared to other topology-based approaches. We use the signature to train a protein domain classifier. Finally, we compare this classifier against models built from state-of-the-art structure-based protein signatures on standard datasets and achieve a substantial improvement in accuracy.

Cite as

Tamal K. Dey and Sayan Mandal. Protein Classification with Improved Topological Data Analysis. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 6:1-6:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{dey_et_al:LIPIcs.WABI.2018.6,
  author =	{Dey, Tamal K. and Mandal, Sayan},
  title =	{{Protein Classification with Improved Topological Data Analysis}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{6:1--6:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.6},
  URN =		{urn:nbn:de:0030-drops-93082},
  doi =		{10.4230/LIPIcs.WABI.2018.6},
  annote =	{Keywords: topological data analysis, persistent homology, simplicial collapse, supervised learning, topology based protein feature vector, protein classification}
}
Document
A Dynamic Algorithm for Network Propagation

Authors: Barak Sternberg and Roded Sharan


Abstract
Network propagation is a powerful transformation that amplifies the signal-to-noise ratio in biological and other data. To date, most of its applications in the biological domain have employed standard techniques for its computation that require O(m) time for a network with n vertices and m edges. When applied in a dynamic setting where the network is constantly modified, the cost of these computations becomes prohibitive. Here we study, for the first time in the biological context, the complexity of dynamic algorithms for network propagation. We develop a vertex-decremental algorithm that is motivated by various biological applications and can maintain propagation scores over general weights at an amortized cost of O(m/(n^{1/4})) per update. In application to real networks, the dynamic algorithm achieves significant, 50- to 100-fold, speedups over conventional static methods for network propagation, demonstrating its great potential in practice.
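
For context, the static computation that the dynamic algorithm speeds up is typically an iterative propagation (random walk with restart) of the form F <- alpha * W_norm * F + (1 - alpha) * F0. A minimal numpy sketch of that static baseline follows; the symmetric normalisation and the name propagate are assumptions of the sketch.

import numpy as np

def propagate(W, F0, alpha=0.8, tol=1e-8, max_iter=1000):
    # W: non-negative symmetric weight matrix; F0: prior scores per vertex.
    d = W.sum(axis=1)
    d[d == 0] = 1.0
    Wn = W / np.sqrt(np.outer(d, d))          # D^{-1/2} W D^{-1/2}
    F = F0.astype(float).copy()
    for _ in range(max_iter):
        F_next = alpha * Wn @ F + (1 - alpha) * F0
        if np.abs(F_next - F).sum() < tol:
            break
        F = F_next
    return F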

Cite as

Barak Sternberg and Roded Sharan. A Dynamic Algorithm for Network Propagation. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 7:1-7:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{sternberg_et_al:LIPIcs.WABI.2018.7,
  author =	{Sternberg, Barak and Sharan, Roded},
  title =	{{A Dynamic Algorithm for Network Propagation}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{7:1--7:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.7},
  URN =		{urn:nbn:de:0030-drops-93095},
  doi =		{10.4230/LIPIcs.WABI.2018.7},
  annote =	{Keywords: Network propagation, Dynamic graph algorithm, protein-protein interaction network}
}
Document
New Absolute Fast Converging Phylogeny Estimation Methods with Improved Scalability and Accuracy

Authors: Qiuyi (Richard) Zhang, Satish Rao, and Tandy Warnow


Abstract
Absolute fast converging (AFC) phylogeny estimation methods are ones that have been proven to recover the true tree with high probability given sequences whose lengths are polynomial in the number of leaves in the tree (once the shortest and longest branch lengths are fixed). While there has been a large literature on AFC methods, the best in terms of empirical performance was DCM_NJ, published in SODA 2001. The main empirical advantage of DCM_NJ over other AFC methods is its use of neighbor joining (NJ) to construct trees on smaller taxon subsets, which are then combined into a tree on the full set of species using a supertree method; in contrast, the other AFC methods in essence depend on quartet trees that are computed independently of each other, which reduces accuracy compared to neighbor joining. However, DCM_NJ is unlikely to scale to large datasets due to its reliance on supertree methods, as no current supertree methods are able to scale to large datasets with high accuracy. In this study we present a new approach to large-scale phylogeny estimation that shares some of the features of DCM_NJ but bypasses the use of supertree methods. We prove that this new approach is AFC and runs in polynomial time. Furthermore, we describe variations on this basic approach that can be used with leaf-disjoint constraint trees (computed using methods such as maximum likelihood) to produce other AFC methods that are likely to provide even better accuracy. Thus, we present a new generalizable technique for large-scale tree estimation that is designed to improve scalability for phylogeny estimation methods to ultra-large datasets, and that can be used in a variety of settings (including tree estimation from unaligned sequences, and species tree estimation from gene trees).

Cite as

Qiuyi (Richard) Zhang, Satish Rao, and Tandy Warnow. New Absolute Fast Converging Phylogeny Estimation Methods with Improved Scalability and Accuracy. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 8:1-8:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{zhang_et_al:LIPIcs.WABI.2018.8,
  author =	{Zhang, Qiuyi (Richard) and Rao, Satish and Warnow, Tandy},
  title =	{{New Absolute Fast Converging Phylogeny Estimation Methods with Improved Scalability and Accuracy}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{8:1--8:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.8},
  URN =		{urn:nbn:de:0030-drops-93108},
  doi =		{10.4230/LIPIcs.WABI.2018.8},
  annote =	{Keywords: phylogeny estimation, short quartets, sample complexity, absolute fast converging methods, neighbor joining, maximum likelihood}
}
Document
An Average-Case Sublinear Exact Li and Stephens Forward Algorithm

Authors: Yohei M. Rosen and Benedict J. Paten


Abstract
Hidden Markov models of haplotype inheritance, such as the Li and Stephens model, allow for computationally tractable probability calculations using the forward algorithm as long as the representative reference panel used in the model is sufficiently small. Specifically, the monoploid Li and Stephens model and its variants are linear in reference panel size unless heuristic approximations are used. However, sequencing projects numbering in the thousands to hundreds of thousands of individuals are underway, and others numbering in the millions are anticipated. To make the Li and Stephens forward algorithm computationally tractable for these datasets, we have created a numerically exact version of the algorithm with observed average-case O(nk^{0.35}) runtime in the number of genetic sites n and reference panel size k. This avoids any tradeoff between runtime and model complexity. We demonstrate that our approach also provides a succinct data structure for general-purpose haplotype data storage. We discuss generalizations of our algorithmic techniques to other hidden Markov models.
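
For reference, the monoploid Li and Stephens forward recursion that is linear in the panel size k per site, i.e. the baseline this paper accelerates, looks roughly as follows; constant per-site recombination and mismatch parameters and the name ls_forward are simplifying assumptions of this sketch.

import numpy as np

def ls_forward(panel, obs, rho=1e-3, mu=1e-3):
    """panel: k x n array of haplotype alleles; obs: length-n observed alleles.
    Returns the total forward probability under a simple Li and Stephens HMM."""
    k, n = panel.shape
    emit = lambda j: np.where(panel[:, j] == obs[j], 1.0 - mu, mu)
    f = emit(0) / k                           # uniform start over the k panel haplotypes
    for j in range(1, n):
        s = f.sum()
        f = emit(j) * ((1.0 - rho) * f + rho * s / k)   # stay on a haplotype or recombine uniformly
    return f.sum()                            # rescaling per site would be needed to avoid underflow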

Cite as

Yohei M. Rosen and Benedict J. Paten. An Average-Case Sublinear Exact Li and Stephens Forward Algorithm. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 9:1-9:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{rosen_et_al:LIPIcs.WABI.2018.9,
  author =	{Rosen, Yohei M. and Paten, Benedict J.},
  title =	{{An Average-Case Sublinear Exact Li and Stephens Forward Algorithm}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{9:1--9:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.9},
  URN =		{urn:nbn:de:0030-drops-93116},
  doi =		{10.4230/LIPIcs.WABI.2018.9},
  annote =	{Keywords: Haplotype, Hidden Markov Model, Forward Algorithm, Lazy Evaluation}
}
Document
External memory BWT and LCP computation for sequence collections with applications

Authors: Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, and Guilherme P. Telles


Abstract
We propose an external memory algorithm for the computation of the BWT and LCP array for a collection of sequences. Our algorithm takes the amount of available memory as an input parameter, and tries to make the best use of it by splitting the input collection into subcollections sufficiently small that it can compute their BWTs in RAM using an optimal linear-time algorithm. Next, it merges the partial BWTs in external memory and in the process also computes the LCP values. We show that our algorithm performs O(n maxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm outperforms the current best algorithm for collections of sequences with different lengths and when the average LCP of the collection is relatively small compared to the length of the sequences. In the second part of the paper, we show that our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP arrays, provide simple, scan-based, external memory algorithms for three well-known problems in bioinformatics: the computation of the all pairs suffix-prefix overlaps, the computation of maximal repeats, and the construction of succinct de Bruijn graphs.

Cite as

Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, and Guilherme P. Telles. External memory BWT and LCP computation for sequence collections with applications. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 10:1-10:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{egidi_et_al:LIPIcs.WABI.2018.10,
  author =	{Egidi, Lavinia and Louza, Felipe A. and Manzini, Giovanni and Telles, Guilherme P.},
  title =	{{External memory BWT and LCP computation for sequence collections with applications}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{10:1--10:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.10},
  URN =		{urn:nbn:de:0030-drops-93122},
  doi =		{10.4230/LIPIcs.WABI.2018.10},
  annote =	{Keywords: Burrows-Wheeler Transform, Longest Common Prefix Array, All pairs suffix-prefix overlaps, Succinct de Bruijn graph, Maximal repeats}
}
Document
Kermit: Guided Long Read Assembly using Coloured Overlap Graphs

Authors: Riku Walve, Pasi Rastas, and Leena Salmela


Abstract
With long reads getting even longer and cheaper, large-scale sequencing projects can be accomplished without short reads at an affordable cost. Due to high error rates and less mature tools, de novo assembly of long reads is still challenging and often results in a large collection of contigs. Dense linkage maps are collections of markers whose location on the genome is approximately known, and therefore they provide long-range information that has the potential to greatly aid de novo assembly. Previously, linkage maps have been used to detect misassemblies and to manually order contigs. However, no fully automated tools exist to incorporate linkage maps into assembly; instead, large amounts of manual labour are needed to order the contigs into chromosomes. We formulate the genome assembly problem in the presence of linkage maps and present the first method for guided genome assembly using linkage maps. Our method is based on an additional cleaning step added to the assembly. We show that it can simplify the underlying assembly graph, resulting in more contiguous assemblies and reducing the number of misassemblies compared to de novo assembly.

Cite as

Riku Walve, Pasi Rastas, and Leena Salmela. Kermit: Guided Long Read Assembly using Coloured Overlap Graphs. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 11:1-11:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{walve_et_al:LIPIcs.WABI.2018.11,
  author =	{Walve, Riku and Rastas, Pasi and Salmela, Leena},
  title =	{{Kermit: Guided Long Read Assembly using Coloured Overlap Graphs}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{11:1--11:11},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.11},
  URN =		{urn:nbn:de:0030-drops-93135},
  doi =		{10.4230/LIPIcs.WABI.2018.11},
  annote =	{Keywords: Genome assembly, Linkage maps, Coloured overlap graph}
}
Document
A Succinct Solution to Rmap Alignment

Authors: Martin D. Muggli, Simon J. Puglisi, and Christina Boucher


Abstract
We present Kohdista, an index-based algorithm for finding pairwise alignments between single-molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching and the application of modern index-based data structures. In particular, we combine the use of the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree in order to build Kohdista. We validate Kohdista on simulated E. coli data, showing that the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions. Lastly, we demonstrate that Kohdista is the only method capable of finding a significant number of high-quality pairwise Rmap alignments for large eukaryote organisms in reasonable time. Kohdista is available at https://github.com/mmuggli/KOHDISTA/.

Cite as

Martin D. Muggli, Simon J. Puglisi, and Christina Boucher. A Succinct Solution to Rmap Alignment. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 12:1-12:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{muggli_et_al:LIPIcs.WABI.2018.12,
  author =	{Muggli, Martin D. and Puglisi, Simon J. and Boucher, Christina},
  title =	{{A Succinct Solution to Rmap Alignment}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{12:1--12:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.12},
  URN =		{urn:nbn:de:0030-drops-93143},
  doi =		{10.4230/LIPIcs.WABI.2018.12},
  annote =	{Keywords: Optical mapping, index based data structures, FM-index, graph algorithms}
}
Document
Spalter: A Meta Machine Learning Approach to Distinguish True DNA Variants from Sequencing Artefacts

Authors: Till Hartmann and Sven Rahmann


Abstract
Being able to distinguish between true DNA variants and technical sequencing artefacts is a fundamental task in whole genome, exome or targeted gene analysis. Variant calling tools provide diagnostic parameters, such as strand bias or an aggregated overall quality for each called variant, to help users make an informed choice about which variants to accept or discard. Having several such quality indicators poses a problem for the users of variant callers because they need to set or adjust thresholds for each such indicator. Alternatively, machine learning methods can be used to train a classifier based on these indicators. This approach needs large sets of labeled training data, which are not easily available. The new approach presented here relies on the idea that a true DNA variant exists independently of technical features of the read in which it appears (e.g. base quality, strand, position in the read). Therefore the nucleotide separability classification problem - predicting the nucleotide state of each read in a given pileup based on technical features only - should be nearly impossible to solve for true variants. Nucleotide separability, i.e. achievable classification accuracy, can either be used to distinguish between true variants and technical artefacts directly, using a thresholding approach, or it can be used as a meta-feature to train a separability-based classifier. This article explores both possibilities with promising results, showing accuracies around 90%.
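
The separability idea can be illustrated directly: train a classifier to predict each read's nucleotide at the candidate site from technical features only, and use its cross-validated accuracy as the separability score. The feature choice and function name below are hypothetical; this is a sketch of the idea, not the Spalter implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def separability(features, nucleotides, cv=5):
    """features: per-read technical features (e.g. base quality, strand, position in read);
    nucleotides: the base each read shows at the candidate site."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, nucleotides, cv=cv).mean()

# High separability (the nucleotide is predictable from technical features alone) points
# to an artefact; for a true variant the score should stay near the majority-class baseline.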

Cite as

Till Hartmann and Sven Rahmann. Spalter: A Meta Machine Learning Approach to Distinguish True DNA Variants from Sequencing Artefacts. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 13:1-13:8, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{hartmann_et_al:LIPIcs.WABI.2018.13,
  author =	{Hartmann, Till and Rahmann, Sven},
  title =	{{Spalter: A Meta Machine Learning Approach to Distinguish True DNA Variants from Sequencing Artefacts}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{13:1--13:8},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.13},
  URN =		{urn:nbn:de:0030-drops-93158},
  doi =		{10.4230/LIPIcs.WABI.2018.13},
  annote =	{Keywords: variant calling, sequencing error, technical artefact, meta machine learning, classification}
}
Document
Essential Simplices in Persistent Homology and Subtle Admixture Detection

Authors: Saugata Basu, Filippo Utro, and Laxmi Parida


Abstract
We introduce a robust mathematical definition of the notion of essential elements in a basis of the homology space and prove that these elements are unique. Next we give a novel visualization of the essential elements of the basis of the homology space through a rainfall-like plot (RFL). This plot is data-centric, i.e., it is associated with the individual samples of the data, as opposed to the structure-centric barcodes of persistent homology. As a proof of concept, the approach was tested on data generated by SimRA, which simulates different admixture scenarios. We show that the barcode analysis can be used not just to detect the presence of admixture but also to estimate the number of admixed populations. We also demonstrate that data-centric RFL plots have the potential to further disentangle the common history into admixture events and the relative timing of the events, even in very complex scenarios.

Cite as

Saugata Basu, Filippo Utro, and Laxmi Parida. Essential Simplices in Persistent Homology and Subtle Admixture Detection. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 14:1-14:10, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{basu_et_al:LIPIcs.WABI.2018.14,
  author =	{Basu, Saugata and Utro, Filippo and Parida, Laxmi},
  title =	{{Essential Simplices in Persistent Homology and Subtle Admixture Detection}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{14:1--14:10},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.14},
  URN =		{urn:nbn:de:0030-drops-93166},
  doi =		{10.4230/LIPIcs.WABI.2018.14},
  annote =	{Keywords: population admixture, topological data analysis, persistent homology, population evolution}
}
Document
Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time

Authors: Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, and Veli Mäkinen


Abstract
Given a threshold L and a set R = {R_1, ..., R_m} of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1,n] into a set P of disjoint segments such that each segment [a,b] in P has length at least L and the number d(a,b) = |{R_i[a,b] : 1 <= i <= m}| of distinct substrings at segment [a,b] is minimized over [a,b] in P. The distinct substrings in the segments represent founder blocks that can be concatenated to form max{d(a,b) : [a,b] in P} founder sequences representing the original R such that crossovers happen only at segment boundaries. We give an optimal O(mn)-time algorithm to solve the problem, improving on the earlier O(mn^2) bound. This improvement makes it possible to apply the algorithm in a pan-genomic setting where the input strings are aligned haplotype sequences of complete human chromosomes, with the goal of finding a representative set of references that can be indexed for read alignment and variant calling. We implemented the new algorithm and give experimental evidence of the practicality of the approach in this pan-genomic setting.
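
A direct quadratic dynamic program for the problem as stated above (essentially the shape of the earlier O(mn^2)-style approach, not the paper's linear-time algorithm) can be written in a few lines; the name min_founders is illustrative and columns are 0-indexed.

def min_founders(rows, L):
    """rows: m equal-length haplotype strings; L: minimum segment length.
    Returns the minimum achievable maximum number of distinct segment texts."""
    n = len(rows[0])
    INF = float("inf")
    f = [INF] * (n + 1)                       # f[j]: best value for the prefix of length j
    f[0] = 0
    for j in range(L, n + 1):
        for a in range(0, j - L + 1):         # last segment is [a, j), length >= L
            if f[a] == INF:
                continue
            d = len({r[a:j] for r in rows})   # distinct substrings in the segment
            f[j] = min(f[j], max(f[a], d))
    return f[n]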

Cite as

Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, and Veli Mäkinen. Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 15:1-15:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{norri_et_al:LIPIcs.WABI.2018.15,
  author =	{Norri, Tuukka and Cazaux, Bastien and Kosolobov, Dmitry and M\"{a}kinen, Veli},
  title =	{{Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{15:1--15:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.15},
  URN =		{urn:nbn:de:0030-drops-93175},
  doi =		{10.4230/LIPIcs.WABI.2018.15},
  annote =	{Keywords: Pan-genome indexing, founder reconstruction, dynamic programming, positional Burrows-Wheeler transform, range minimum query}
}
Document
Multiple-Choice Knapsack for Assigning Partial Atomic Charges in Drug-Like Molecules

Authors: Martin S. Engler, Bertrand Caron, Lourens Veen, Daan P. Geerke, Alan E. Mark, and Gunnar W. Klau


Abstract
A key factor in computational drug design is the consistency and reliability with which intermolecular interactions between a wide variety of molecules can be described. Here we present a procedure to efficiently, reliably and automatically assign partial atomic charges to atoms based on known distributions. We formally introduce the molecular charge assignment problem, where the task is to select a charge from a set of candidate charges for every atom of a given query molecule. Charges are accompanied by a score that depends on their observed frequency in similar neighbourhoods (chemical environments) in a database of previously parameterised molecules. The aim is to assign the charges such that the total charge equals a known target charge within a margin of error while maximizing the sum of the charge scores. We show that the problem is a variant of the well-studied multiple-choice knapsack problem and thus weakly NP-complete. We propose solutions based on Integer Linear Programming and a pseudo-polynomial time Dynamic Programming algorithm. We show that the results obtained for novel molecules not included in the database are comparable to those obtained by performing explicit charge calculations, while decreasing the time needed to determine partial charges for a molecule by several orders of magnitude, that is, from hours or even days to below a second. Our software is openly available at https://github.com/enitram/charge_assign.
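
The pseudo-polynomial dynamic program mentioned above follows the textbook multiple-choice knapsack pattern: process atoms one by one, keep the best score achievable for every reachable (discretised) total charge, and finally accept totals within the error margin of the target. The discretisation scale and the function name are assumptions of this sketch, and backtracking to recover the actual assignment is omitted.

from collections import defaultdict

def assign_charges(candidates, target, eps, scale=1000):
    """candidates: one list of (charge, score) pairs per atom.
    Returns the best total score whose total charge is within eps of target, or None."""
    best = {0: 0.0}                                   # scaled total charge -> best score
    for options in candidates:
        nxt = defaultdict(lambda: float("-inf"))
        for tot, sc in best.items():
            for q, s in options:
                key = tot + round(q * scale)
                nxt[key] = max(nxt[key], sc + s)
        best = nxt
    lo, hi = round((target - eps) * scale), round((target + eps) * scale)
    feasible = [sc for tot, sc in best.items() if lo <= tot <= hi]
    return max(feasible) if feasible else None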

Cite as

Martin S. Engler, Bertrand Caron, Lourens Veen, Daan P. Geerke, Alan E. Mark, and Gunnar W. Klau. Multiple-Choice Knapsack for Assigning Partial Atomic Charges in Drug-Like Molecules. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 16:1-16:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{engler_et_al:LIPIcs.WABI.2018.16,
  author =	{Engler, Martin S. and Caron, Bertrand and Veen, Lourens and Geerke, Daan P. and Mark, Alan E. and Klau, Gunnar W.},
  title =	{{Multiple-Choice Knapsack for Assigning Partial Atomic Charges in Drug-Like Molecules}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{16:1--16:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.16},
  URN =		{urn:nbn:de:0030-drops-93187},
  doi =		{10.4230/LIPIcs.WABI.2018.16},
  annote =	{Keywords: Multiple-choice knapsack, integer linear programming, pseudo-polynomial dynamic programming, partial charge assignment, molecular dynamics simulations}
}
Document
l1-Penalised Ordinal Polytomous Regression Estimators with Application to Gene Expression Studies

Authors: Stéphane Chrétien, Christophe Guyeux, and Serge Moulin


Abstract
Qualitative but ordered random variables, such as severity of a pathology, are of paramount importance in biostatistics and medicine. Understanding the conditional distribution of such qualitative variables as a function of other explanatory variables can be performed using a specific regression model known as ordinal polytomous regression. Variable selection in the ordinal polytomous regression model is a computationally difficult combinatorial optimisation problem which is however crucial when practitioners need to understand which covariates are physically related to the output and which covariates are not. One easy way to circumvent the computational hardness of variable selection is to introduce a penalised maximum likelihood estimator based on some well-chosen non-smooth penalisation function such as, e.g., the l_1-norm. In the case of the Gaussian linear model, the l_1-penalised least-squares estimator, also known as the LASSO estimator, has attracted a lot of attention in the last decade, both from the theoretical and algorithmic viewpoints. However, even in the Gaussian linear model, accurate calibration of the relaxation parameter, i.e., the relative weight of the penalisation term in the estimation cost function, is still considered a difficult problem that has to be addressed with caution. In the present paper, we apply l_1-penalisation to the ordinal polytomous regression model and compare several hyper-parameter calibration strategies. Our main contributions are: (a) a useful and simple l_1-penalised estimator for ordinal polytomous regression and a thorough description of how to apply Nesterov's accelerated gradient and the online Frank-Wolfe methods to the problem of computing this estimator, (b) a new hyper-parameter calibration method for the proposed model, based on the QUT idea of Giacobino et al., and (c) freely available code that implements the proposed estimation procedure.
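
As an illustration of the optimisation machinery involved, here is Nesterov-style accelerated proximal gradient (FISTA) applied to the familiar least-squares LASSO objective; the paper applies such schemes to the ordinal polytomous likelihood instead, so treat this purely as a stand-in sketch with assumed names.

import numpy as np

def fista_lasso(X, y, lam, n_iter=500):
    # Minimise 0.5 * ||X b - y||^2 + lam * ||b||_1 with accelerated proximal gradient.
    L = np.linalg.norm(X, 2) ** 2             # Lipschitz constant of the smooth part
    b = np.zeros(X.shape[1])
    z, t = b.copy(), 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y)
        w = z - grad / L
        b_new = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)   # soft thresholding
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = b_new + ((t - 1.0) / t_new) * (b_new - b)
        b, t = b_new, t_new
    return b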

Cite as

Stéphane Chrétien, Christophe Guyeux, and Serge Moulin. l1-Penalised Ordinal Polytomous Regression Estimators with Application to Gene Expression Studies. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 17:1-17:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{chretien_et_al:LIPIcs.WABI.2018.17,
  author =	{Chr\'{e}tien, St\'{e}phane and Guyeux, Christophe and Moulin, Serge},
  title =	{{l1-Penalised Ordinal Polytomous Regression Estimators with Application to Gene Expression Studies}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{17:1--17:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.17},
  URN =		{urn:nbn:de:0030-drops-93199},
  doi =		{10.4230/LIPIcs.WABI.2018.17},
  annote =	{Keywords: LASSO, ordinal polytomous regression, Quantile Universal Threshold, Frank-Wolfe algorithm, Nesterov algorithm}
}
Document
Differentially Mutated Subnetworks Discovery

Authors: Morteza Chalabi Hajkarim, Eli Upfal, and Fabio Vandin


Abstract
We study the problem of identifying differentially mutated subnetworks of a large gene-gene interaction network, that is, subnetworks that display a significant difference in mutation frequency between two sets of cancer samples. We formally define the associated computational problem and show that the problem is NP-hard. We propose a novel and efficient algorithm, called DAMOKLE, to identify differentially mutated subnetworks given genome-wide mutation data for two sets of cancer samples. We prove that DAMOKLE identifies subnetworks with a statistically significant difference in mutation frequency when the data come from a reasonable generative model, provided enough samples are available. We test DAMOKLE on simulated and real data, showing that DAMOKLE does indeed find subnetworks with significant differences in mutation frequency and that it provides novel insights not obtained by standard methods.
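
The quantity at the heart of the problem is the difference between the fractions of samples in the two cohorts that carry at least one mutation in a candidate subnetwork. A toy scoring function for a fixed gene set (illustrative names, not the DAMOKLE code) is below.

def differential_coverage(gene_set, cohort_c, cohort_d):
    """cohort_c, cohort_d: dicts mapping sample id -> set of mutated genes.
    Returns coverage(C) - coverage(D) for the given gene set."""
    gene_set = set(gene_set)
    def coverage(cohort):
        return sum(1 for genes in cohort.values() if genes & gene_set) / len(cohort)
    return coverage(cohort_c) - coverage(cohort_d)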

Cite as

Morteza Chalabi Hajkarim, Eli Upfal, and Fabio Vandin. Differentially Mutated Subnetworks Discovery. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 18:1-18:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{hajkarim_et_al:LIPIcs.WABI.2018.18,
  author =	{Hajkarim, Morteza Chalabi and Upfal, Eli and Vandin, Fabio},
  title =	{{Differentially Mutated Subnetworks Discovery}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{18:1--18:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.18},
  URN =		{urn:nbn:de:0030-drops-93202},
  doi =		{10.4230/LIPIcs.WABI.2018.18},
  annote =	{Keywords: Cancer genomics, network analysis, combinatorial algorithm}
}
Document
DGEN: A Test Statistic for Detection of General Introgression Scenarios

Authors: Ryan A. Leo Elworth, Chabrielle Allen, Travis Benedict, Peter Dulworth, and Luay Nakhleh


Abstract
When two species hybridize, one outcome is the integration of genetic material from one species into the genome of the other, a process known as introgression. Detecting introgression in genomic data is a very important question in evolutionary biology. However, given that hybridization occurs between closely related species, a complicating factor for introgression detection is the presence of incomplete lineage sorting, or ILS. The D-statistic, famously referred to as the "ABBA-BABA" test, was proposed for introgression detection in the presence of ILS in data sets that consist of four genomes. More recently, D_FOIL - a set of statistics - was introduced to extend the D-statistic to data sets of five genomes. The major contribution of this paper is demonstrating that the invariants underlying both the D-statistic and D_FOIL can be derived automatically from the probability mass functions of gene tree topologies under the null species tree model and alternative phylogenetic network model. Computational requirements aside, this automatic derivation provides a way to generalize these statistics to data sets of any size and with any scenarios of introgression. We demonstrate the accuracy of the general statistic, which we call D_GEN, on simulated data sets with varying rates of introgression, and apply it to an empirical data set of mosquito genomes. We have implemented D_GEN and made it available, both as a graphical user interface tool and as a command-line tool, as part of the freely available, open-source software package ALPHA (https://github.com/chilleo/ALPHA).
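
For orientation, the classical four-taxon D-statistic ("ABBA-BABA" test) that D_GEN generalises can be computed directly from site patterns. The sketch below assumes aligned sequences for the topology (((P1,P2),P3),O) and an illustrative function name.

def d_statistic(p1, p2, p3, outgroup):
    # Count biallelic ABBA and BABA site patterns and return their normalised difference.
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if len({a, b, c, o}) != 2 or a == b or c == o:
            continue                           # keep only informative biallelic patterns
        if a == o and b == c:
            abba += 1                          # ABBA: P2 and P3 share the derived allele
        elif b == o and a == c:
            baba += 1                          # BABA: P1 and P3 share the derived allele
    return (abba - baba) / (abba + baba) if abba + baba else 0.0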

Cite as

Ryan A. Leo Elworth, Chabrielle Allen, Travis Benedict, Peter Dulworth, and Luay Nakhleh. DGEN: A Test Statistic for Detection of General Introgression Scenarios. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 19:1-19:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{elworth_et_al:LIPIcs.WABI.2018.19,
  author =	{Elworth, Ryan A. Leo and Allen, Chabrielle and Benedict, Travis and Dulworth, Peter and Nakhleh, Luay},
  title =	{{DGEN: A Test Statistic for Detection of General Introgression Scenarios}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{19:1--19:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.19},
  URN =		{urn:nbn:de:0030-drops-93218},
  doi =		{10.4230/LIPIcs.WABI.2018.19},
  annote =	{Keywords: Introgression, genealogies, phylogenetic networks}
}
Document
PRINCE: Accurate Approximation of the Copy Number of Tandem Repeats

Authors: Mehrdad Mansouri, Julian Booth, Margaryta Vityaz, Cedric Chauve, and Leonid Chindelevitch


Abstract
Variable-Number Tandem Repeats (VNTRs) are genomic regions where a short sequence of DNA is repeated with no space in between repeats. While a fixed set of VNTRs is typically identified for a given species, the copy number at each VNTR varies between individuals within a species. Although VNTRs are found in both prokaryotic and eukaryotic genomes, the methodology called multi-locus VNTR analysis (MLVA) is widely used to distinguish different strains of bacteria, as well as to cluster strains that might be epidemiologically related and to investigate evolutionary rates. We propose PRINCE (Processing Reads to Infer the Number of Copies via Estimation), an algorithm that is able to accurately estimate the copy number of a VNTR given the sequence of a single repeat unit and a set of short reads from a whole-genome sequencing (WGS) experiment. This is a challenging problem, especially in the cases when the repeat region is longer than the expected read length. Our proposed method computes a statistical approximation of the local coverage inside the repeat region. This approximation is then mapped to the copy number using a linear function whose parameters are fitted to simulated data. We test PRINCE on the genomes in three datasets of Mycobacterium tuberculosis strains and show that it is more than twice as accurate as a previous method. An implementation of PRINCE in the Python language is freely available at https://github.com/WGS-TB/PythonPRINCE.
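
The last step described above, mapping a coverage statistic to a copy-number estimate with a linear function fitted on simulated data, can be illustrated in a few lines; the names and the use of numpy.polyfit are assumptions of this sketch, not the PRINCE code.

import numpy as np

def fit_linear_map(sim_normalised_coverage, sim_true_copy_numbers):
    # Fit copy_number ~ slope * normalised_coverage + intercept on simulated data.
    slope, intercept = np.polyfit(sim_normalised_coverage, sim_true_copy_numbers, deg=1)
    return slope, intercept

def estimate_copy_number(local_coverage, baseline_coverage, slope, intercept):
    # Normalise the coverage observed inside the repeat region by the genome-wide
    # baseline, then apply the fitted linear map.
    return slope * (local_coverage / baseline_coverage) + intercept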

Cite as

Mehrdad Mansouri, Julian Booth, Margaryta Vityaz, Cedric Chauve, and Leonid Chindelevitch. PRINCE: Accurate Approximation of the Copy Number of Tandem Repeats. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 20:1-20:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


BibTeX

@InProceedings{mansouri_et_al:LIPIcs.WABI.2018.20,
  author =	{Mansouri, Mehrdad and Booth, Julian and Vityaz, Margaryta and Chauve, Cedric and Chindelevitch, Leonid},
  title =	{{PRINCE: Accurate Approximation of the Copy Number of Tandem Repeats}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{20:1--20:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.20},
  URN =		{urn:nbn:de:0030-drops-93227},
  doi =		{10.4230/LIPIcs.WABI.2018.20},
  annote =	{Keywords: Variable-Number Tandem Repeats, Copy number, Bacterial genomics}
}
Document
Degenerate String Comparison and Applications

Authors: Mai Alzamel, Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone


Abstract
A generalised degenerate string (GD string) S^ is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length k_i but this length can vary between different sets. We denote the sum of these lengths k_0, k_1,...,k_{n-1} by W. This type of uncertain sequence can represent, for example, a gapless multiple sequence alignment of width W in a compact form. Our first result in this paper is an O(N+M)-time algorithm for deciding whether the intersection of two GD strings of total sizes N and M, respectively, over an integer alphabet, is non-empty. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in only linear space. A similar result can be obtained by employing an automata-based approach but its cost is alphabet-dependent. We then apply our string comparison algorithm to compute palindromes in GD strings. We present an O(min{W,n^2}N)-time algorithm for computing all palindromes in S^. Furthermore, we show a similar conditional lower bound for computing maximal palindromes in S^. Finally, proof-of-concept experimental results are presented using real protein datasets.

Cite as

Mai Alzamel, Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, and Giovanna Rosone. Degenerate String Comparison and Applications. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 21:1-21:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Copy BibTex To Clipboard

@InProceedings{alzamel_et_al:LIPIcs.WABI.2018.21,
  author =	{Alzamel, Mai and Ayad, Lorraine A. K. and Bernardini, Giulia and Grossi, Roberto and Iliopoulos, Costas S. and Pisanti, Nadia and Pissis, Solon P. and Rosone, Giovanna},
  title =	{{Degenerate String Comparison and Applications}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{21:1--21:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.21},
  URN =		{urn:nbn:de:0030-drops-93236},
  doi =		{10.4230/LIPIcs.WABI.2018.21},
  annote =	{Keywords: degenerate strings, generalised degenerate strings, elastic-degenerate strings, string comparison, palindromes}
}
Document
A Multi-labeled Tree Edit Distance for Comparing "Clonal Trees" of Tumor Progression

Authors: Nikolai Karpov, Salem Malikic, Md. Khaledur Rahman, and S. Cenk Sahinalp


Abstract
We introduce a new edit distance measure between a pair of "clonal trees", each representing the progression and mutational heterogeneity of a tumor sample, constructed using single-cell or bulk high-throughput sequencing data. In a clonal tree, each vertex represents a specific tumor clone and is labeled with one or more mutations, such that each mutation is assigned to the oldest clone that harbors it. Given two clonal trees, our multi-labeled tree edit distance (MLTED) measure is defined as the minimum number of mutation/label deletions, (empty) leaf deletions, and vertex (clonal) expansions, applied in any order, to convert each of the two trees to the maximal common tree. We show that the MLTED measure can be computed efficiently in polynomial time and that it captures the similarity between trees of different clonal granularity well. We have implemented our algorithm to compute MLTED exactly and applied it successfully to a variety of data sets. The source code of our method is available at https://github.com/khaled-rahman/leafDelTED.
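
The following minimal Python sketch is our own illustration rather than the authors' implementation: it only shows how a multi-labeled clonal tree might be represented, with each vertex carrying the set of mutations assigned to that clone. The MLTED operations themselves (label deletions, empty-leaf deletions, vertex expansions) are not implemented here.

# Minimal sketch (not the authors' implementation): a multi-labeled clonal
# tree in which each vertex carries the set of mutations first appearing in
# that clone. The helper below only collects all labels of a tree.
from dataclasses import dataclass, field

@dataclass
class Clone:
    mutations: set                          # mutations assigned to this clone
    children: list = field(default_factory=list)

def all_labels(node):
    """Return the union of mutation labels over the whole clonal tree."""
    labels = set(node.mutations)
    for child in node.children:
        labels |= all_labels(child)
    return labels

# Hypothetical toy tree: root clone {A, B} with two child clones.
tree = Clone({"A", "B"}, [Clone({"C"}), Clone({"D", "E"})])
print(sorted(all_labels(tree)))             # ['A', 'B', 'C', 'D', 'E']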

Cite as

Nikolai Karpov, Salem Malikic, Md. Khaledur Rahman, and S. Cenk Sahinalp. A Multi-labeled Tree Edit Distance for Comparing "Clonal Trees" of Tumor Progression. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 22:1-22:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Copy BibTex To Clipboard

@InProceedings{karpov_et_al:LIPIcs.WABI.2018.22,
  author =	{Karpov, Nikolai and Malikic, Salem and Rahman, Md. Khaledur and Sahinalp, S. Cenk},
  title =	{{A Multi-labeled Tree Edit Distance for Comparing "Clonal Trees" of Tumor Progression}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{22:1--22:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.22},
  URN =		{urn:nbn:de:0030-drops-93242},
  doi =		{10.4230/LIPIcs.WABI.2018.22},
  annote =	{Keywords: Intra-tumor heterogeneity, tumor evolution, multi-labeled tree, tree edit distance, dynamic programming}
}
Document
Heuristic Algorithms for the Maximum Colorful Subtree Problem

Authors: Kai Dührkop, Marie A. Lataretu, W. Timothy J. White, and Sebastian Böcker


Abstract
In metabolomics, small molecules are structurally elucidated using tandem mass spectrometry (MS/MS); this computational task can be formulated as the Maximum Colorful Subtree problem, which is NP-hard. Unfortunately, data from a single metabolite requires us to solve hundreds or thousands of instances of this problem - and in a single Liquid Chromatography MS/MS run, hundreds or thousands of metabolites are measured. Here, we comprehensively evaluate the performance of several heuristic algorithms for the problem. Unfortunately, as is often the case in bioinformatics, the structure of the (chemically) true solution is not known to us; therefore we can only evaluate against the optimal solution of an instance. Evaluating the quality of a heuristic based on scores can be misleading: even a slightly suboptimal solution can be structurally very different from the optimal solution, but it is the structure of a solution and not its score that is relevant for the downstream analysis. To this end, we propose a different evaluation setup: given a set of candidate instances of which exactly one is known to be correct, the heuristic in question solves each instance to the best of its ability, producing a score for each instance, which is then used to rank the instances. We then evaluate whether the correct instance is ranked highly by the heuristic. We find that one particular heuristic consistently ranks the correct instance in a top position. We also find that the scores of the best heuristic solutions are very close to the optimal score; in contrast, the structure of the solutions can deviate significantly from the optimal structures. Integrating the heuristic allowed us to speed up computations in practice by a factor of 100.
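
The ranking-based evaluation setup can be illustrated with a few lines of Python; this is a hypothetical sketch of the general idea (score each candidate instance, rank by score, check the rank of the correct one), not the authors' evaluation code.

# Illustrative sketch of a ranking-based evaluation: each candidate instance
# gets a heuristic score, the candidates are ranked by score, and we record
# the rank of the instance known to be correct (e.g. for top-k accuracy).
def rank_of_correct(scores, correct_id):
    """scores: dict mapping candidate id -> heuristic score (higher = better).
    Returns the 1-based rank of the correct candidate."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(correct_id) + 1

# Hypothetical scores for four candidate instances of one compound:
scores = {"cand_1": 41.7, "cand_2": 55.2, "cand_3": 12.9, "cand_4": 54.8}
r = rank_of_correct(scores, correct_id="cand_2")
print(r, r <= 1)   # 1 True  -> correct candidate ranked first (top-1 hit)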

Cite as

Kai Dührkop, Marie A. Lataretu, W. Timothy J. White, and Sebastian Böcker. Heuristic Algorithms for the Maximum Colorful Subtree Problem. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 23:1-23:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Copy BibTex To Clipboard

@InProceedings{duhrkop_et_al:LIPIcs.WABI.2018.23,
  author =	{D\"{u}hrkop, Kai and Lataretu, Marie A. and White, W. Timothy J. and B\"{o}cker, Sebastian},
  title =	{{Heuristic Algorithms for the Maximum Colorful Subtree Problem}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{23:1--23:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.23},
  URN =		{urn:nbn:de:0030-drops-93256},
  doi =		{10.4230/LIPIcs.WABI.2018.23},
  annote =	{Keywords: Fragmentation trees, Computational mass spectrometry}
}
Document
Parsimonious Migration History Problem: Complexity and Algorithms

Authors: Mohammed El-Kebir


Abstract
In many evolutionary processes we observe extant taxa in different geographical or anatomical locations. To reconstruct the migration history from a given phylogenetic tree T, one can model locations using an additional character and apply parsimony criteria to assign a location to each internal vertex of T. The migration criterion assumes that migrations are independent events. This assumption does not hold for evolutionary processes where distinct taxa from different lineages comigrate from one location to another in a single event, as is the case in metastasis and in certain infectious diseases. To account for such cases, the comigration criterion was recently introduced, and used as an optimization criterion in the Parsimonious Migration History (PMH) problem. In this work, we show that PMH is NP-hard. In addition, we show that a variant of PMH is fixed parameter tractable (FPT) in the number of locations. On simulated instances of practical size, we demonstrate that our FPT algorithm outperforms a previous integer linear program in terms of running time.
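
For intuition, the plain migration criterion (each edge whose endpoints are assigned different locations counts as one migration) can be minimised with a standard Sankoff-style small-parsimony recursion. The sketch below illustrates only that baseline criterion, not the comigration criterion or the FPT algorithm of the paper, and all data in it are hypothetical.

# Sketch only: the plain migration criterion (number of edges whose endpoints
# receive different locations), minimised by a Sankoff-style small-parsimony
# DP on a rooted tree. The comigration criterion is not modelled here.
def min_migrations(tree, leaf_location, locations):
    """tree: dict node -> list of children; leaves have no entry or [].
    leaf_location: dict leaf -> observed location.
    Returns the minimum number of migrations over all internal labelings."""
    INF = float("inf")

    def cost(node):
        children = tree.get(node, [])
        if not children:                       # leaf: location is fixed
            return {l: (0 if l == leaf_location[node] else INF)
                    for l in locations}
        table = {}
        child_costs = [cost(c) for c in children]
        for l in locations:
            total = 0
            for cc in child_costs:
                total += min(cc[m] + (0 if m == l else 1) for m in locations)
            table[l] = total
        return table

    root = next(iter(tree))                    # assume the first key is the root
    return min(cost(root).values())

# Hypothetical example: two lineages sampled in the lung both descend from a
# breast clone, so a single migration suffices.
tree = {"root": ["x", "y"], "x": ["l1", "l2"], "y": []}
leaf_location = {"l1": "lung", "l2": "lung", "y": "breast"}
print(min_migrations(tree, leaf_location, ["breast", "lung"]))   # 1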

Cite as

Mohammed El-Kebir. Parsimonious Migration History Problem: Complexity and Algorithms. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 24:1-24:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Copy BibTex To Clipboard

@InProceedings{elkebir:LIPIcs.WABI.2018.24,
  author =	{El-Kebir, Mohammed},
  title =	{{Parsimonious Migration History Problem: Complexity and Algorithms}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{24:1--24:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.24},
  URN =		{urn:nbn:de:0030-drops-93263},
  doi =		{10.4230/LIPIcs.WABI.2018.24},
  annote =	{Keywords: Reconciliation, maximum parsimony, metastasis, infection, phylogenetics, phylogeography, fixed parameter tractability}
}
Document
The Wasserstein Distance as a Dissimilarity Measure for Mass Spectra with Application to Spectral Deconvolution

Authors: Szymon Majewski, Michal Aleksander Ciach, Michal Startek, Wanda Niemyska, Blazej Miasojedow, and Anna Gambin


Abstract
We propose a new approach for the comparison of mass spectra using a metric known in computer science as the Earth Mover's Distance and in mathematics as the Wasserstein distance. We argue that this approach allows for natural and robust solutions to various problems in the analysis of mass spectra. In particular, we show an application to the problem of deconvolution, in which we infer the proportions of several overlapping isotopic envelopes of similar compounds. Combined with the previously proposed generator of isotopic envelopes, IsoSpec, our approach works for a wide range of masses and charges in the presence of several types of measurement inaccuracies. To reduce the computational complexity of the solution, we derive an effective implementation of the Interior Point Method as the optimization procedure. The software for mass spectral comparison and deconvolution based on the Wasserstein distance is available at https://github.com/mciach/wassersteinms.
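
As a small, hedged illustration of the underlying metric (not the wassersteinms package itself), the one-dimensional Wasserstein distance between two centroided spectra can be computed with SciPy by treating peak intensities as weights over the m/z axis; the peak lists below are made up.

# Illustrative sketch: the 1D Wasserstein (Earth Mover's) distance between
# two centroided mass spectra, treating each spectrum as a distribution over
# the m/z axis with intensities as weights (SciPy normalises the weights).
from scipy.stats import wasserstein_distance

# Hypothetical spectra given as (m/z, intensity) peak lists.
mz_a, int_a = [100.0, 101.0, 102.0], [0.6, 0.3, 0.1]
mz_b, int_b = [100.1, 101.1, 102.1], [0.5, 0.35, 0.15]

d = wasserstein_distance(mz_a, mz_b, u_weights=int_a, v_weights=int_b)
print(f"Wasserstein distance between spectra: {d:.3f} m/z units")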

Cite as

Szymon Majewski, Michal Aleksander Ciach, Michal Startek, Wanda Niemyska, Blazej Miasojedow, and Anna Gambin. The Wasserstein Distance as a Dissimilarity Measure for Mass Spectra with Application to Spectral Deconvolution. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 25:1-25:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Copy BibTex To Clipboard

@InProceedings{majewski_et_al:LIPIcs.WABI.2018.25,
  author =	{Majewski, Szymon and Ciach, Michal Aleksander and Startek, Michal and Niemyska, Wanda and Miasojedow, Blazej and Gambin, Anna},
  title =	{{The Wasserstein Distance as a Dissimilarity Measure for Mass Spectra with Application to Spectral Deconvolution}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{25:1--25:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.25},
  URN =		{urn:nbn:de:0030-drops-93270},
  doi =		{10.4230/LIPIcs.WABI.2018.25},
  annote =	{Keywords: Mass spectrometry, Tandem mass spectrometry, Wasserstein distance, Earth mover's distance, Similarity measure, Jaccard score, Tanimoto similarity}
}
