eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
1
474
10.4230/LIPIcs.WABI.2022
article
LIPIcs, Volume 242, WABI 2022, Complete Volume
Boucher, Christina
1
https://orcid.org/0000-0001-9509-9725
Rahmann, Sven
2
https://orcid.org/0000-0002-8536-6065
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
Department of Computer Science and Center for Bioinformatics, Saarland University, Saarbrücken, Germany
LIPIcs, Volume 242, WABI 2022, Complete Volume
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022/LIPIcs.WABI.2022.pdf
LIPIcs, Volume 242, WABI 2022, Complete Volume
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
0:i
0:xii
10.4230/LIPIcs.WABI.2022.0
article
Front Matter, Table of Contents, Preface, Conference Organization
Boucher, Christina
1
https://orcid.org/0000-0001-9509-9725
Rahmann, Sven
2
https://orcid.org/0000-0002-8536-6065
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
Department of Computer Science and Center for Bioinformatics, Saarland University, Saarbrücken, Germany
Front Matter, Table of Contents, Preface, Conference Organization
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.0/LIPIcs.WABI.2022.0.pdf
Front Matter
Table of Contents
Preface
Conference Organization
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
1:1
1:2
10.4230/LIPIcs.WABI.2022.1
article
Efficient Solutions to Biological Problems Using de Bruijn Graphs (Invited Talk)
Salmela, Leena
1
https://orcid.org/0000-0002-0756-543X
University of Helsinki, Finland
The de Bruijn graph has become a standard method in the analysis of sequencing reads in computational biology due to its ability to represent the information contained in large read sets in small space. A de Bruijn graph represents a set of sequencing reads by its k-mers, i.e. the set of substrings of length k that occur in the reads. In the classical definition, the k-mers are the edges of the graph and the nodes are the (k-1)-base-long prefixes and suffixes of the k-mers. Usually only k-mers occurring several times in the read set are kept to filter out noise in the data. De Bruijn graphs have been used to solve many problems in computational biology including genome assembly [Ramana M. Idury and Michael S. Waterman, 1995; Pavel A. Pevzner et al., 2001; Anton Bankevich et al., 2012; Yu Peng et al., 2010], sequencing error correction [Leena Salmela and Eric Rivals, 2014; Giles Miclotte et al., 2016; Leena Salmela et al., 2017; Limasset et al., 2019], reference free variant calling [Raluca Uricaru et al., 2015], indexing read sets [Camille Marchet et al., 2021], and so on. Next I will discuss two of these problems in more depth.
The de Bruijn graph first emerged in computational biology in the context of genome assembly [Ramana M. Idury and Michael S. Waterman, 1995; Pavel A. Pevzner et al., 2001] where the task is to reconstruct a genome based on sequencing reads. As the de Bruijn graph can represent large read sets compactly, it became the standard approach to assemble short reads [Anton Bankevich et al., 2012; Yu Peng et al., 2010]. In the theoretical framework of de Bruijn graph based genome assembly, a genome is thought to correspond to an Eulerian path in the de Bruijn graph built on the sequencing reads. In practice, the Eulerian path is not unique and thus not useful in the biological context. Therefore, practical implementations report subpaths that are guaranteed to be part of any Eulerian path and thus part of the actual genome. Such models include unitigs, which are nonbranching paths of the de Bruijn graph, and more involved definitions such as omnitigs [Alexandru I. Tomescu and Paul Medvedev, 2017].
In genome assembly the choice of k is a crucial matter. A small k can result in a tangled graph, whereas too large a k will fragment the graph. Furthermore, a different value of k may be optimal for different parts of the genome. Variable order de Bruijn graphs [Christina Boucher et al., 2015; Djamal Belazzougui et al., 2016], which represent de Bruijn graphs of all orders k in a single data structure, have been proposed as a solution but no rigorous definition corresponding to unitigs has been presented. We give the first definition of assembled sequences, i.e. contigs, on such graphs and an algorithm for enumerating them.
Another problem that can be solved with de Bruijn graphs is the correction of sequencing errors [Leena Salmela and Eric Rivals, 2014; Giles Miclotte et al., 2016; Leena Salmela et al., 2017; Limasset et al., 2019]. Because each position of a genome is sequenced several times, it is possible to correct sequencing errors in reads if we can identify data originating from the same genomic region. A de Bruijn graph can be used to represent compactly the reliable information and the individual reads can be corrected by aligning them to the graph.
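The classical edge-centric construction described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in any of the cited tools; the threshold `min_count` plays the role of the noise filter mentioned in the abstract.

```python
from collections import Counter, defaultdict

def de_bruijn_graph(reads, k, min_count=2):
    """Edge-centric de Bruijn graph: k-mers occurring at least
    min_count times are the edges; their (k-1)-mer prefixes and
    suffixes are the nodes."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    graph = defaultdict(set)
    for kmer, c in counts.items():
        if c >= min_count:
            # edge from the k-mer's prefix node to its suffix node
            graph[kmer[:-1]].add(kmer[1:])
    return graph
```

Unitigs then correspond to maximal nonbranching paths in this graph.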
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.1/LIPIcs.WABI.2022.1.pdf
de Bruijn graph
variable order de Bruijn graph
genome assembly
sequencing error correction
k-mers
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
2:1
2:21
10.4230/LIPIcs.WABI.2022.2
article
Eulertigs: Minimum Plain Text Representation of k-mer Sets Without Repetitions in Linear Time
Schmidt, Sebastian
1
https://orcid.org/0000-0003-4878-2809
Alanko, Jarno N.
1
https://orcid.org/0000-0002-8003-9225
University of Helsinki, Finland
A fundamental operation in computational genomics is to reduce the input sequences to their constituent k-mers. For maximum performance of downstream applications it is important to store the k-mers in small space, while keeping the representation easy and efficient to use (i.e. without k-mer repetitions and in plain text). Recently, heuristics were presented to compute a near-minimum such representation. We present an algorithm to compute a minimum representation in optimal (linear) time and use it to evaluate the existing heuristics. For that, we present a formalisation of arc-centric bidirected de Bruijn graphs and carefully prove that it accurately models the k-mer spectrum of the input. Our algorithm first constructs the de Bruijn graph in linear time in the length of the input strings (for a fixed-size alphabet). Then it uses an Eulerian-cycle-based algorithm to compute the minimum representation, in time linear in the size of the output.
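The Eulerian-cycle step can be illustrated generically with Hierholzer's algorithm on a directed multigraph; this sketch ignores the bidirected arc-centric details the paper formalises and assumes the circuit exists.

```python
def eulerian_circuit(adj, start):
    """Hierholzer's algorithm. adj maps each node to a list of
    successors (a multigraph: repeats allowed); edges are consumed
    in place. Returns the circuit as a list of visited nodes."""
    stack, circuit = [start], []
    while stack:
        v = stack[-1]
        if adj.get(v):
            # follow an unused outgoing edge
            stack.append(adj[v].pop())
        else:
            # dead end: node is finished, emit it
            circuit.append(stack.pop())
    return circuit[::-1]
```

Spelling the node labels along such a circuit is what yields a repetition-free plain-text representation of the k-mer set.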
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.2/LIPIcs.WABI.2022.2.pdf
Spectrum preserving string sets
Eulerian cycle
Suffix tree
Bidirected arc-centric de Bruijn graph
k-mer based methods
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
3:1
3:22
10.4230/LIPIcs.WABI.2022.3
article
Predicting Horizontal Gene Transfers with Perfect Transfer Networks
López Sánchez, Alitzel
1
Lafond, Manuel
1
Computer Science Department, Université de Sherbrooke, Canada
Horizontal gene transfer inference approaches are usually based on gene sequences: parametric methods search for patterns that deviate from a particular genomic signature, while phylogenetic methods use sequences to reconstruct the gene and species trees. However, it is well-known that sequences have difficulty identifying ancient transfers since mutations have enough time to erase all evidence of such events. In this work, we ask whether character-based methods can predict gene transfers. Their advantage over sequences is that homologous genes can have low DNA similarity, but still have retained enough important common motifs that allow them to have common character traits, for instance the same functional or expression profile. A phylogeny that has two separate clades that acquired the same character independently might indicate the presence of a transfer even in the absence of sequence similarity.
We introduce perfect transfer networks, which are phylogenetic networks that can explain the character diversity of a set of taxa. This problem has been studied extensively in the form of ancestral recombination networks, but these only model hybridization events and do not differentiate between direct parents and lateral donors. We focus on tree-based networks, in which edges representing vertical descent are clearly distinguished from those that represent horizontal transmission. Our model is a direct generalization of perfect phylogeny models to such networks. Our goal is to initiate a study on the structural and algorithmic properties of perfect transfer networks. We then show that in polynomial time, one can decide whether a given network is a valid explanation for a set of taxa, and show how, for a given tree, one can add transfer edges to it so that it explains a set of taxa.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.3/LIPIcs.WABI.2022.3.pdf
Horizontal gene transfer
tree-based networks
perfect phylogenies
character-based
gene-expression
indirect phylogenetic methods
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
4:1
4:14
10.4230/LIPIcs.WABI.2022.4
article
Haplotype Threading Using the Positional Burrows-Wheeler Transform
Sanaullah, Ahsan
1
Zhi, Degui
2
Zhang, Shaojie
1
Department of Computer Science, University of Central Florida, Orlando, FL, USA
School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
In the classic model of population genetics, one haplotype (query) is considered as a mosaic copy of segments from a number of haplotypes in a panel, or threading the haplotype through the panel. The Li and Stephens model parameterized this problem using a hidden Markov model (HMM). However, HMM algorithms are linear in the sample size, and can be very expensive for biobank-scale panels. Here, we formulate the haplotype threading problem as the Minimal Positional Substring Cover problem, where a query is represented by a mosaic of a minimal number of substring matches from the panel. We show that this problem can be solved by a sequential set of greedy set maximal matches. Moreover, the solution space is bounded by the left-most and right-most solutions obtained by the greedy approach. Based on these results, we formulate and solve several variations of this problem. Although our results are yet to be generalized to the cases with mismatches, they offer a theoretical framework for designing methods for genotype imputation and haplotype phasing.
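The greedy cover idea can be sketched as a generic interval-cover routine. The interface is hypothetical: `reach[i]` stands in for the (PBWT-computed) exclusive end of the longest panel match starting at query position i; the paper's actual algorithms operate on set maximal matches directly.

```python
def minimal_cover(reach, n):
    """Greedy minimal cover of query positions 0..n-1 by panel
    matches. reach[i] = exclusive end of the longest match starting
    at i. Returns (start, end) segments, or None if some position
    cannot be covered. O(n^2) for clarity, not efficiency."""
    segments, covered = [], 0
    while covered < n:
        # among matches starting at or before the first uncovered
        # position, pick the one extending furthest right
        start = max(range(covered + 1), key=lambda j: reach[j])
        end = reach[start]
        if end <= covered:
            return None  # gap: no match extends past `covered`
        segments.append((start, end))
        covered = end
    return segments
```

Choosing the furthest-reaching match at each step is the standard exchange argument behind greedy interval covers.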
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.4/LIPIcs.WABI.2022.4.pdf
Substring Cover
PBWT
Haplotype Threading
Haplotype Matching
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
5:1
5:20
10.4230/LIPIcs.WABI.2022.5
article
Non-Binary Tree Reconciliation with Endosymbiotic Gene Transfer
Gascon, Mathieu
1
El-Mabrouk, Nadia
2
Département d'informatique et de recherche opérationnelle (DIRO), Université de Montréal, Canada
DIRO, Université de Montréal, Canada
Gene transfer between the mitochondrial and nuclear genome of the same species, called endosymbiotic gene transfer (EGT), is a mechanism which has largely shaped gene contents in eukaryotes since a unique ancestral endosymbiotic event known to be at the origin of all mitochondria. The gene tree-species tree reconciliation model has been recently extended to account for EGTs: given a binary gene tree and a binary species tree, the EndoRex software outputs an optimal DLE-Reconciliation, that is an embedding of the gene tree into the species tree inducing a most parsimonious history of Duplications, Losses and EGT events. Here, we provide the first algorithmic study for DLE-Reconciliation in the case of a multifurcated (non-binary) gene tree. We present a general two-step method: first, ignoring the mitochondrial-nuclear (or 0-1) labeling of leaves, output a binary resolution minimizing the DL-Reconciliation; second, for each resolution, assign a known number of 0s and 1s to the leaves in a way minimizing EGT events. While Step 1 corresponds to the well-studied non-binary DL-Reconciliation problem, the complexity of the formal label assignment problem related to Step 2 is unknown. Here, we show it is NP-complete even for a single polytomy (non-binary node). We then provide a heuristic which is exact for the unitary cost of operations, and a polynomial-time algorithm for solving a polytomy in the special case where genes are specific to a single genome (mitochondrial or nuclear) in all but one species.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.5/LIPIcs.WABI.2022.5.pdf
Reconciliation
Duplication
Endosymbiotic gene transfer
Multifurcated gene tree
Polytomy
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
6:1
6:23
10.4230/LIPIcs.WABI.2022.6
article
Constructing Founder Sets Under Allelic and Non-Allelic Homologous Recombination
Bonnet, Konstantinn
1
Marschall, Tobias
1
https://orcid.org/0000-0002-9376-1030
Doerr, Daniel
1
https://orcid.org/0000-0002-3720-6227
Institute for Medical Biometry and Bioinformatics, Heinrich Heine University, Düsseldorf, Germany
Homologous recombination between the maternal and paternal copies of a chromosome is a key mechanism for human inheritance and shapes population genetic properties of our species. However, a similar mechanism can also act between different copies of the same sequence, then called non-allelic homologous recombination (NAHR). This process can result in genomic rearrangements - including deletion, duplication, and inversion - and underlies many genomic disorders. Despite its importance for genome evolution and disease, there is a lack of computational models to study genomic loci prone to NAHR.
In this work, we propose such a computational model, providing a unified framework for both (allelic) homologous recombination and NAHR. Our model represents a set of genomes as a graph, where human haplotypes correspond to walks through this graph. We formulate two founder set problems under our recombination model, provide flow-based algorithms for their solution, and demonstrate scalability to problem instances arising in practice.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.6/LIPIcs.WABI.2022.6.pdf
founder set reconstruction
variation graph
pangenomics
NAHR
homologous recombination
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
7:1
7:24
10.4230/LIPIcs.WABI.2022.7
article
Automated Design of Dynamic Programming Schemes for RNA Folding with Pseudoknots
Marchand, Bertrand
1
2
https://orcid.org/0000-0001-8060-6640
Will, Sebastian
1
https://orcid.org/0000-0002-2376-9205
Berkemer, Sarah J.
1
https://orcid.org/0000-0003-2028-7670
Bulteau, Laurent
2
https://orcid.org/0000-0003-1645-9345
Ponty, Yann
1
https://orcid.org/0000-0002-7615-3930
LIX (UMR 7161), Ecole Polytechnique, Institut Polytechnique de Paris, France
LIGM, CNRS, Université Gustave Eiffel, F-77454 Marne-la-Vallée, France
Despite being a textbook application of dynamic programming (DP) and routine task in RNA structure analysis, RNA secondary structure prediction remains challenging whenever pseudoknots come into play. To circumvent the NP-hardness of energy minimization in realistic energy models, specialized algorithms have been proposed for restricted conformation classes that capture the most frequently observed configurations.
While these methods rely on hand-crafted DP schemes, we generalize and fully automate the design of DP pseudoknot prediction algorithms. We formalize the problem of designing DP algorithms for an (infinite) class of conformations, modeled by (a finite number of) fatgraphs, and automatically build DP schemes minimizing their algorithmic complexity. We propose an algorithm for the problem, based on the tree-decomposition of a well-chosen representative structure, which we simplify and reinterpret as a DP scheme. The algorithm is fixed-parameter tractable in the treewidth tw of the fatgraph, and its output represents a 𝒪(n^{tw+1}) algorithm for predicting the MFE folding of an RNA of length n.
Our general framework supports general energy models, partition function computations, recursive substructures and partial folding, and could pave the way for algebraic dynamic programming beyond the context-free case.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.7/LIPIcs.WABI.2022.7.pdf
RNA folding
treewidth
dynamic programming
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
8:1
8:24
10.4230/LIPIcs.WABI.2022.8
article
Fast and Accurate Species Trees from Weighted Internode Distances
Liu, Baqiao
1
https://orcid.org/0000-0002-4210-8269
Warnow, Tandy
1
https://orcid.org/0000-0001-7717-3514
Department of Computer Science, University of Illinois Urbana-Champaign, IL, USA
Species tree estimation is a basic step in many biological research projects, but is complicated by the fact that gene trees can differ from the species tree due to processes such as incomplete lineage sorting (ILS), gene duplication and loss (GDL), and horizontal gene transfer (HGT), which can cause different regions within the genome to have different evolutionary histories (i.e., "gene tree heterogeneity"). One approach to estimating species trees in the presence of gene tree heterogeneity resulting from ILS operates by computing trees on each genomic region (i.e., computing "gene trees") and then using these gene trees to define a matrix of average internode distances, where the internode distance in a tree T between two species x and y is the number of nodes in T between the leaves corresponding to x and y. Given such a matrix, a tree can then be computed using methods such as neighbor joining. Methods such as ASTRID and NJst (which use this basic approach) are provably statistically consistent and very fast (low degree polynomial time), and have had high accuracy under many conditions, making them competitive with other popular species tree estimation methods. In this study, inspired by the very recent work of weighted ASTRAL, we present weighted ASTRID, a variant of ASTRID that takes the branch uncertainty on the gene trees into account in the internode distance. Our experimental study evaluating weighted ASTRID shows improvements in accuracy compared to the original (unweighted) ASTRID while remaining fast. Moreover, weighted ASTRID shows competitive accuracy against weighted ASTRAL, the state of the art. Thus, this study provides a new and very fast method for species tree estimation that improves upon ASTRID and has comparable accuracy with the state of the art while remaining much faster. Weighted ASTRID is available at https://github.com/RuneBlaze/internode.
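The (unweighted) internode distance defined above can be computed directly on a tree given as an adjacency dict; this is an illustrative sketch of the distance-matrix entry, not the ASTRID implementation, and the tree encoding is an assumption.

```python
from collections import deque

def internode_distance(adj, x, y):
    """Number of nodes strictly between leaves x and y in a tree
    given as {node: [neighbors]} (the quantity averaged over gene
    trees to fill the distance matrix)."""
    prev = {x: None}
    q = deque([x])
    while q:  # BFS from x until y is reached
        v = q.popleft()
        if v == y:
            break
        for w in adj[v]:
            if w not in prev:
                prev[w] = v
                q.append(w)
    # walk back from y, counting internal nodes on the path
    count, v = 0, prev[y]
    while v is not None and v != x:
        count += 1
        v = prev[v]
    return count
```

Averaging this quantity over all gene trees containing both species, then running neighbor joining on the matrix, is the basic ASTRID/NJst pipeline the abstract describes.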
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.8/LIPIcs.WABI.2022.8.pdf
Species tree estimation
ASTRID
ASTRAL
multi-species coalescent
incomplete lineage sorting
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
9:1
9:20
10.4230/LIPIcs.WABI.2022.9
article
On Weighted k-mer Dictionaries
Pibiri, Giulio Ermanno
1
2
https://orcid.org/0000-0003-0724-7092
Ca' Foscari University of Venice, Venice, Italy
ISTI-CNR, Pisa, Italy
We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing.
In this work we extend the recently introduced SSHash dictionary (Pibiri, Bioinformatics 2022) to also store compactly the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing (several times) better compression than the empirical entropy of the weights. We also study the problem of reducing the number of runs in the weights to improve compression even further and illustrate a lower bound for this problem. We propose an efficient greedy algorithm to reduce the number of runs and show empirically that it performs well, i.e., very similarly to the lower bound. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. To date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.
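The run-encoding effect is easy to see with a plain run-length encoder: when the dictionary's k-mer order places equal-abundance k-mers consecutively, the weight sequence collapses to a few (value, length) pairs. This is a minimal sketch of the idea, not SSHash's actual encoding.

```python
def rle(weights):
    """Run-length encode a sequence of k-mer weights as
    (value, run_length) pairs."""
    runs = []
    for w in weights:
        if runs and runs[-1][0] == w:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([w, 1])  # start a new run
    return [tuple(r) for r in runs]
```

With long runs the number of pairs falls well below what per-symbol entropy coding of the multiset of weights could achieve, which is the effect the abstract refers to.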
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.9/LIPIcs.WABI.2022.9.pdf
K-Mers
Weights
Compression
Hashing
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
10:1
10:20
10.4230/LIPIcs.WABI.2022.10
article
Accurate k-mer Classification Using Read Profiles
Suzuki, Yoshihiko
1
https://orcid.org/0000-0002-8807-2206
Myers, Gene
1
2
3
https://orcid.org/0000-0002-6580-7839
Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
Center for Systems Biology Dresden, Dresden, Germany
Contiguous strings of length k, called k-mers, are a fundamental element in many bioinformatics tasks. The number of occurrences of a k-mer in a given set of DNA sequencing reads, its k-mer count, has often been used to roughly estimate the copy number of a k-mer in the genome from which the reads were sampled. The problem of estimating copy numbers, called here the k-mer classification problem, has been based on simply analyzing the histogram of counts of all the k-mers in a data set, thus ignoring the positional context and dependency between multiple k-mers that appear nearby in the underlying genome. Here we present an efficient and significantly more accurate method for classifying k-mers by analyzing the sequence of k-mer counts along each sequencing read, called a read profile. By analyzing read profiles, we explicitly incorporate into the model the dependencies between the positionally adjacent k-mers and the sequence context-dependent error rates estimated from the given dataset. For long sequencing reads produced with the accurate high-fidelity (HiFi) sequencing technology, an implementation of our method, ClassPro, outperforms the conventional, histogram-based method in every simulation dataset of fruit fly and human with various realistic values of sequencing coverage and heterozygosity. Within only a few minutes, ClassPro achieves an average accuracy of > 99.99% across reads without repetitive k-mers and > 99.5% across all reads, in a typical fruit fly simulation data set with a 40× coverage. The resulting, more accurate k-mer classifications by ClassPro are in principle expected to improve any k-mer-based downstream analyses for sequenced reads such as read mapping and overlap, spectral alignment and error correction, haplotype phasing, and trio binning to name but a few. ClassPro is available at https://github.com/yoshihikosuzuki/ClassPro.
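A read profile as defined above is simply the sequence of global counts of the read's consecutive k-mers; this sketch assumes a precomputed count table and is not the ClassPro implementation.

```python
def read_profile(read, counts, k):
    """Profile of k-mer counts along a read. counts maps a k-mer
    to its count in the whole dataset; k-mers absent from the
    table get 0."""
    return [counts.get(read[i:i + k], 0) for i in range(len(read) - k + 1)]
```

A sharp local drop in the profile typically signals a sequencing error rather than a genuine low-copy region, which is the positional dependency a histogram of counts cannot capture.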
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.10/LIPIcs.WABI.2022.10.pdf
K-mer
K-mer count
K-mer classification
HiFi sequencing
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
11:1
11:19
10.4230/LIPIcs.WABI.2022.11
article
New Algorithms for Structure Informed Genome Rearrangement
Ozery, Eden
1
Zehavi, Meirav
1
Ziv-Ukelson, Michal
1
Ben Gurion University of the Negev, Israel
We define two new computational problems in the domain of perfect genome rearrangements, and propose three algorithms to solve them. The rearrangement scenarios modeled by the problems consider Reversal and Block Interchange operations, and a PQ-tree is utilized to guide the allowed operations and to compute their weights. In the first problem, Constrained TreeToString Divergence (CTTSD), we define the basic structure-informed rearrangement based divergence measure. Here, we assume that the gene order members of the gene cluster from which the PQ-tree is constructed are permutations. The PQ-tree representing the gene cluster is ordered such that the series of gene IDs spelled by its leaves is equivalent to the reference gene order. Then, a structure-informed gene rearrangement measure is computed between the ordered PQ-tree and the target gene order. The second problem, TreeToString Divergence (TTSD), generalizes CTTSD, where the gene order members are not necessarily permutations and the structure-informed rearrangement based divergence measure is extended to also consider up to d_S and d_T gene insertion and deletion operations, respectively, when modelling the PQ-tree informed divergence process from the reference order to the target order.
The first algorithm solves CTTSD in O(n γ² ⋅ (m_p ⋅ 1.381^γ + m_q)) time and O(n²) space, where γ is the maximum number of children of a node, n is the length of the string and the number of leaves in the tree, and m_p and m_q are the number of P-nodes and Q-nodes in the tree, respectively. If one of the penalties of CTTSD is 0, then the algorithm runs in O(n m γ²) time and O(n²) space. The second algorithm solves TTSD in O(n² γ² {d_T}² {d_S}² m² (m_p ⋅ 5^γ γ + m_q)) time and O(d_T d_S m (m n + 5^γ)) space, where γ is the maximum number of children of a node, n is the length of the string, m is the number of leaves in the tree, m_p and m_q are the number of P-nodes and Q-nodes in the tree, respectively, and allowing d_T deletions from the tree and d_S deletions from the string. The third algorithm is intended to reduce the space complexity of the second algorithm. It solves a variant of the problem (where one of the penalties of TTSD is 0) in O(n γ² {d_T}² {d_S}² m² (m_p ⋅ 4^γ γ²n(d_T+d_S+m+n) + m_q)) time and O(γ² n m² d_T d_S (d_T+d_S+m+n)) space.
The algorithm is implemented as a software tool, denoted MEM-Rearrange, and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1,487 prokaryotic genomes.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.11/LIPIcs.WABI.2022.11.pdf
PQ-tree
Gene Cluster
Breakpoint Distance
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
12:1
12:20
10.4230/LIPIcs.WABI.2022.12
article
Fast Gapped k-mer Counting with Subdivided Multi-Way Bucketed Cuckoo Hash Tables
Zentgraf, Jens
1
2
3
https://orcid.org/0000-0001-9444-2755
Rahmann, Sven
1
2
https://orcid.org/0000-0002-8536-6065
Department of Computer Science, Saarland University, Saarbrücken, Germany
Center for Bioinformatics, Saarland University, Saarbrücken, Germany
Saarbrücken Graduate School of Computer Science, Saarland University, Saarbrücken, Germany
Motivation. In biological sequence analysis, alignment-free (also known as k-mer-based) methods are increasingly replacing mapping- and alignment-based methods for various applications. A basic step of such methods consists of building a table of all k-mers of a given set of sequences (a reference genome or a dataset of sequenced reads) and their counts. Over the past years, efficient methods and tools for k-mer counting have been developed. In a different line of work, the use of gapped k-mers has been shown to offer advantages over the use of the standard contiguous k-mers. However, no tool seems to be available that is able to count gapped k-mers with the same efficiency as contiguous k-mers. One reason is that the most efficient k-mer counters use minimizers (of a length m < k) to group k-mers into buckets, such that many consecutive k-mers are classified into the same bucket. This approach leads to cache-friendly (and hence extremely fast) algorithms, but the approach does not transfer easily to gapped k-mers. Consequently, the existing efficient k-mer counters cannot be trivially modified to count gapped k-mers with the same efficiency.
Results. We present a different approach that is equally applicable to contiguous k-mers and gapped k-mers. We use multi-way bucketed Cuckoo hash tables to efficiently store (gapped) k-mers and their counts. We also describe a method to parallelize counting over multiple threads without using locks: We subdivide the hash table into independent subtables, and use a producer-consumer model, such that each thread serves one subtable. This requires designing Cuckoo hash functions with the property that all alternative locations for each k-mer are located in the same subtable. Compared to some of the fastest contiguous k-mer counters, our approach is of comparable speed, or even faster, on large datasets, and it is the only one that supports gapped k-mers.
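The key hash-function property described above (all alternative locations of a key fall in the same subtable, so exactly one consumer thread owns it) can be sketched as follows. The hash mixing here is an illustrative stand-in, not the authors' actual functions, and the parameters are hypothetical.

```python
def locations(kmer_code, num_subtables, buckets_per_subtable, num_choices=3):
    """Candidate buckets for a key under subtable-constrained
    multi-way Cuckoo hashing: the subtable index depends only on
    the key, while the alternative bucket choices vary only
    within that subtable."""
    subtable = hash(("sub", kmer_code)) % num_subtables
    buckets = [
        hash((i, kmer_code)) % buckets_per_subtable for i in range(num_choices)
    ]
    return subtable, buckets
```

Because every displacement during Cuckoo insertion stays inside one subtable, threads never contend for the same bucket and no locks are needed, matching the producer-consumer design in the abstract.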
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.12/LIPIcs.WABI.2022.12.pdf
gapped k-mer
k-mer
counting
Cuckoo hashing
parallelization
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
13:1
13:16
10.4230/LIPIcs.WABI.2022.13
article
A Linear Time Algorithm for an Extended Version of the Breakpoint Double Distance
Braga, Marília D. V.
1
https://orcid.org/0000-0002-4092-2646
Brockmann, Leonie R.
1
Klerx, Katharina
1
Stoye, Jens
1
https://orcid.org/0000-0002-4656-7155
Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Germany
Two genomes over the same set of gene families form a canonical pair when each of them has exactly one gene from each family. A genome is circular when it contains only circular chromosomes. Different distances of canonical circular genomes can be derived from a structure called breakpoint graph, which represents the relation between the two given genomes as a collection of cycles of even length. Then, the breakpoint distance is equal to n-c_2, where n is the number of genes and c_2 is the number of cycles of length 2. Similarly, when the considered rearrangements are those modeled by the double-cut-and-join (DCJ) operation, the rearrangement distance is n-c, where c is the total number of cycles.
The distance problem is a basic unit for several other combinatorial problems related to genome evolution and ancestral reconstruction, such as median or double distance. Interestingly, both median and double distance problems can be solved in polynomial time for the breakpoint distance, while they are NP-hard for the rearrangement distance. One way of exploring the complexity space between these two extremes is to consider a σ_k distance, defined to be n-(c_2+c_4+…+c_k), and increasingly investigate the complexities of median and double distance for the σ₄ distance, then the σ₆ distance, and so on. While for the median much effort in our and other research groups has yielded no progress even for the σ₄ distance, for the double distance under the σ₄ and σ₆ distances we could devise linear-time algorithms, which we present here.
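Given the (even) cycle lengths of the breakpoint graph, the σ_k distance defined above is immediate to evaluate; this sketch assumes the cycle decomposition has already been computed.

```python
def sigma_k_distance(n, cycle_lengths, k):
    """sigma_k distance n - (c_2 + c_4 + ... + c_k): n minus the
    number of breakpoint-graph cycles of length at most k.
    k = 2 recovers the breakpoint distance n - c_2; letting k grow
    without bound recovers the DCJ distance n - c."""
    return n - sum(1 for ell in cycle_lengths if ell <= k)
```
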
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.13/LIPIcs.WABI.2022.13.pdf
Comparative genomics
genome rearrangement
breakpoint distance
double-cut-and-join (DCJ) distance
double distance
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
14:1
14:14
10.4230/LIPIcs.WABI.2022.14
article
Efficient Reconciliation of Genomic Datasets of High Similarity
Shibuya, Yoshihiro
1
https://orcid.org/0000-0002-3137-1504
Belazzougui, Djamal
2
Kucherov, Gregory
3
https://orcid.org/0000-0001-5899-5424
LIGM, Université Gustave Eiffel, Marne-la-Vallée, France
CAPA, DTISI, Centre de Recherche sur l'Information Scientifique et Technique, Algiers, Algeria
LIGM, CNRS, Université Gustave Eiffel, Marne-la-Vallée, France
We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originating from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating the Jaccard similarity of the underlying k-mer sets, compared to MinHash, which is a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that the involved data structures require space proportional to the difference of the k-mer sets and are independent of the size of the sets themselves. As another application, we show how our ideas can be applied in order to efficiently compute (an approximation of) the k-mers that differ between two datasets, still using space only proportional to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus pneumoniae genomes).
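Closed syncmers can be selected by inspecting each k-mer in isolation, which is what makes them context-independent, unlike minimizers. A small illustrative sketch, using lexicographic order in place of the hash-based order normally used in practice:

```python
def closed_syncmers(seq, k, s):
    """Select closed syncmers: k-mers whose minimal s-mer (here by
    lexicographic order) occurs at the first or last of the k-s+1
    positions. The decision depends only on the k-mer itself."""
    picked = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        m = min(smers)
        if smers[0] == m or smers[-1] == m:
            picked.append(kmer)
    return picked

print(closed_syncmers("ACGTACGT", k=5, s=2))  # ['ACGTA', 'CGTAC']
```

Because the test is per-k-mer, two datasets sample the same k-mer identically regardless of its surrounding context, which is what makes syncmer-based Jaccard estimation unbiased.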
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.14/LIPIcs.WABI.2022.14.pdf
k-mers
sketching
Invertible Bloom Lookup Tables
IBLT
MinHash
syncmers
minimizers
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
15:1
15:22
10.4230/LIPIcs.WABI.2022.15
article
WGSUniFrac: Applying UniFrac Metric to Whole Genome Shotgun Data
Wei, Wei
1
https://orcid.org/0000-0002-2024-7233
Koslicki, David
2
3
4
https://orcid.org/0000-0002-0640-954X
The Pennsylvania State University, University Park, PA, USA
Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
Department of Biology, The Pennsylvania State University, University Park, PA, USA
Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
The UniFrac metric has proven useful in revealing diversity across metagenomic communities. Due to the phylogeny-based nature of this measurement, UniFrac has historically only been applied to 16S rRNA data. Meanwhile, Whole Genome Shotgun (WGS) metagenomics has been increasingly employed and shown to provide more information than 16S data, but a UniFrac-like diversity metric suitable for WGS data has not previously been developed. The main obstacle to applying UniFrac directly to WGS data is the absence of phylogenetic distances in the taxonomic relationships derived from WGS data. In this study, we demonstrate a method to overcome this intrinsic difference and compute the UniFrac metric on WGS data by assigning branch lengths to the taxonomic tree obtained from input taxonomic profiles. We conduct a series of experiments to demonstrate that this WGSUniFrac method is comparably robust to traditional 16S UniFrac and is not highly sensitive to branch-length assignments, be they data-derived or model-prescribed.
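For intuition, unweighted UniFrac is the fraction of total branch length whose subtree contains taxa from only one of the two communities. A toy sketch on a hand-built tree with hypothetical labels (not the WGSUniFrac pipeline, which first derives the tree and branch lengths from taxonomic profiles):

```python
def unifrac(parent, length, leaves_a, leaves_b):
    """Unweighted UniFrac on a rooted tree given as child -> parent, with
    branch lengths child -> length. A branch counts as 'unique' when its
    subtree holds taxa from only one of the two communities."""
    children = {}
    for c, p in parent.items():
        children.setdefault(p, []).append(c)

    def leafset(v):                       # leaves below node v
        if v not in children:
            return {v}
        return set().union(*(leafset(c) for c in children[v]))

    unique = shared = 0.0
    for v, l in length.items():
        under = leafset(v)
        in_a, in_b = bool(under & leaves_a), bool(under & leaves_b)
        if in_a and in_b:
            shared += l
        elif in_a or in_b:
            unique += l
    return unique / (unique + shared)

# Tiny tree: root R with child X (leaves L1, L2) and leaf L3.
parent = {"X": "R", "L1": "X", "L2": "X", "L3": "R"}
length = {"X": 1.0, "L1": 1.0, "L2": 1.0, "L3": 2.0}
print(unifrac(parent, length, {"L1"}, {"L2", "L3"}))  # 4/5 = 0.8
```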
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.15/LIPIcs.WABI.2022.15.pdf
UniFrac
beta-diversity
Whole Genome Shotgun
microbial community similarity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
16:1
16:22
10.4230/LIPIcs.WABI.2022.16
article
Reconstructing Phylogenetic Networks via Cherry Picking and Machine Learning
Bernardini, Giulia
1
2
https://orcid.org/0000-0001-6647-088X
van Iersel, Leo
3
Julien, Esther
3
Stougie, Leen
4
5
https://orcid.org/0000-0001-6938-8902
University of Trieste, Italy
CWI, Amsterdam, The Netherlands
Delft Institute of Applied Mathematics, Delft University of Technology, The Netherlands
CWI and Vrije Universiteit, Amsterdam, The Netherlands
Erable, France
Combining a set of phylogenetic trees into a single phylogenetic network that explains all of them is a fundamental challenge in evolutionary studies. In this paper, we apply the recently-introduced theoretical framework of cherry picking to design a class of heuristics that are guaranteed to produce a network containing each of the input trees, for practical-size datasets. The main contribution of this paper is the design and training of a machine learning model that captures essential information on the structure of the input trees and guides the algorithms towards better solutions. This is one of the first applications of machine learning to phylogenetic studies, and we show its promise with a proof-of-concept experimental study conducted on both simulated and real data consisting of binary trees with no missing taxa.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.16/LIPIcs.WABI.2022.16.pdf
Phylogenetics
Hybridization
Cherry Picking
Machine Learning
Heuristic
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
17:1
17:16
10.4230/LIPIcs.WABI.2022.17
article
Feasibility of Flow Decomposition with Subpath Constraints in Linear Time
Gibney, Daniel
1
Thankachan, Sharma V.
2
Aluru, Srinivas
1
Georgia Institute of Technology, Atlanta, GA, USA
North Carolina State University, Raleigh, NC, USA
The decomposition of flow-networks is an essential part of many transcriptome assembly algorithms used in Computational Biology. The addition of subpath constraints to this decomposition appeared recently as an effective way to incorporate longer, already known, portions of the transcript. The problem is defined as follows: given a weakly connected directed acyclic flow network G = (V, E, f) and a set ℛ of subpaths in G, find a flow decomposition so that every subpath in ℛ is included in some flow in the decomposition [Williams et al., WABI 2021]. The authors of that work presented an exponential time algorithm for determining the feasibility of such a flow decomposition, and more recently presented an O(|E| + L+|ℛ|³) time algorithm, where L is the sum of the path lengths in ℛ [Williams et al., TCBB 2022]. Our work provides an improved, linear O(|E| + L) time algorithm for determining the feasibility of such a flow decomposition. We also introduce two natural optimization variants of the feasibility problem: (i) determining the minimum sized subset of ℛ that must be removed to make a flow decomposition feasible, and (ii) determining the maximum sized subset of ℛ that can be maintained while making a flow decomposition feasible. We show that, under the assumption P ≠ NP, (i) does not admit a polynomial-time o(log |V|)-approximation algorithm and (ii) does not admit a polynomial-time O(|V|^{1/2-ε} + |ℛ|^{1-ε})-approximation algorithm for any constant ε > 0.
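As background for the flow-decomposition setting, without the subpath constraints that are the subject of the paper, here is a minimal greedy path-peeling sketch on a DAG flow given as an edge dictionary:

```python
def decompose_flow(flow, source, sink):
    """Greedy path peeling on a DAG flow {(u, v): value}: repeatedly trace
    a source-to-sink path through positive-flow edges and subtract its
    bottleneck. Flow conservation guarantees each trace reaches the sink."""
    paths = []
    while any(u == source and f > 0 for (u, _), f in flow.items()):
        path, v = [source], source
        while v != sink:
            v = next(b for (a, b), f in flow.items() if a == v and f > 0)
            path.append(v)
        w = min(flow[e] for e in zip(path, path[1:]))
        for e in zip(path, path[1:]):
            flow[e] -= w
        paths.append((w, path))
    return paths

f = {('s', 'a'): 3, ('a', 't'): 2, ('a', 'b'): 1, ('b', 't'): 1}
print(decompose_flow(f, 's', 't'))
# [(2, ['s', 'a', 't']), (1, ['s', 'a', 'b', 't'])]
```

The feasibility question studied in the paper asks, in addition, whether the weighted paths can be chosen so that each subpath in ℛ lies inside some chosen path.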
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.17/LIPIcs.WABI.2022.17.pdf
Flow networks
flow decomposition
subpath constraints
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
18:1
18:12
10.4230/LIPIcs.WABI.2022.18
article
Prefix-Free Parsing for Building Large Tunnelled Wheeler Graphs
Goga, Adrián
1
Baláž, Andrej
2
Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia
Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia
We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019).
Wheeler graphs are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation which can be further compacted by employing the idea of tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths called blocks that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e. the one that minimizes the size of the resulting Wheeler graph, is known to be NP-complete and remains the most computationally challenging part of the tunnelling process.
To find an adequate set of blocks in less time, we propose a new method based on prefix-free parsing (PFP). The idea of PFP is to divide the input text into phrases of roughly equal size that overlap by a fixed number of characters. The phrases are then sorted lexicographically. The original text is represented by a sequence of phrase ranks (the parse) and a list of all distinct phrases (the dictionary). In repetitive texts, the PFP representation is generally much shorter than the original text, since each distinct phrase is stored only once in the dictionary but typically occurs many times in the parse.
To speed up the block selection for tunnelling, we apply the PFP to obtain the parse and the dictionary of the original text, tunnel the Wheeler graph of the parse using existing heuristics and subsequently use this tunnelled parse to construct a compact Wheeler graph of the original text. Compared with constructing a Wheeler graph from the original text without PFP, our method is much faster and uses less memory on collections of pangenomic sequences. Therefore, our method enables the use of Wheeler graphs as a pangenomic reference for real-world pangenomic datasets.
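The parsing step can be illustrated with an explicit set of length-w trigger strings. This is a simplification: the actual PFP of Boucher et al. picks phrase boundaries with a rolling Karp-Rabin hash, and the trigger set and inputs below are hypothetical.

```python
def pfp(text, triggers, w):
    """Prefix-free parsing sketch: cut a phrase whenever a length-w trigger
    string starts (or at the sentinel-padded end). Consecutive phrases
    overlap by exactly w characters, so the parse determines the text."""
    t = text + "$" * w                     # sentinel suffix ends last phrase
    phrases, start = [], 0
    for i in range(1, len(t) - w + 1):
        if t[i:i + w] in triggers or i == len(t) - w:
            phrases.append(t[start:i + w])
            start = i                      # next phrase reuses the trigger
    dictionary = sorted(set(phrases))
    parse = [dictionary.index(ph) for ph in phrases]
    return dictionary, parse

d, p = pfp("GATTACATTAG", triggers={"TT"}, w=2)
print(d)  # ['GATT', 'TTACATT', 'TTAG$$']
print(p)  # [0, 1, 2]
```

On repetitive collections the same phrases recur, so the parse is short and the dictionary small relative to the input, which is what makes tunnelling the Wheeler graph of the parse cheap.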
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.18/LIPIcs.WABI.2022.18.pdf
Wheeler graphs
BWT tunnelling
prefix-free parsing
pangenomic graphs
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
19:1
19:17
10.4230/LIPIcs.WABI.2022.19
article
Pangenomic Genotyping with the Marker Array
Mun, Taher
1
2
https://orcid.org/0000-0002-3588-0883
Vaddadi, Naga Sai Kavya
1
Langmead, Ben
3
https://orcid.org/0000-0003-2437-1976
Johns Hopkins University, Baltimore MD, USA
Illumina, San Diego, USA
Johns Hopkins University, USA
We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect to large panels like the 1000 Genomes Project while avoiding the reference bias that results from aligning to a single linear reference. rowbowt can infer accurate genotypes in less time and memory than existing graph-based methods.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.19/LIPIcs.WABI.2022.19.pdf
Sequence alignment
indexing
genotyping
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
20:1
20:15
10.4230/LIPIcs.WABI.2022.20
article
Suffix Sorting via Matching Statistics
Lipták, Zsuzsanna
1
https://orcid.org/0000-0002-3233-0691
Masillo, Francesco
1
https://orcid.org/0000-0002-2078-6835
Puglisi, Simon J.
2
3
https://orcid.org/0000-0001-7668-7636
Department of Computer Science, University of Verona, Italy
Helsinki Institute for Information Technology (HIIT), Finland
Department of Computer Science, University of Helsinki, Finland
We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experimental evidence with a prototype implementation (a tool we call sacamats) shows that on string collections with highly similar strings we can construct the suffix array in time competitive with or faster than the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest.
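As a reference point for what matching statistics are, here is a quadratic-time brute-force sketch; the paper instead builds a compressed representation of these values using an index of the reference:

```python
def matching_statistics(s, ref):
    """Naive matching statistics: ms[i] is the length of the longest
    prefix of s[i:] that occurs somewhere in ref (brute force, O(n^2 m))."""
    ms = []
    for i in range(len(s)):
        l = 0
        while i + l < len(s) and s[i:i + l + 1] in ref:
            l += 1
        ms.append(l)
    return ms

print(matching_statistics("banana", "ananas"))  # [0, 5, 4, 3, 2, 1]
```

Runs of positions where the matching statistic decreases by exactly one (as in the tail above) compress well, which is what the algorithm exploits to order suffixes of the collection relative to the reference.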
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.20/LIPIcs.WABI.2022.20.pdf
Generalized suffix array
matching statistics
string collections
compressed representation
data structures
efficient algorithms
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
21:1
21:22
10.4230/LIPIcs.WABI.2022.21
article
A Maximum Parsimony Principle for Multichromosomal Complex Genome Rearrangements
Simonaitis, Pijus
1
https://orcid.org/0000-0003-3576-8098
Raphael, Benjamin J.
1
https://orcid.org/0000-0003-1274-048X
Department of Computer Science, Princeton University, Princeton, NJ, USA
Motivation. Complex genome rearrangements, such as chromothripsis and chromoplexy, are common in cancer and have also been reported in individuals with various developmental and neurological disorders. These mutations are proposed to involve simultaneous breakage of the genome at many loci and rejoining of these breaks that produce highly rearranged genomes. Since genome sequencing measures only the novel adjacencies present at the time of sequencing, determining whether a collection of novel adjacencies resulted from a complex rearrangement is a complicated and ill-posed problem. Current heuristics for this problem often result in the inference of complex rearrangements that affect many chromosomes.
Results. We introduce a model for complex rearrangements that builds upon the methods developed for analyzing simple genome rearrangements such as inversions and translocations. While nearly all of these existing methods use a maximum parsimony assumption of minimizing the number of rearrangements, we propose an alternative maximum parsimony principle based on minimizing the number of chromosomes involved in a rearrangement scenario. We show that our model leads to inference of more plausible sequences of rearrangements that better explain a complex congenital rearrangement in a human genome and chromothripsis events in 22 cancer genomes.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.21/LIPIcs.WABI.2022.21.pdf
Genome rearrangements
maximum parsimony
cancer evolution
chromothripsis
structural variation
affected chromosomes
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
22:1
22:14
10.4230/LIPIcs.WABI.2022.22
article
Locality-Sensitive Bucketing Functions for the Edit Distance
Chen, Ke
1
https://orcid.org/0000-0001-5470-6621
Shao, Mingfu
1
2
https://orcid.org/0000-0001-6112-5139
Department of Computer Science and Engineering, School of Electronic Engineering and Computer Science, The Pennsylvania State University, University Park, PA, United States
Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, United States
Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rate, but encounter much reduced sensitivity on data with high error rate. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Here we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be (d₁, d₂)-sensitive if any two sequences within an edit distance of d₁ are mapped into at least one shared bucket, and any two sequences with distance at least d₂ are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of (d₁,d₂) and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. These results provide theoretical foundations for their practical use in analyzing sequences with high error rate while also providing insights for the hardness of designing ungapped LSH functions.
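The (d₁, d₂)-sensitive definition can be checked by brute force on small sequence sets. The sketch below is illustrative only: `is_lsb` tests the definition directly, and the deletion-neighborhood bucketing at the end is a hypothetical example, not one of the paper's constructions.

```python
from itertools import combinations

def edit_distance(a, b):
    """Standard dynamic-programming (Levenshtein) edit distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def is_lsb(bucket_fn, seqs, d1, d2):
    """Brute-force check of the (d1, d2)-sensitive definition: pairs within
    edit distance d1 must share a bucket; pairs at distance >= d2 must map
    to disjoint bucket subsets."""
    for x, y in combinations(seqs, 2):
        d = edit_distance(x, y)
        shared = bool(bucket_fn(x) & bucket_fn(y))
        if (d <= d1 and not shared) or (d >= d2 and shared):
            return False
    return True

# The identity bucketing {s} is trivially (0, 1)-sensitive.
assert is_lsb(lambda s: {s}, ["AC", "AG", "TT"], 0, 1)

# Hypothetical bucketing: map a sequence to its single-deletion variants;
# two sequences differing by one substitution then share a bucket.
del_bucket = lambda s: {s[:i] + s[i + 1:] for i in range(len(s))}
assert del_bucket("ACG") & del_bucket("ATG")  # share "AG"
```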
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.22/LIPIcs.WABI.2022.22.pdf
Locality-sensitive hashing
locality-sensitive bucketing
long reads
embedding
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
23:1
23:19
10.4230/LIPIcs.WABI.2022.23
article
phyBWT: Alignment-Free Phylogeny via eBWT Positional Clustering
Guerrini, Veronica
1
https://orcid.org/0000-0001-8888-9243
Conte, Alessio
1
https://orcid.org/0000-0003-0770-2235
Grossi, Roberto
1
https://orcid.org/0000-0002-7985-4222
Liti, Gianni
2
https://orcid.org/0000-0002-2318-0775
Rosone, Giovanna
1
https://orcid.org/0000-0001-5075-1214
Tattini, Lorenzo
2
https://orcid.org/0000-0002-5477-084X
Dipartimento di Informatica, University of Pisa, Italy
CNRS UMR 7284, INSERM U 1081, Université Côte d'Azur, France
Molecular phylogenetics is a fundamental branch of biology. It studies the evolutionary relationships among the individuals of a population through their biological sequences, and may provide insights about the origin and the evolution of viral diseases, or highlight complex evolutionary trajectories.
In this paper we develop a method called phyBWT, describing how to use the extended Burrows-Wheeler Transform (eBWT) of a collection of DNA sequences to reconstruct its phylogeny directly, bypassing both alignment against a reference genome and de novo assembly. phyBWT hinges on the combinatorial properties of the eBWT positional clustering framework. We employ the eBWT to detect relevant blocks of the longest shared substrings of varying length (unlike k-mer-based approaches, which need to fix the length k a priori), and build a suitable decomposition leading to a phylogenetic tree, step by step. As a result, phyBWT is a new alignment-, assembly-, and reference-free method that builds a partition tree without relying on the pairwise comparison of sequences, thus avoiding the use of a distance matrix to infer the phylogeny.
The preliminary experimental results on sequencing data show that our method can handle datasets of different types (short reads, contigs, or entire genomes), producing trees of quality comparable to that found in the benchmark phylogeny.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.23/LIPIcs.WABI.2022.23.pdf
Phylogeny
partition tree
BWT
positional cluster
alignment-free
reference-free
assembly-free
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
24:1
24:22
10.4230/LIPIcs.WABI.2022.24
article
Gene Orthology Inference via Large-Scale Rearrangements for Partially Assembled Genomes
Rubert, Diego P.
1
2
https://orcid.org/0000-0002-4131-7309
Braga, Marília D. V.
2
https://orcid.org/0000-0002-4092-2646
Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, MS, Brasil
Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Germany
Recently we developed a gene orthology inference tool based on genome rearrangements (Journal of Bioinformatics and Computational Biology 19:6, 2021). Given a set of genomes, our method first computes all pairwise gene similarities. Then it runs pairwise ILP comparisons to compute optimal gene matchings, which minimize, by taking the similarities into account, the weighted rearrangement distance between the analyzed genomes (a problem that is NP-hard). The gene matchings are then integrated into gene families in the final step. Although the ILP is quite efficient and could conceptually analyze genomes that are not completely assembled but split into several contigs, our tool failed to complete that task. The main reason is that each pairwise ILP comparison includes an optimal capping that connects each end of a linear segment of one genome to an end of a linear segment of the other genome, producing an exponential increase of the search space.
In this work, we design and implement a heuristic capping algorithm that replaces the optimal capping by clustering the linear segments (based on their gene content intersections) into m ≥ 1 subsets, whose ends are capped independently. Furthermore, in each subset, instead of allowing all possible connections, we let only the ends of content-related segments be connected. Although there is no guarantee that m is much bigger than one, and with the possible side effect of producing sub-optimal instead of optimal gene matchings, the heuristic works very well in practice, in terms of both speed and the quality of the computed solutions. Our experiments on real data show that we can now efficiently analyze fruit fly genomes with unfinished assemblies distributed in hundreds or even thousands of contigs, obtaining orthologies that are more similar to FlyBase orthologies than those computed by other inference tools. Moreover, for complete assemblies the version with heuristic capping reports orthologies that are very similar to those computed by the optimal version of our tool. Our approach is implemented into a pipeline incorporating the pre-computation of gene similarities.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.24/LIPIcs.WABI.2022.24.pdf
Comparative genomics
double-cut-and-join
indels
gene orthology
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
25:1
25:15
10.4230/LIPIcs.WABI.2022.25
article
Toward Optimal Fingerprint Indexing for Large Scale Genomics
Agret, Clément
1
2
Cazaux, Bastien
1
Limasset, Antoine
1
Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
LIRMM, Univ Montpellier, CNRS, Montpellier, France
Motivation. To keep up with the scale of genomic databases, several methods rely on locality-sensitive hashing to efficiently find potential matches within large genome collections. Existing solutions rely on MinHash or HyperLogLog fingerprints and require reading the whole index to perform a query. Such solutions cannot be considered scalable given the growing number of documents to index.
Results. We present NIQKI, a novel structure with well-designed fingerprints that leads to theoretical and practical query-time improvements, outperforming the state of the art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of HyperMinHash fingerprints into (h,m)-HMH fingerprints, which can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure, based on inverted indexes, able to index any kind of fingerprint and answering queries in optimal time, namely linear in the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and compute pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than the state of the art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling on extensive genomic databases.
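For intuition, an inverted index keyed by (fingerprint slot, value) pairs makes query work proportional to the number of shared entries actually reported. A minimal MinHash-based sketch, illustrative only; NIQKI's (h,m)-HMH fingerprints and engineering are more involved:

```python
import hashlib

def minhash(kmers, num_hashes):
    """MinHash fingerprint: for each seeded hash function keep the minimum
    hash value over the k-mer set (blake2b is used for determinism)."""
    return [min(int(hashlib.blake2b(f"{seed}:{km}".encode(),
                                    digest_size=8).hexdigest(), 16)
                for km in kmers)
            for seed in range(num_hashes)]

def build_index(fingerprints):
    """Inverted index mapping (slot, value) -> list of document ids."""
    index = {}
    for doc, fp in fingerprints.items():
        for slot, v in enumerate(fp):
            index.setdefault((slot, v), []).append(doc)
    return index

def query(index, fp):
    """Count shared fingerprint slots per document; the work done is
    proportional to the number of matches reported (the output size)."""
    hits = {}
    for slot, v in enumerate(fp):
        for doc in index.get((slot, v), []):
            hits[doc] = hits.get(doc, 0) + 1
    return hits

kmers_a = {"ACG", "CGT", "GTA", "TAC"}
idx = build_index({"genomeA": minhash(kmers_a, 8)})
print(query(idx, minhash(kmers_a, 8)))  # genomeA matches on all 8 slots
```

Because only colliding (slot, value) keys are touched, dissimilar documents cost a query nothing, which is the property that lets such indexes scale to very large collections.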
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.25/LIPIcs.WABI.2022.25.pdf
Data Structure
Indexation
Locality-Sensitive Hashing
Genomes
Databases