eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
1
474
10.4230/LIPIcs.WABI.2022
article
LIPIcs, Volume 242, WABI 2022, Complete Volume
Boucher, Christina
1
https://orcid.org/0000-0001-9509-9725
Rahmann, Sven
2
https://orcid.org/0000-0002-8536-6065
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
Department of Computer Science and Center for Bioinformatics, Saarland University, Saarbrücken, Germany
LIPIcs, Volume 242, WABI 2022, Complete Volume
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022/LIPIcs.WABI.2022.pdf
LIPIcs, Volume 242, WABI 2022, Complete Volume
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
0:i
0:xii
10.4230/LIPIcs.WABI.2022.0
article
Front Matter, Table of Contents, Preface, Conference Organization
Boucher, Christina
1
https://orcid.org/0000-0001-9509-9725
Rahmann, Sven
2
https://orcid.org/0000-0002-8536-6065
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
Department of Computer Science and Center for Bioinformatics, Saarland University, Saarbrücken, Germany
Front Matter, Table of Contents, Preface, Conference Organization
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.0/LIPIcs.WABI.2022.0.pdf
Front Matter
Table of Contents
Preface
Conference Organization
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
1:1
1:2
10.4230/LIPIcs.WABI.2022.1
article
Efficient Solutions to Biological Problems Using de Bruijn Graphs (Invited Talk)
Salmela, Leena
1
https://orcid.org/0000-0002-0756-543X
University of Helsinki, Finland
The de Bruijn graph has become a standard method in the analysis of sequencing reads in computational biology due to its ability to represent the information contained in large read sets in small space. A de Bruijn graph represents a set of sequencing reads by its k-mers, i.e. the set of substrings of length k that occur in the reads. In the classical definition, the k-mers are the edges of the graph and the nodes are the (k-1)-base-long prefixes and suffixes of the k-mers. Usually only k-mers occurring several times in the read set are kept to filter out noise in the data. De Bruijn graphs have been used to solve many problems in computational biology including genome assembly [Ramana M. Idury and Michael S. Waterman, 1995; Pavel A. Pevzner et al., 2001; Anton Bankevich et al., 2012; Yu Peng et al., 2010], sequencing error correction [Leena Salmela and Eric Rivals, 2014; Giles Miclotte et al., 2016; Leena Salmela et al., 2017; Limasset et al., 2019], reference free variant calling [Raluca Uricaru et al., 2015], indexing read sets [Camille Marchet et al., 2021], and so on. Next I will discuss two of these problems in more depth.
The de Bruijn graph first emerged in computational biology in the context of genome assembly [Ramana M. Idury and Michael S. Waterman, 1995; Pavel A. Pevzner et al., 2001] where the task is to reconstruct a genome based on sequencing reads. As the de Bruijn graph can represent large read sets compactly, it became the standard approach to assemble short reads [Anton Bankevich et al., 2012; Yu Peng et al., 2010]. In the theoretical framework of de Bruijn graph based genome assembly, a genome is thought to correspond to an Eulerian path in the de Bruijn graph built on the sequencing reads. In practice, the Eulerian path is not unique and thus not useful in the biological context. Therefore, practical implementations report subpaths that are guaranteed to be part of any Eulerian path and thus part of the actual genome. Such models include unitigs, which are nonbranching paths of the de Bruijn graph, and more involved definitions such as omnitigs [Alexandru I. Tomescu and Paul Medvedev, 2017].
In genome assembly the choice of k is a crucial matter. A small k can result in a tangled graph, whereas too large a k will fragment the graph. Furthermore, a different value of k may be optimal for different parts of the genome. Variable order de Bruijn graphs [Christina Boucher et al., 2015; Djamal Belazzougui et al., 2016], which represent de Bruijn graphs of all orders k in a single data structure, have been proposed as a solution but no rigorous definition corresponding to unitigs has been presented. We give the first definition of assembled sequences, i.e. contigs, on such graphs and an algorithm for enumerating them.
Another problem that can be solved with de Bruijn graphs is the correction of sequencing errors [Leena Salmela and Eric Rivals, 2014; Giles Miclotte et al., 2016; Leena Salmela et al., 2017; Limasset et al., 2019]. Because each position of a genome is sequenced several times, it is possible to correct sequencing errors in reads if we can identify data originating from the same genomic region. A de Bruijn graph can be used to represent compactly the reliable information and the individual reads can be corrected by aligning them to the graph.
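The classical edge-centric construction described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in any of the cited tools; the threshold `min_count` plays the role of the noise filter mentioned in the abstract.

```python
from collections import Counter, defaultdict

def de_bruijn_graph(reads, k, min_count=2):
    """Edge-centric de Bruijn graph: k-mers occurring at least
    min_count times are the edges; their (k-1)-mer prefixes and
    suffixes are the nodes."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    graph = defaultdict(set)
    for kmer, c in counts.items():
        if c >= min_count:
            # edge from the k-mer's prefix node to its suffix node
            graph[kmer[:-1]].add(kmer[1:])
    return graph
```

Unitigs then correspond to maximal nonbranching paths in this graph.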
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.1/LIPIcs.WABI.2022.1.pdf
de Bruijn graph
variable order de Bruijn graph
genome assembly
sequencing error correction
k-mers
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
2:1
2:21
10.4230/LIPIcs.WABI.2022.2
article
Eulertigs: Minimum Plain Text Representation of k-mer Sets Without Repetitions in Linear Time
Schmidt, Sebastian
1
https://orcid.org/0000-0003-4878-2809
Alanko, Jarno N.
1
https://orcid.org/0000-0002-8003-9225
University of Helsinki, Finland
A fundamental operation in computational genomics is to reduce the input sequences to their constituent k-mers. For maximum performance of downstream applications it is important to store the k-mers in small space, while keeping the representation easy and efficient to use (i.e. without k-mer repetitions and in plain text). Recently, heuristics were presented to compute a near-minimum such representation. We present an algorithm to compute a minimum representation in optimal (linear) time and use it to evaluate the existing heuristics. For that, we present a formalisation of arc-centric bidirected de Bruijn graphs and carefully prove that it accurately models the k-mer spectrum of the input. Our algorithm first constructs the de Bruijn graph in linear time in the length of the input strings (for a fixed-size alphabet). Then it uses an Eulerian-cycle-based algorithm to compute the minimum representation, in time linear in the size of the output.
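The Eulerian-cycle step can be illustrated generically with Hierholzer's algorithm on a directed multigraph; this sketch ignores the bidirected arc-centric details the paper formalises and assumes the circuit exists.

```python
def eulerian_circuit(adj, start):
    """Hierholzer's algorithm. adj maps each node to a list of
    successors (a multigraph: repeats allowed); edges are consumed
    in place. Returns the circuit as a list of visited nodes."""
    stack, circuit = [start], []
    while stack:
        v = stack[-1]
        if adj.get(v):
            # follow an unused outgoing edge
            stack.append(adj[v].pop())
        else:
            # dead end: node is finished, emit it
            circuit.append(stack.pop())
    return circuit[::-1]
```

Spelling the node labels along such a circuit is what yields a repetition-free plain-text representation of the k-mer set.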
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.2/LIPIcs.WABI.2022.2.pdf
Spectrum preserving string sets
Eulerian cycle
Suffix tree
Bidirected arc-centric de Bruijn graph
k-mer based methods
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
3:1
3:22
10.4230/LIPIcs.WABI.2022.3
article
Predicting Horizontal Gene Transfers with Perfect Transfer Networks
López Sánchez, Alitzel
1
Lafond, Manuel
1
Computer Science Department, Université de Sherbrooke, Canada
Horizontal gene transfer inference approaches are usually based on gene sequences: parametric methods search for patterns that deviate from a particular genomic signature, while phylogenetic methods use sequences to reconstruct the gene and species trees. However, it is well-known that sequences have difficulty identifying ancient transfers since mutations have enough time to erase all evidence of such events. In this work, we ask whether character-based methods can predict gene transfers. Their advantage over sequences is that homologous genes can have low DNA similarity, but still have retained enough important common motifs that allow them to have common character traits, for instance the same functional or expression profile. A phylogeny that has two separate clades that acquired the same character independently might indicate the presence of a transfer even in the absence of sequence similarity.
We introduce perfect transfer networks, which are phylogenetic networks that can explain the character diversity of a set of taxa. This problem has been studied extensively in the form of ancestral recombination networks, but these only model hybridization events and do not differentiate between direct parents and lateral donors. We focus on tree-based networks, in which edges representing vertical descent are clearly distinguished from those that represent horizontal transmission. Our model is a direct generalization of perfect phylogeny models to such networks. Our goal is to initiate a study on the structural and algorithmic properties of perfect transfer networks. We then show that in polynomial time, one can decide whether a given network is a valid explanation for a set of taxa, and show how, for a given tree, one can add transfer edges to it so that it explains a set of taxa.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.3/LIPIcs.WABI.2022.3.pdf
Horizontal gene transfer
tree-based networks
perfect phylogenies
character-based
gene-expression
indirect phylogenetic methods
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
4:1
4:14
10.4230/LIPIcs.WABI.2022.4
article
Haplotype Threading Using the Positional Burrows-Wheeler Transform
Sanaullah, Ahsan
1
Zhi, Degui
2
Zhang, Shaojie
1
Department of Computer Science, University of Central Florida, Orlando, FL, USA
School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
In the classic model of population genetics, one haplotype (query) is considered as a mosaic copy of segments from a number of haplotypes in a panel, or threading the haplotype through the panel. The Li and Stephens model parameterized this problem using a hidden Markov model (HMM). However, HMM algorithms are linear in the sample size, and can be very expensive for biobank-scale panels. Here, we formulate the haplotype threading problem as the Minimal Positional Substring Cover problem, where a query is represented by a mosaic of a minimal number of substring matches from the panel. We show that this problem can be solved by a sequential set of greedy set maximal matches. Moreover, the solution space is bounded by the left-most and right-most solutions obtained by the greedy approach. Based on these results, we formulate and solve several variations of this problem. Although our results are yet to be generalized to the cases with mismatches, they offer a theoretical framework for designing methods for genotype imputation and haplotype phasing.
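The greedy cover idea can be sketched as a generic interval-cover routine. The interface is hypothetical: `reach[i]` stands in for the (PBWT-computed) exclusive end of the longest panel match starting at query position i; the paper's actual algorithms operate on set maximal matches directly.

```python
def minimal_cover(reach, n):
    """Greedy minimal cover of query positions 0..n-1 by panel
    matches. reach[i] = exclusive end of the longest match starting
    at i. Returns (start, end) segments, or None if some position
    cannot be covered. O(n^2) for clarity, not efficiency."""
    segments, covered = [], 0
    while covered < n:
        # among matches starting at or before the first uncovered
        # position, pick the one extending furthest right
        start = max(range(covered + 1), key=lambda j: reach[j])
        end = reach[start]
        if end <= covered:
            return None  # gap: no match extends past `covered`
        segments.append((start, end))
        covered = end
    return segments
```

Choosing the furthest-reaching match at each step is the standard exchange argument behind greedy interval covers.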
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.4/LIPIcs.WABI.2022.4.pdf
Substring Cover
PBWT
Haplotype Threading
Haplotype Matching
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
5:1
5:20
10.4230/LIPIcs.WABI.2022.5
article
Non-Binary Tree Reconciliation with Endosymbiotic Gene Transfer
Gascon, Mathieu
1
El-Mabrouk, Nadia
2
Département d'informatique et de recherche opérationnelle (DIRO), Université de Montréal, Canada
DIRO, Université de Montréal, Canada
Gene transfer between the mitochondrial and nuclear genome of the same species, called endosymbiotic gene transfer (EGT), is a mechanism which has largely shaped gene contents in eukaryotes since a unique ancestral endosymbiotic event known to be at the origin of all mitochondria. The gene tree-species tree reconciliation model has been recently extended to account for EGTs: given a binary gene tree and a binary species tree, the EndoRex software outputs an optimal DLE-Reconciliation, that is an embedding of the gene tree into the species tree inducing a most parsimonious history of Duplications, Losses and EGT events. Here, we provide the first algorithmic study for DLE-Reconciliation in the case of a multifurcated (non-binary) gene tree. We present a general two-step method: first, ignoring the mitochondrial-nuclear (or 0-1) labeling of leaves, output a binary resolution minimizing the DL-Reconciliation; second, for each resolution, assign a known number of 0s and 1s to the leaves in a way minimizing EGT events. While Step 1 corresponds to the well-studied non-binary DL-Reconciliation problem, the complexity of the formal label assignment problem related to Step 2 is unknown. Here, we show it is NP-complete even for a single polytomy (non-binary node). We then provide a heuristic which is exact for the unitary cost of operations, and a polynomial-time algorithm for solving a polytomy in the special case where genes are specific to a single genome (mitochondrial or nuclear) in all but one species.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.5/LIPIcs.WABI.2022.5.pdf
Reconciliation
Duplication
Endosymbiotic gene transfer
Multifurcated gene tree
Polytomy
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
6:1
6:23
10.4230/LIPIcs.WABI.2022.6
article
Constructing Founder Sets Under Allelic and Non-Allelic Homologous Recombination
Bonnet, Konstantinn
1
Marschall, Tobias
1
https://orcid.org/0000-0002-9376-1030
Doerr, Daniel
1
https://orcid.org/0000-0002-3720-6227
Institute for Medical Biometry and Bioinformatics, Heinrich Heine University, Düsseldorf, Germany
Homologous recombination between the maternal and paternal copies of a chromosome is a key mechanism for human inheritance and shapes population genetic properties of our species. However, a similar mechanism can also act between different copies of the same sequence, then called non-allelic homologous recombination (NAHR). This process can result in genomic rearrangements - including deletion, duplication, and inversion - and underlies many genomic disorders. Despite its importance for genome evolution and disease, there is a lack of computational models to study genomic loci prone to NAHR.
In this work, we propose such a computational model, providing a unified framework for both (allelic) homologous recombination and NAHR. Our model represents a set of genomes as a graph, where human haplotypes correspond to walks through this graph. We formulate two founder set problems under our recombination model, provide flow-based algorithms for their solution, and demonstrate scalability to problem instances arising in practice.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.6/LIPIcs.WABI.2022.6.pdf
founder set reconstruction
variation graph
pangenomics
NAHR
homologous recombination
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
7:1
7:24
10.4230/LIPIcs.WABI.2022.7
article
Automated Design of Dynamic Programming Schemes for RNA Folding with Pseudoknots
Marchand, Bertrand
1
2
https://orcid.org/0000-0001-8060-6640
Will, Sebastian
1
https://orcid.org/0000-0002-2376-9205
Berkemer, Sarah J.
1
https://orcid.org/0000-0003-2028-7670
Bulteau, Laurent
2
https://orcid.org/0000-0003-1645-9345
Ponty, Yann
1
https://orcid.org/0000-0002-7615-3930
LIX (UMR 7161), Ecole Polytechnique, Institut Polytechnique de Paris, France
LIGM, CNRS, Université Gustave Eiffel, F-77454 Marne-la-Vallée, France
Despite being a textbook application of dynamic programming (DP) and routine task in RNA structure analysis, RNA secondary structure prediction remains challenging whenever pseudoknots come into play. To circumvent the NP-hardness of energy minimization in realistic energy models, specialized algorithms have been proposed for restricted conformation classes that capture the most frequently observed configurations.
While these methods rely on hand-crafted DP schemes, we generalize and fully automate the design of DP pseudoknot prediction algorithms. We formalize the problem of designing DP algorithms for an (infinite) class of conformations, modeled by (a finite number of) fatgraphs, and automatically build DP schemes minimizing their algorithmic complexity. We propose an algorithm for the problem, based on the tree-decomposition of a well-chosen representative structure, which we simplify and reinterpret as a DP scheme. The algorithm is fixed-parameter tractable in the treewidth tw of the fatgraph, and its output represents a 𝒪(n^{tw+1}) algorithm for predicting the MFE folding of an RNA of length n.
Our general framework supports general energy models, partition function computations, recursive substructures and partial folding, and could pave the way for algebraic dynamic programming beyond the context-free case.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.7/LIPIcs.WABI.2022.7.pdf
RNA folding
treewidth
dynamic programming
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
8:1
8:24
10.4230/LIPIcs.WABI.2022.8
article
Fast and Accurate Species Trees from Weighted Internode Distances
Liu, Baqiao
1
https://orcid.org/0000-0002-4210-8269
Warnow, Tandy
1
https://orcid.org/0000-0001-7717-3514
Department of Computer Science, University of Illinois Urbana-Champaign, IL, USA
Species tree estimation is a basic step in many biological research projects, but is complicated by the fact that gene trees can differ from the species tree due to processes such as incomplete lineage sorting (ILS), gene duplication and loss (GDL), and horizontal gene transfer (HGT), which can cause different regions within the genome to have different evolutionary histories (i.e., "gene tree heterogeneity"). One approach to estimating species trees in the presence of gene tree heterogeneity resulting from ILS operates by computing trees on each genomic region (i.e., computing "gene trees") and then using these gene trees to define a matrix of average internode distances, where the internode distance in a tree T between two species x and y is the number of nodes in T between the leaves corresponding to x and y. Given such a matrix, a tree can then be computed using methods such as neighbor joining. Methods such as ASTRID and NJst (which use this basic approach) are provably statistically consistent and very fast (low degree polynomial time), and have had high accuracy under many conditions, making them competitive with other popular species tree estimation methods. In this study, inspired by the very recent work of weighted ASTRAL, we present weighted ASTRID, a variant of ASTRID that takes the branch uncertainty on the gene trees into account in the internode distance. Our experimental study evaluating weighted ASTRID shows improvements in accuracy compared to the original (unweighted) ASTRID while remaining fast. Moreover, weighted ASTRID shows competitive accuracy against weighted ASTRAL, the state of the art. Thus, this study provides a new and very fast method for species tree estimation that improves upon ASTRID and has comparable accuracy with the state of the art while remaining much faster. Weighted ASTRID is available at https://github.com/RuneBlaze/internode.
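The (unweighted) internode distance defined above can be computed directly on a tree given as an adjacency dict; this is an illustrative sketch of the distance-matrix entry, not the ASTRID implementation, and the tree encoding is an assumption.

```python
from collections import deque

def internode_distance(adj, x, y):
    """Number of nodes strictly between leaves x and y in a tree
    given as {node: [neighbors]} (the quantity averaged over gene
    trees to fill the distance matrix)."""
    prev = {x: None}
    q = deque([x])
    while q:  # BFS from x until y is reached
        v = q.popleft()
        if v == y:
            break
        for w in adj[v]:
            if w not in prev:
                prev[w] = v
                q.append(w)
    # walk back from y, counting internal nodes on the path
    count, v = 0, prev[y]
    while v is not None and v != x:
        count += 1
        v = prev[v]
    return count
```

Averaging this quantity over all gene trees containing both species, then running neighbor joining on the matrix, is the basic ASTRID/NJst pipeline the abstract describes.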
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.8/LIPIcs.WABI.2022.8.pdf
Species tree estimation
ASTRID
ASTRAL
multi-species coalescent
incomplete lineage sorting
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
9:1
9:20
10.4230/LIPIcs.WABI.2022.9
article
On Weighted k-mer Dictionaries
Pibiri, Giulio Ermanno
1
2
https://orcid.org/0000-0003-0724-7092
Ca' Foscari University of Venice, Venice, Italy
ISTI-CNR, Pisa, Italy
We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing.
In this work we extend the recently introduced SSHash dictionary (Pibiri, Bioinformatics 2022) to also store compactly the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing (several times) better compression than the empirical entropy of the weights. We also study the problem of reducing the number of runs in the weights to improve compression even further and illustrate a lower bound for this problem. We propose an efficient greedy algorithm to reduce the number of runs and show empirically that it performs well, i.e., very similarly to the lower bound. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. To date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.
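The run-encoding effect is easy to see with a plain run-length encoder: when the dictionary's k-mer order places equal-abundance k-mers consecutively, the weight sequence collapses to a few (value, length) pairs. This is a minimal sketch of the idea, not SSHash's actual encoding.

```python
def rle(weights):
    """Run-length encode a sequence of k-mer weights as
    (value, run_length) pairs."""
    runs = []
    for w in weights:
        if runs and runs[-1][0] == w:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([w, 1])  # start a new run
    return [tuple(r) for r in runs]
```

With long runs the number of pairs falls well below what per-symbol entropy coding of the multiset of weights could achieve, which is the effect the abstract refers to.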
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.9/LIPIcs.WABI.2022.9.pdf
K-Mers
Weights
Compression
Hashing
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
10:1
10:20
10.4230/LIPIcs.WABI.2022.10
article
Accurate k-mer Classification Using Read Profiles
Suzuki, Yoshihiko
1
https://orcid.org/0000-0002-8807-2206
Myers, Gene
1
2
3
https://orcid.org/0000-0002-6580-7839
Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
Center for Systems Biology Dresden, Dresden, Germany
Contiguous strings of length k, called k-mers, are a fundamental element in many bioinformatics tasks. The number of occurrences of a k-mer in a given set of DNA sequencing reads, its k-mer count, has often been used to roughly estimate the copy number of a k-mer in the genome from which the reads were sampled. The problem of estimating copy numbers, called here the k-mer classification problem, has been based on simply analyzing the histogram of counts of all the k-mers in a data set, thus ignoring the positional context and dependency between multiple k-mers that appear nearby in the underlying genome. Here we present an efficient and significantly more accurate method for classifying k-mers by analyzing the sequence of k-mer counts along each sequencing read, called a read profile. By analyzing read profiles, we explicitly incorporate into the model the dependencies between the positionally adjacent k-mers and the sequence context-dependent error rates estimated from the given dataset. For long sequencing reads produced with the accurate high-fidelity (HiFi) sequencing technology, an implementation of our method, ClassPro, outperforms the conventional, histogram-based method in every simulation dataset of fruit fly and human with various realistic values of sequencing coverage and heterozygosity. Within only a few minutes, ClassPro achieves an average accuracy of > 99.99% across reads without repetitive k-mers and > 99.5% across all reads, in a typical fruit fly simulation data set with a 40× coverage. The resulting, more accurate k-mer classifications by ClassPro are in principle expected to improve any k-mer-based downstream analyses for sequenced reads such as read mapping and overlap, spectral alignment and error correction, haplotype phasing, and trio binning to name but a few. ClassPro is available at https://github.com/yoshihikosuzuki/ClassPro.
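A read profile as defined above is simply the sequence of global counts of the read's consecutive k-mers; this sketch assumes a precomputed count table and is not the ClassPro implementation.

```python
def read_profile(read, counts, k):
    """Profile of k-mer counts along a read. counts maps a k-mer
    to its count in the whole dataset; k-mers absent from the
    table get 0."""
    return [counts.get(read[i:i + k], 0) for i in range(len(read) - k + 1)]
```

A sharp local drop in the profile typically signals a sequencing error rather than a genuine low-copy region, which is the positional dependency a histogram of counts cannot capture.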
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.10/LIPIcs.WABI.2022.10.pdf
K-mer
K-mer count
K-mer classification
HiFi sequencing
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
11:1
11:19
10.4230/LIPIcs.WABI.2022.11
article
New Algorithms for Structure Informed Genome Rearrangement
Ozery, Eden
1
Zehavi, Meirav
1
Ziv-Ukelson, Michal
1
Ben Gurion University of the Negev, Israel
We define two new computational problems in the domain of perfect genome rearrangements, and propose three algorithms to solve them. The rearrangement scenarios modeled by the problems consider Reversal and Block Interchange operations, and a PQ-tree is utilized to guide the allowed operations and to compute their weights. In the first problem, Constrained TreeToString Divergence (CTTSD), we define the basic structure-informed rearrangement based divergence measure. Here, we assume that the gene order members of the gene cluster from which the PQ-tree is constructed are permutations. The PQ-tree representing the gene cluster is ordered such that the series of gene IDs spelled by its leaves is equivalent to the reference gene order. Then, a structure-informed gene rearrangement measure is computed between the ordered PQ-tree and the target gene order. The second problem, TreeToString Divergence (TTSD), generalizes CTTSD, where the gene order members are not necessarily permutations and the structure-informed rearrangement based divergence measure is extended to also consider up to d_S and d_T gene insertion and deletion operations, respectively, when modelling the PQ-tree informed divergence process from the reference order to the target order.
The first algorithm solves CTTSD in O(n γ² ⋅ (m_p ⋅ 1.381^γ + m_q)) time and O(n²) space, where γ is the maximum number of children of a node, n is the length of the string and the number of leaves in the tree, and m_p and m_q are the number of P-nodes and Q-nodes in the tree, respectively. If one of the penalties of CTTSD is 0, then the algorithm runs in O(n m γ²) time and O(n²) space. The second algorithm solves TTSD in O(n² γ² {d_T}² {d_S}² m² (m_p ⋅ 5^γ γ + m_q)) time and O(d_T d_S m (m n + 5^γ)) space, where γ is the maximum number of children of a node, n is the length of the string, m is the number of leaves in the tree, m_p and m_q are the number of P-nodes and Q-nodes in the tree, respectively, and allowing d_T deletions from the tree and d_S deletions from the string. The third algorithm is intended to reduce the space complexity of the second algorithm. It solves a variant of the problem (where one of the penalties of TTSD is 0) in O(n γ² {d_T}² {d_S}² m² (m_p ⋅ 4^γ γ²n(d_T+d_S+m+n) + m_q)) time and O(γ² n m² d_T d_S (d_T+d_S+m+n)) space.
The algorithm is implemented as a software tool, denoted MEM-Rearrange, and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1,487 prokaryotic genomes.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.11/LIPIcs.WABI.2022.11.pdf
PQ-tree
Gene Cluster
Breakpoint Distance
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
12:1
12:20
10.4230/LIPIcs.WABI.2022.12
article
Fast Gapped k-mer Counting with Subdivided Multi-Way Bucketed Cuckoo Hash Tables
Zentgraf, Jens
1
2
3
https://orcid.org/0000-0001-9444-2755
Rahmann, Sven
1
2
https://orcid.org/0000-0002-8536-6065
Department of Computer Science, Saarland University, Saarbrücken, Germany
Center for Bioinformatics, Saarland University, Saarbrücken, Germany
Saarbrücken Graduate School of Computer Science, Saarland University, Saarbrücken, Germany
Motivation. In biological sequence analysis, alignment-free (also known as k-mer-based) methods are increasingly replacing mapping- and alignment-based methods for various applications. A basic step of such methods consists of building a table of all k-mers of a given set of sequences (a reference genome or a dataset of sequenced reads) and their counts. Over the past years, efficient methods and tools for k-mer counting have been developed. In a different line of work, the use of gapped k-mers has been shown to offer advantages over the use of the standard contiguous k-mers. However, no tool seems to be available that is able to count gapped k-mers with the same efficiency as contiguous k-mers. One reason is that the most efficient k-mer counters use minimizers (of a length m < k) to group k-mers into buckets, such that many consecutive k-mers are classified into the same bucket. This approach leads to cache-friendly (and hence extremely fast) algorithms, but the approach does not transfer easily to gapped k-mers. Consequently, the existing efficient k-mer counters cannot be trivially modified to count gapped k-mers with the same efficiency.
Results. We present a different approach that is equally applicable to contiguous k-mers and gapped k-mers. We use multi-way bucketed Cuckoo hash tables to efficiently store (gapped) k-mers and their counts. We also describe a method to parallelize counting over multiple threads without using locks: We subdivide the hash table into independent subtables, and use a producer-consumer model, such that each thread serves one subtable. This requires designing Cuckoo hash functions with the property that all alternative locations for each k-mer are located in the same subtable. Compared to some of the fastest contiguous k-mer counters, our approach is of comparable speed, or even faster, on large datasets, and it is the only one that supports gapped k-mers.
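The key hash-function property described above (all alternative locations of a key fall in the same subtable, so exactly one consumer thread owns it) can be sketched as follows. The hash mixing here is an illustrative stand-in, not the authors' actual functions, and the parameters are hypothetical.

```python
def locations(kmer_code, num_subtables, buckets_per_subtable, num_choices=3):
    """Candidate buckets for a key under subtable-constrained
    multi-way Cuckoo hashing: the subtable index depends only on
    the key, while the alternative bucket choices vary only
    within that subtable."""
    subtable = hash(("sub", kmer_code)) % num_subtables
    buckets = [
        hash((i, kmer_code)) % buckets_per_subtable for i in range(num_choices)
    ]
    return subtable, buckets
```

Because every displacement during Cuckoo insertion stays inside one subtable, threads never contend for the same bucket and no locks are needed, matching the producer-consumer design in the abstract.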
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.12/LIPIcs.WABI.2022.12.pdf
gapped k-mer
k-mer
counting
Cuckoo hashing
parallelization
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
13:1
13:16
10.4230/LIPIcs.WABI.2022.13
article
A Linear Time Algorithm for an Extended Version of the Breakpoint Double Distance
Braga, Marília D. V.
1
https://orcid.org/0000-0002-4092-2646
Brockmann, Leonie R.
1
Klerx, Katharina
1
Stoye, Jens
1
https://orcid.org/0000-0002-4656-7155
Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Germany
Two genomes over the same set of gene families form a canonical pair when each of them has exactly one gene from each family. A genome is circular when it contains only circular chromosomes. Different distances of canonical circular genomes can be derived from a structure called breakpoint graph, which represents the relation between the two given genomes as a collection of cycles of even length. Then, the breakpoint distance is equal to n-c_2, where n is the number of genes and c_2 is the number of cycles of length 2. Similarly, when the considered rearrangements are those modeled by the double-cut-and-join (DCJ) operation, the rearrangement distance is n-c, where c is the total number of cycles.
The distance problem is a basic unit for several other combinatorial problems related to genome evolution and ancestral reconstruction, such as median or double distance. Interestingly, both median and double distance problems can be solved in polynomial time for the breakpoint distance, while they are NP-hard for the rearrangement distance. One way of exploring the complexity space between these two extremes is to consider a σ_k distance, defined to be n-(c_2+c_4+…+c_k), and increasingly investigate the complexities of median and double distance for the σ₄ distance, then the σ₆ distance, and so on. While for the median much effort in our and other research groups has yielded no progress even for the σ₄ distance, for the double distance under the σ₄ and σ₆ distances we could devise linear-time algorithms, which we present here.
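Given the (even) cycle lengths of the breakpoint graph, the σ_k distance defined above is immediate to evaluate; this sketch assumes the cycle decomposition has already been computed.

```python
def sigma_k_distance(n, cycle_lengths, k):
    """sigma_k distance n - (c_2 + c_4 + ... + c_k): n minus the
    number of breakpoint-graph cycles of length at most k.
    k = 2 recovers the breakpoint distance n - c_2; letting k grow
    without bound recovers the DCJ distance n - c."""
    return n - sum(1 for ell in cycle_lengths if ell <= k)
```
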
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.13/LIPIcs.WABI.2022.13.pdf
Comparative genomics
genome rearrangement
breakpoint distance
double-cut-and-join (DCJ) distance
double distance
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
14:1
14:14
10.4230/LIPIcs.WABI.2022.14
article
Efficient Reconciliation of Genomic Datasets of High Similarity
Shibuya, Yoshihiro
1
https://orcid.org/0000-0002-3137-1504
Belazzougui, Djamal
2
Kucherov, Gregory
3
https://orcid.org/0000-0001-5899-5424
LIGM, Université Gustave Eiffel, Marne-la-Vallée, France
CAPA, DTISI, Centre de Recherche sur l'Information Scientifique et Technique, Algiers, Algeria
LIGM, CNRS, Université Gustave Eiffel, Marne-la-Vallée, France
We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originating from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating the Jaccard similarity of the underlying k-mer sets, compared to MinHash, which is a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that the involved data structures require space proportional to the difference of the k-mer sets and are independent of the size of the sets themselves. As another application, we show how our ideas can be applied in order to efficiently compute (an approximation of) the k-mers that differ between two datasets, still using space only proportional to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus pneumoniae genomes).
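Closed syncmers can be selected by inspecting each k-mer in isolation, which is what makes them context-independent, unlike minimizers. A small illustrative sketch, using lexicographic order in place of the hash-based order normally used in practice:

```python
def closed_syncmers(seq, k, s):
    """Select closed syncmers: k-mers whose minimal s-mer (here by
    lexicographic order) occurs at the first or last of the k-s+1
    positions. The decision depends only on the k-mer itself."""
    picked = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        m = min(smers)
        if smers[0] == m or smers[-1] == m:
            picked.append(kmer)
    return picked

print(closed_syncmers("ACGTACGT", k=5, s=2))  # ['ACGTA', 'CGTAC']
```

Because the test is per-k-mer, two datasets sample the same k-mer identically regardless of its surrounding context, which is what makes syncmer-based Jaccard estimation unbiased.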
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.14/LIPIcs.WABI.2022.14.pdf
k-mers
sketching
Invertible Bloom Lookup Tables
IBLT
MinHash
syncmers
minimizers
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
15:1
15:22
10.4230/LIPIcs.WABI.2022.15
article
WGSUniFrac: Applying UniFrac Metric to Whole Genome Shotgun Data
Wei, Wei
1
https://orcid.org/0000-0002-2024-7233
Koslicki, David
2
3
4
https://orcid.org/0000-0002-0640-954X
The Pennsylvania State University, University Park, PA, USA
Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
Department of Biology, The Pennsylvania State University, University Park, PA, USA
Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
The UniFrac metric has proven useful in revealing diversity across metagenomic communities. Due to the phylogeny-based nature of this measurement, UniFrac has historically only been applied to 16S rRNA data. Meanwhile, Whole Genome Shotgun (WGS) metagenomics has been increasingly employed and shown to provide more information than 16S data, but a UniFrac-like diversity metric suitable for WGS data has not previously been developed. The main obstacle to applying UniFrac directly to WGS data is the absence of phylogenetic distances in the taxonomic relationships derived from WGS data. In this study, we demonstrate a method to overcome this intrinsic difference and compute the UniFrac metric on WGS data by assigning branch lengths to the taxonomic tree obtained from input taxonomic profiles. We conduct a series of experiments to demonstrate that this WGSUniFrac method is comparably robust to traditional 16S UniFrac and is not highly sensitive to branch-length assignments, be they data-derived or model-prescribed.
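For intuition, unweighted UniFrac is the fraction of total branch length whose subtree contains taxa from only one of the two communities. A toy sketch on a hand-built tree with hypothetical labels (not the WGSUniFrac pipeline, which first derives the tree and branch lengths from taxonomic profiles):

```python
def unifrac(parent, length, leaves_a, leaves_b):
    """Unweighted UniFrac on a rooted tree given as child -> parent, with
    branch lengths child -> length. A branch counts as 'unique' when its
    subtree holds taxa from only one of the two communities."""
    children = {}
    for c, p in parent.items():
        children.setdefault(p, []).append(c)

    def leafset(v):                       # leaves below node v
        if v not in children:
            return {v}
        return set().union(*(leafset(c) for c in children[v]))

    unique = shared = 0.0
    for v, l in length.items():
        under = leafset(v)
        in_a, in_b = bool(under & leaves_a), bool(under & leaves_b)
        if in_a and in_b:
            shared += l
        elif in_a or in_b:
            unique += l
    return unique / (unique + shared)

# Tiny tree: root R with child X (leaves L1, L2) and leaf L3.
parent = {"X": "R", "L1": "X", "L2": "X", "L3": "R"}
length = {"X": 1.0, "L1": 1.0, "L2": 1.0, "L3": 2.0}
print(unifrac(parent, length, {"L1"}, {"L2", "L3"}))  # 4/5 = 0.8
```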
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.15/LIPIcs.WABI.2022.15.pdf
UniFrac
beta-diversity
Whole Genome Shotgun
microbial community similarity
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
16:1
16:22
10.4230/LIPIcs.WABI.2022.16
article
Reconstructing Phylogenetic Networks via Cherry Picking and Machine Learning
Bernardini, Giulia
1
2
https://orcid.org/0000-0001-6647-088X
van Iersel, Leo
3
Julien, Esther
3
Stougie, Leen
4
5
https://orcid.org/0000-0001-6938-8902
University of Trieste, Italy
CWI, Amsterdam, The Netherlands
Delft Institute of Applied Mathematics, Delft University of Technology, The Netherlands
CWI and Vrije Universiteit, Amsterdam, The Netherlands
Erable, France
Combining a set of phylogenetic trees into a single phylogenetic network that explains all of them is a fundamental challenge in evolutionary studies. In this paper, we apply the recently-introduced theoretical framework of cherry picking to design a class of heuristics that are guaranteed to produce a network containing each of the input trees, for practical-size datasets. The main contribution of this paper is the design and training of a machine learning model that captures essential information on the structure of the input trees and guides the algorithms towards better solutions. This is one of the first applications of machine learning to phylogenetic studies, and we show its promise with a proof-of-concept experimental study conducted on both simulated and real data consisting of binary trees with no missing taxa.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.16/LIPIcs.WABI.2022.16.pdf
Phylogenetics
Hybridization
Cherry Picking
Machine Learning
Heuristic
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
17:1
17:16
10.4230/LIPIcs.WABI.2022.17
article
Feasibility of Flow Decomposition with Subpath Constraints in Linear Time
Gibney, Daniel
1
Thankachan, Sharma V.
2
Aluru, Srinivas
1
Georgia Institute of Technology, Atlanta, GA, USA
North Carolina State University, Raleigh, NC, USA
The decomposition of flow-networks is an essential part of many transcriptome assembly algorithms used in Computational Biology. The addition of subpath constraints to this decomposition appeared recently as an effective way to incorporate longer, already known, portions of the transcript. The problem is defined as follows: given a weakly connected directed acyclic flow network G = (V, E, f) and a set ℛ of subpaths in G, find a flow decomposition so that every subpath in ℛ is included in some flow in the decomposition [Williams et al., WABI 2021]. The authors of that work presented an exponential time algorithm for determining the feasibility of such a flow decomposition, and more recently presented an O(|E| + L+|ℛ|³) time algorithm, where L is the sum of the path lengths in ℛ [Williams et al., TCBB 2022]. Our work provides an improved, linear O(|E| + L) time algorithm for determining the feasibility of such a flow decomposition. We also introduce two natural optimization variants of the feasibility problem: (i) determining the minimum sized subset of ℛ that must be removed to make a flow decomposition feasible, and (ii) determining the maximum sized subset of ℛ that can be maintained while making a flow decomposition feasible. We show that, under the assumption P ≠ NP, (i) does not admit a polynomial-time o(log |V|)-approximation algorithm and (ii) does not admit a polynomial-time O(|V|^{1/2-ε} + |ℛ|^{1-ε})-approximation algorithm for any constant ε > 0.
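As background for the flow-decomposition setting, without the subpath constraints that are the subject of the paper, here is a minimal greedy path-peeling sketch on a DAG flow given as an edge dictionary:

```python
def decompose_flow(flow, source, sink):
    """Greedy path peeling on a DAG flow {(u, v): value}: repeatedly trace
    a source-to-sink path through positive-flow edges and subtract its
    bottleneck. Flow conservation guarantees each trace reaches the sink."""
    paths = []
    while any(u == source and f > 0 for (u, _), f in flow.items()):
        path, v = [source], source
        while v != sink:
            v = next(b for (a, b), f in flow.items() if a == v and f > 0)
            path.append(v)
        w = min(flow[e] for e in zip(path, path[1:]))
        for e in zip(path, path[1:]):
            flow[e] -= w
        paths.append((w, path))
    return paths

f = {('s', 'a'): 3, ('a', 't'): 2, ('a', 'b'): 1, ('b', 't'): 1}
print(decompose_flow(f, 's', 't'))
# [(2, ['s', 'a', 't']), (1, ['s', 'a', 'b', 't'])]
```

The feasibility question studied in the paper asks, in addition, whether the weighted paths can be chosen so that each subpath in ℛ lies inside some chosen path.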
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.17/LIPIcs.WABI.2022.17.pdf
Flow networks
flow decomposition
subpath constraints
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
18:1
18:12
10.4230/LIPIcs.WABI.2022.18
article
Prefix-Free Parsing for Building Large Tunnelled Wheeler Graphs
Goga, Adrián
1
Baláž, Andrej
2
Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia
Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia
We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019).
Wheeler graphs are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation which can be further compacted by employing the idea of tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths called blocks that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e. the one that minimizes the size of the resulting Wheeler graph, is known to be NP-complete and remains the most computationally challenging part of the tunnelling process.
To find an adequate set of blocks in less time, we propose a new method based on prefix-free parsing (PFP). The idea of PFP is to divide the input text into phrases of roughly equal size that overlap by a fixed number of characters. The phrases are then sorted lexicographically. The original text is represented by a sequence of phrase ranks (the parse) and a list of all distinct phrases (the dictionary). In repetitive texts, the PFP representation is generally much shorter than the original text, since each distinct phrase is stored only once in the dictionary but typically occurs many times in the parse.
To speed up the block selection for tunnelling, we apply the PFP to obtain the parse and the dictionary of the original text, tunnel the Wheeler graph of the parse using existing heuristics and subsequently use this tunnelled parse to construct a compact Wheeler graph of the original text. Compared with constructing a Wheeler graph from the original text without PFP, our method is much faster and uses less memory on collections of pangenomic sequences. Therefore, our method enables the use of Wheeler graphs as a pangenomic reference for real-world pangenomic datasets.
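The parsing step can be illustrated with an explicit set of length-w trigger strings. This is a simplification: the actual PFP of Boucher et al. picks phrase boundaries with a rolling Karp-Rabin hash, and the trigger set and inputs below are hypothetical.

```python
def pfp(text, triggers, w):
    """Prefix-free parsing sketch: cut a phrase whenever a length-w trigger
    string starts (or at the sentinel-padded end). Consecutive phrases
    overlap by exactly w characters, so the parse determines the text."""
    t = text + "$" * w                     # sentinel suffix ends last phrase
    phrases, start = [], 0
    for i in range(1, len(t) - w + 1):
        if t[i:i + w] in triggers or i == len(t) - w:
            phrases.append(t[start:i + w])
            start = i                      # next phrase reuses the trigger
    dictionary = sorted(set(phrases))
    parse = [dictionary.index(ph) for ph in phrases]
    return dictionary, parse

d, p = pfp("GATTACATTAG", triggers={"TT"}, w=2)
print(d)  # ['GATT', 'TTACATT', 'TTAG$$']
print(p)  # [0, 1, 2]
```

On repetitive collections the same phrases recur, so the parse is short and the dictionary small relative to the input, which is what makes tunnelling the Wheeler graph of the parse cheap.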
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.18/LIPIcs.WABI.2022.18.pdf
Wheeler graphs
BWT tunnelling
prefix-free parsing
pangenomic graphs
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
19:1
19:17
10.4230/LIPIcs.WABI.2022.19
article
Pangenomic Genotyping with the Marker Array
Mun, Taher
1
2
https://orcid.org/0000-0002-3588-0883
Vaddadi, Naga Sai Kavya
1
Langmead, Ben
3
https://orcid.org/0000-0003-2437-1976
Johns Hopkins University, Baltimore MD, USA
Illumina, San Diego, USA
Johns Hopkins University, USA
We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect to large panels like the 1000 Genomes Project while avoiding the reference bias that results from aligning to a single linear reference. rowbowt can infer accurate genotypes in less time and memory than existing graph-based methods.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.19/LIPIcs.WABI.2022.19.pdf
Sequence alignment
indexing
genotyping
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
20:1
20:15
10.4230/LIPIcs.WABI.2022.20
article
Suffix Sorting via Matching Statistics
Lipták, Zsuzsanna
1
https://orcid.org/0000-0002-3233-0691
Masillo, Francesco
1
https://orcid.org/0000-0002-2078-6835
Puglisi, Simon J.
2
3
https://orcid.org/0000-0001-7668-7636
Department of Computer Science, University of Verona, Italy
Helsinki Institute for Information Technology (HIIT), Finland
Department of Computer Science, University of Helsinki, Finland
We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experimental evidence with a prototype implementation (a tool we call sacamats) shows that on string collections with highly similar strings we can construct the suffix array in time competitive with or faster than the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest.
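As a reference point for what matching statistics are, here is a quadratic-time brute-force sketch; the paper instead builds a compressed representation of these values using an index of the reference:

```python
def matching_statistics(s, ref):
    """Naive matching statistics: ms[i] is the length of the longest
    prefix of s[i:] that occurs somewhere in ref (brute force, O(n^2 m))."""
    ms = []
    for i in range(len(s)):
        l = 0
        while i + l < len(s) and s[i:i + l + 1] in ref:
            l += 1
        ms.append(l)
    return ms

print(matching_statistics("banana", "ananas"))  # [0, 5, 4, 3, 2, 1]
```

Runs of positions where the matching statistic decreases by exactly one (as in the tail above) compress well, which is what the algorithm exploits to order suffixes of the collection relative to the reference.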
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.20/LIPIcs.WABI.2022.20.pdf
Generalized suffix array
matching statistics
string collections
compressed representation
data structures
efficient algorithms
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
21:1
21:22
10.4230/LIPIcs.WABI.2022.21
article
A Maximum Parsimony Principle for Multichromosomal Complex Genome Rearrangements
Simonaitis, Pijus
1
https://orcid.org/0000-0003-3576-8098
Raphael, Benjamin J.
1
https://orcid.org/0000-0003-1274-048X
Department of Computer Science, Princeton University, Princeton, NJ, USA
Motivation. Complex genome rearrangements, such as chromothripsis and chromoplexy, are common in cancer and have also been reported in individuals with various developmental and neurological disorders. These mutations are proposed to involve simultaneous breakage of the genome at many loci and rejoining of these breaks that produce highly rearranged genomes. Since genome sequencing measures only the novel adjacencies present at the time of sequencing, determining whether a collection of novel adjacencies resulted from a complex rearrangement is a complicated and ill-posed problem. Current heuristics for this problem often result in the inference of complex rearrangements that affect many chromosomes.
Results. We introduce a model for complex rearrangements that builds upon the methods developed for analyzing simple genome rearrangements such as inversions and translocations. While nearly all of these existing methods use a maximum parsimony assumption of minimizing the number of rearrangements, we propose an alternative maximum parsimony principle based on minimizing the number of chromosomes involved in a rearrangement scenario. We show that our model leads to inference of more plausible sequences of rearrangements that better explain a complex congenital rearrangement in a human genome and chromothripsis events in 22 cancer genomes.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.21/LIPIcs.WABI.2022.21.pdf
Genome rearrangements
maximum parsimony
cancer evolution
chromothripsis
structural variation
affected chromosomes
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
22:1
22:14
10.4230/LIPIcs.WABI.2022.22
article
Locality-Sensitive Bucketing Functions for the Edit Distance
Chen, Ke
1
https://orcid.org/0000-0001-5470-6621
Shao, Mingfu
1
2
https://orcid.org/0000-0001-6112-5139
Department of Computer Science and Engineering, School of Electronic Engineering and Computer Science, The Pennsylvania State University, University Park, PA, United States
Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, United States
Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rate, but encounter much reduced sensitivity on data with high error rate. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Here we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be (d₁, d₂)-sensitive if any two sequences within an edit distance of d₁ are mapped into at least one shared bucket, and any two sequences with distance at least d₂ are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of (d₁,d₂) and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. These results provide theoretical foundations for their practical use in analyzing sequences with high error rate while also providing insights for the hardness of designing ungapped LSH functions.
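The (d₁, d₂)-sensitive definition can be checked by brute force on small sequence sets. The sketch below is illustrative only: `is_lsb` tests the definition directly, and the deletion-neighborhood bucketing at the end is a hypothetical example, not one of the paper's constructions.

```python
from itertools import combinations

def edit_distance(a, b):
    """Standard dynamic-programming (Levenshtein) edit distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def is_lsb(bucket_fn, seqs, d1, d2):
    """Brute-force check of the (d1, d2)-sensitive definition: pairs within
    edit distance d1 must share a bucket; pairs at distance >= d2 must map
    to disjoint bucket subsets."""
    for x, y in combinations(seqs, 2):
        d = edit_distance(x, y)
        shared = bool(bucket_fn(x) & bucket_fn(y))
        if (d <= d1 and not shared) or (d >= d2 and shared):
            return False
    return True

# The identity bucketing {s} is trivially (0, 1)-sensitive.
assert is_lsb(lambda s: {s}, ["AC", "AG", "TT"], 0, 1)

# Hypothetical bucketing: map a sequence to its single-deletion variants;
# two sequences differing by one substitution then share a bucket.
del_bucket = lambda s: {s[:i] + s[i + 1:] for i in range(len(s))}
assert del_bucket("ACG") & del_bucket("ATG")  # share "AG"
```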
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.22/LIPIcs.WABI.2022.22.pdf
Locality-sensitive hashing
locality-sensitive bucketing
long reads
embedding
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
23:1
23:19
10.4230/LIPIcs.WABI.2022.23
article
phyBWT: Alignment-Free Phylogeny via eBWT Positional Clustering
Guerrini, Veronica
1
https://orcid.org/0000-0001-8888-9243
Conte, Alessio
1
https://orcid.org/0000-0003-0770-2235
Grossi, Roberto
1
https://orcid.org/0000-0002-7985-4222
Liti, Gianni
2
https://orcid.org/0000-0002-2318-0775
Rosone, Giovanna
1
https://orcid.org/0000-0001-5075-1214
Tattini, Lorenzo
2
https://orcid.org/0000-0002-5477-084X
Dipartimento di Informatica, University of Pisa, Italy
CNRS UMR 7284, INSERM U 1081, Université Côte d'Azur, France
Molecular phylogenetics is a fundamental branch of biology. It studies the evolutionary relationships among the individuals of a population through their biological sequences, and may provide insights about the origin and the evolution of viral diseases, or highlight complex evolutionary trajectories.
In this paper we develop a method called phyBWT, describing how to use the extended Burrows-Wheeler Transform (eBWT) of a collection of DNA sequences to reconstruct its phylogeny directly, bypassing both alignment against a reference genome and de novo assembly. phyBWT hinges on the combinatorial properties of the eBWT positional clustering framework. We employ the eBWT to detect relevant blocks of the longest shared substrings of varying length (unlike k-mer-based approaches, which need to fix the length k a priori), and build a suitable decomposition leading to a phylogenetic tree, step by step. As a result, phyBWT is a new alignment-, assembly-, and reference-free method that builds a partition tree without relying on the pairwise comparison of sequences, thus avoiding the use of a distance matrix to infer the phylogeny.
The preliminary experimental results on sequencing data show that our method can handle datasets of different types (short reads, contigs, or entire genomes), producing trees of quality comparable to that found in the benchmark phylogeny.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.23/LIPIcs.WABI.2022.23.pdf
Phylogeny
partition tree
BWT
positional cluster
alignment-free
reference-free
assembly-free
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
24:1
24:22
10.4230/LIPIcs.WABI.2022.24
article
Gene Orthology Inference via Large-Scale Rearrangements for Partially Assembled Genomes
Rubert, Diego P.
1
2
https://orcid.org/0000-0002-4131-7309
Braga, Marília D. V.
2
https://orcid.org/0000-0002-4092-2646
Faculdade de Computação, Universidade Federal de Mato Grosso do Sul, Campo Grande, MS, Brasil
Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Germany
Recently we developed a gene orthology inference tool based on genome rearrangements (Journal of Bioinformatics and Computational Biology 19:6, 2021). Given a set of genomes, our method first computes all pairwise gene similarities. Then it runs pairwise ILP comparisons to compute optimal gene matchings, which minimize, by taking the similarities into account, the weighted rearrangement distance between the analyzed genomes (a problem that is NP-hard). The gene matchings are then integrated into gene families in the final step. Although the ILP is quite efficient and could conceptually analyze genomes that are not completely assembled but split into several contigs, our tool failed to complete that task. The main reason is that each pairwise ILP comparison includes an optimal capping that connects each end of a linear segment of one genome to an end of a linear segment of the other genome, producing an exponential increase of the search space.
In this work, we design and implement a heuristic capping algorithm that replaces the optimal capping by clustering the linear segments (based on their gene content intersections) into m ≥ 1 subsets, whose ends are capped independently. Furthermore, in each subset, instead of allowing all possible connections, we let only the ends of content-related segments be connected. Although there is no guarantee that m is much bigger than one, and with the possible side effect of producing sub-optimal instead of optimal gene matchings, the heuristic works very well in practice, in terms of both speed and the quality of the computed solutions. Our experiments on real data show that we can now efficiently analyze fruit fly genomes with unfinished assemblies distributed in hundreds or even thousands of contigs, obtaining orthologies that are more similar to FlyBase orthologies than those computed by other inference tools. Moreover, for complete assemblies the version with heuristic capping reports orthologies that are very similar to those computed by the optimal version of our tool. Our approach is implemented into a pipeline incorporating the pre-computation of gene similarities.
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.24/LIPIcs.WABI.2022.24.pdf
Comparative genomics
double-cut-and-join
indels
gene orthology
eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2022-08-26
242
25:1
25:15
10.4230/LIPIcs.WABI.2022.25
article
Toward Optimal Fingerprint Indexing for Large Scale Genomics
Agret, Clément
1
2
Cazaux, Bastien
1
Limasset, Antoine
1
Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
LIRMM, Univ Montpellier, CNRS, Montpellier, France
Motivation. To keep up with the scale of genomic databases, several methods rely on locality-sensitive hashing to efficiently find potential matches within large genome collections. Existing solutions rely on MinHash or HyperLogLog fingerprints and require reading the whole index to perform a query. Such solutions cannot be considered scalable given the growing number of documents to index.
Results. We present NIQKI, a novel structure with well-designed fingerprints that leads to theoretical and practical query-time improvements, outperforming the state of the art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of HyperMinHash fingerprints into (h,m)-HMH fingerprints, which can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure, based on inverted indexes, able to index any kind of fingerprint and answering queries in optimal time, namely linear in the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and compute pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than the state of the art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling on extensive genomic databases.
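For intuition, an inverted index keyed by (fingerprint slot, value) pairs makes query work proportional to the number of shared entries actually reported. A minimal MinHash-based sketch, illustrative only; NIQKI's (h,m)-HMH fingerprints and engineering are more involved:

```python
import hashlib

def minhash(kmers, num_hashes):
    """MinHash fingerprint: for each seeded hash function keep the minimum
    hash value over the k-mer set (blake2b is used for determinism)."""
    return [min(int(hashlib.blake2b(f"{seed}:{km}".encode(),
                                    digest_size=8).hexdigest(), 16)
                for km in kmers)
            for seed in range(num_hashes)]

def build_index(fingerprints):
    """Inverted index mapping (slot, value) -> list of document ids."""
    index = {}
    for doc, fp in fingerprints.items():
        for slot, v in enumerate(fp):
            index.setdefault((slot, v), []).append(doc)
    return index

def query(index, fp):
    """Count shared fingerprint slots per document; the work done is
    proportional to the number of matches reported (the output size)."""
    hits = {}
    for slot, v in enumerate(fp):
        for doc in index.get((slot, v), []):
            hits[doc] = hits.get(doc, 0) + 1
    return hits

kmers_a = {"ACG", "CGT", "GTA", "TAC"}
idx = build_index({"genomeA": minhash(kmers_a, 8)})
print(query(idx, minhash(kmers_a, 8)))  # genomeA matches on all 8 slots
```

Because only colliding (slot, value) keys are touched, dissimilar documents cost a query nothing, which is the property that lets such indexes scale to very large collections.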
https://drops.dagstuhl.de/storage/00lipics/lipics-vol242-wabi2022/LIPIcs.WABI.2022.25/LIPIcs.WABI.2022.25.pdf
Data Structure
Indexation
Locality-Sensitive Hashing
Genomes
Databases