phyBWT: Alignment-Free Phylogeny via eBWT Positional Clustering

Authors: Veronica Guerrini, Alessio Conte, Roberto Grossi, Gianni Liti, Giovanna Rosone, Lorenzo Tattini

Author Details

Veronica Guerrini
  • Dipartimento di Informatica, University of Pisa, Italy
Alessio Conte
  • Dipartimento di Informatica, University of Pisa, Italy
Roberto Grossi
  • Dipartimento di Informatica, University of Pisa, Italy
Gianni Liti
  • CNRS UMR 7284, INSERM U 1081, Université Côte d'Azur, France
Giovanna Rosone
  • Dipartimento di Informatica, University of Pisa, Italy
Lorenzo Tattini
  • CNRS UMR 7284, INSERM U 1081, Université Côte d'Azur, France

Veronica Guerrini, Alessio Conte, Roberto Grossi, Gianni Liti, Giovanna Rosone, and Lorenzo Tattini. phyBWT: Alignment-Free Phylogeny via eBWT Positional Clustering. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 23:1-23:19, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2022)


Molecular phylogenetics is a fundamental branch of biology. It studies the evolutionary relationships among the individuals of a population through their biological sequences, and may provide insights into the origin and evolution of viral diseases, or highlight complex evolutionary trajectories. In this paper we develop a method called phyBWT, which uses the extended Burrows-Wheeler Transform (eBWT) of a collection of DNA sequences to reconstruct phylogeny directly, bypassing both alignment against a reference genome and de novo assembly. phyBWT hinges on the combinatorial properties of the eBWT positional clustering framework. We employ the eBWT to detect relevant blocks of the longest shared substrings of varying length (unlike k-mer-based approaches, which need to fix the length k a priori), and build a suitable decomposition leading to a phylogenetic tree, step by step. As a result, phyBWT is a new alignment-, assembly-, and reference-free method that builds a partition tree without relying on the pairwise comparison of sequences, thus avoiding the use of a distance matrix to infer phylogeny. Preliminary experimental results on sequencing data show that our method can handle datasets of different types (short reads, contigs, or entire genomes), producing trees of quality comparable to that found in the benchmark phylogeny.
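
To make the idea of positional clustering concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes the end-marker variant of the multi-string BWT (each sequence terminated by `$`, ties broken by sequence index), builds it by naive suffix sorting, and groups adjacent suffixes that share at least `k_min` characters. The names `suffix_table`, `lcp`, `positional_clusters`, and the threshold `k_min` are hypothetical; the paper's actual cluster definition and the construction of the partition tree are not reproduced here.

```python
def suffix_table(seqs):
    """Return sorted (suffix, seq_id, preceding_char) triples for a collection.

    Assumption: every sequence gets an end-marker '$', and equal suffixes are
    ordered by sequence index. Naive sorting is used; real tools avoid
    materialising the suffixes.
    """
    rows = []
    for i, s in enumerate(seqs):
        t = s + "$"
        for j in range(len(t)):
            # For j == 0, t[-1] is '$', i.e. the symbol cyclically preceding
            # the whole sequence.
            rows.append((t[j:], i, t[j - 1]))
    rows.sort(key=lambda r: (r[0], r[1]))
    return rows


def lcp(a, b):
    """Length of the common prefix of two suffixes, not counting end-markers."""
    n = 0
    for x, y in zip(a, b):
        if x != y or x == "$":
            break
        n += 1
    return n


def positional_clusters(seqs, k_min):
    """Group adjacent rows whose suffixes share a prefix of length >= k_min.

    Returns a list of (shared_prefix, set_of_sequence_ids, bwt_symbols),
    keeping only clusters with at least two rows.
    """
    rows = suffix_table(seqs)
    clusters, start = [], 0
    for r in range(1, len(rows) + 1):
        # Close the current run at the end of the table or when the LCP
        # with the previous row drops below the threshold.
        if r == len(rows) or lcp(rows[r - 1][0], rows[r][0]) < k_min:
            if r - start > 1:
                block = rows[start:r]
                colors = {sid for _, sid, _ in block}
                symbols = "".join(p for _, _, p in block)
                clusters.append((block[0][0][:k_min], colors, symbols))
            start = r
    return clusters


if __name__ == "__main__":
    for prefix, colors, symbols in positional_clusters(["ACGT", "ACGA", "TTGA"], 2):
        print(prefix, sorted(colors), symbols)
```

In this toy setting, the set of sequence identifiers attached to each cluster indicates which sequences share a substring, without comparing sequences pair by pair. Shared substrings longer than `k_min` still fall into a single cluster, so the threshold is a lower bound rather than a fixed length as in k-mer counting.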

Subject Classification

ACM Subject Classification
  • Applied computing → Bioinformatics
  • Mathematics of computing → Combinatorial algorithms

Keywords
  • Phylogeny
  • partition tree
  • BWT
  • positional cluster
  • alignment-free
  • reference-free
  • assembly-free


