phyBWT: Alignment-Free Phylogeny via eBWT Positional Clustering

Authors Veronica Guerrini , Alessio Conte , Roberto Grossi , Gianni Liti , Giovanna Rosone , Lorenzo Tattini



PDF
Thumbnail PDF

File

LIPIcs.WABI.2022.23.pdf
  • Filesize: 1.97 MB
  • 19 pages

Document Identifiers

Author Details

Veronica Guerrini
  • Dipartimento di Informatica, University of Pisa, Italy
Alessio Conte
  • Dipartimento di Informatica, University of Pisa, Italy
Roberto Grossi
  • Dipartimento di Informatica, University of Pisa, Italy
Gianni Liti
  • CNRS UMR 7284, INSERM U 1081, Université Côte d'Azur, France
Giovanna Rosone
  • Dipartimento di Informatica, University of Pisa, Italy
Lorenzo Tattini
  • CNRS UMR 7284, INSERM U 1081, Université Côte d'Azur, France

Cite AsGet BibTex

Veronica Guerrini, Alessio Conte, Roberto Grossi, Gianni Liti, Giovanna Rosone, and Lorenzo Tattini. phyBWT: Alignment-Free Phylogeny via eBWT Positional Clustering. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 23:1-23:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.WABI.2022.23

Abstract

Molecular phylogenetics is a fundamental branch of biology. It studies the evolutionary relationships among the individuals of a population through their biological sequences, and may provide insights about the origin and the evolution of viral diseases, or highlight complex evolutionary trajectories. In this paper we develop a method called phyBWT, describing how to use the extended Burrows-Wheeler Transform (eBWT) for a collection of DNA sequences to directly reconstruct phylogeny, bypassing the alignment against a reference genome or de novo assembly. Our phyBWT hinges on the combinatorial properties of the eBWT positional clustering framework. We employ eBWT to detect relevant blocks of the longest shared substrings of varying length (unlike the k-mer-based approaches that need to fix the length k a priori), and build a suitable decomposition leading to a phylogenetic tree, step by step. As a result, phyBWT is a new alignment-, assembly-, and reference-free method that builds a partition tree without relying on the pairwise comparison of sequences, thus avoiding to use a distance matrix to infer phylogeny. The preliminary experimental results on sequencing data show that our method can handle datasets of different types (short reads, contigs, or entire genomes), producing trees of quality comparable to that found in the benchmark phylogeny.

Subject Classification

ACM Subject Classification
  • Applied computing → Bioinformatics
  • Mathematics of computing → Combinatorial algorithms
Keywords
  • Phylogeny
  • partition tree
  • BWT
  • positional cluster
  • alignment-free
  • reference-free
  • assembly-free

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1):53-86, 2004. URL: https://doi.org/10.1016/S1570-8667(03)00065-0.
  2. H.-J. Bandelt and A. W. M. Dress. A canonical decomposition theory for metrics on a finite set. Advances in mathematics, 92(1):47-105, 1992. Google Scholar
  3. H.-J. Bandelt and A. W. M. Dress. Split decomposition: A new and useful approach to phylogenetic analysis of distance data. Molecular Phylogenetics and Evolution, 1(3):242-252, 1992. URL: https://doi.org/10.1016/1055-7903(92)90021-8.
  4. H.-J. Bandelt, K. T. Huber, J. H. Koolen, V. Moulton, and A. Spillner. Basic Phylogenetic Combinatorics. Cambridge University Press, 2012. URL: http://www.cambridge.org/de/knowledge/isbn/item6439332/.
  5. M.J. Bauer, A.J. Cox, and G. Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci., 483(0):134-148, 2013. URL: https://doi.org/10.1016/j.tcs.2012.02.002.
  6. A. M Bolger, M. Lohse, and B. Usadel. Trimmomatic: a flexible trimmer for illumina sequence data. Bioinformatics, 30(15):2114-20, 2014. URL: https://doi.org/10.1093/bioinformatics/btu170.
  7. P. Bonizzoni, G. Della Vedova, Y. Pirola, M. Previtali, and R. Rizzi. Multithread Multistring Burrows-Wheeler Transform and Longest Common Prefix Array. Journal of computational biology, 26(9):948-961, 2019. URL: https://doi.org/10.1089/cmb.2018.0230.
  8. C. Boucher, D. Cenzato, Z. Lipták, M. Rossi, and M. Sciortino. Computing the original ebwt faster, simpler, and with less memory. In SPIRE, pages 129-142. Springer International Publishing, 2021. Google Scholar
  9. M. Burrows and D.J. Wheeler. A Block Sorting data Compression Algorithm. Technical report, DIGITAL System Research Center, 1994. Google Scholar
  10. A.J. Cox, F. Garofalo, G. Rosone, and M. Sciortino. Lightweight LCP construction for very large collections of strings. J. Discrete Algorithms, 37:17-33, 2016. URL: https://doi.org/10.1016/j.jda.2016.03.003.
  11. M. A. Crosby, J. L. Goodman, V. B. Strelets, P. Zhang, W. M. Gelbart, and The FlyBase Consortium. FlyBase: genomes by the dozen. Nucleic Acids Research, 35(suppl.1):D486-D491, November 2006. Google Scholar
  12. M. D'Angiolo, M. De Chiara, J.-X. Yue, A. Irizar, S. Stenberg, K. Persson, A. Llored, B. Barré, J. Schacherer, R. Marangoni, E. Gilson, J. Warringer, and G. Liti. A yeast living ancestor reveals the origin of genomic introgressions. Nature, 587(7834):420-425, November 2020. Google Scholar
  13. L. Egidi, F. A. Louza, G. Manzini, and G. P. Telles. External memory BWT and LCP computation for sequence collections with applications. Algorithms for Molecular Biology, 14(1):6:1-6:15, 2019. URL: https://doi.org/10.1186/s13015-019-0140-0.
  14. H. Fan, A. R Ives, Y. Surget-Groba, and C. H Cannon. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC genomics, 16(1):1-18, 2015. Google Scholar
  15. J. F. Finke, D. M. Winget, A. M. Chan, and C. Suttle. Variation in the genetic repertoire of viruses infecting micromonas pusilla reflects horizontal gene transfer and links to their environmental distribution. Viruses, 9, 2017. Google Scholar
  16. Travis Gagie, Juha Kärkkäinen, Gonzalo Navarro, and Simon J. Puglisi. Colored range queries and document retrieval. Theoretical Computer Science, 483:36-50, 2013. URL: https://doi.org/10.1016/j.tcs.2012.08.004.
  17. B. Gallone, J. Steensels, S. Mertens, M. C. Dzialo, J. L. Gordon, R. Wauters, F. A. Theßeling, F. Bellinazzo, V. Saels, B. Herrera-Malaver, T. Prahl, C. White, M. Hutzler, F. Meußdoerffer, P. Malcorps, B. Souffriau, L. Daenen, G. Baele, S. Maere, and K. J. Verstrepen. Interspecific hybridization facilitates niche adaptation in beer yeast. Nat Ecol Evol, 3(11):1562-1575, November 2019. Google Scholar
  18. V. Guerrini., F. Louza., and G. Rosone. Lossy Compressor Preserving Variant Calling through Extended BWT. In BIOSTEC/BIOINFORMATICS, pages 38-48. INSTICC, SciTePress, 2022. URL: https://doi.org/10.5220/0010834100003123.
  19. V. Guerrini, F.A. Louza, and G. Rosone. Metagenomic analysis through the extended Burrows-Wheeler transform. BMC Bioinformatics, 21, 2020. URL: https://doi.org/10.1186/s12859-020-03628-w.
  20. D. H. Huson and D. Bryant. Application of Phylogenetic Networks in Evolutionary Studies. Molecular Biology and Evolution, 23(2):254-267, October 2005. Google Scholar
  21. J. Jansson and W.-K. Sung. Algorithms for Combining Rooted Triplets into a Galled Phylogenetic Network, pages 48-52. Springer New York, New York, NY, 2016. URL: https://doi.org/10.1007/978-1-4939-2864-4_92.
  22. J. Jansson and W.-K. Sung. Maximum Agreement Supertree, pages 1224-1227. Springer New York, New York, NY, 2016. URL: https://doi.org/10.1007/978-1-4939-2864-4_222.
  23. F. A. Louza, Guilherme P. Telles, Simon Gog, Nicola Prezza, and G. Rosone. gsufsort: constructing suffix arrays, lcp arrays and bwts for string collections. Algorithms for Molecular Biology, 15, 2020. Google Scholar
  24. Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. In ACM-SIAM SODA, pages 319-327, 1990. Google Scholar
  25. S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino. An extension of the Burrows-Wheeler Transform. Theoret. Comput. Sci., 387(3):298-312, 2007. Google Scholar
  26. S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino. A new combinatorial approach to sequence comparison. Theory Comput. Syst., 42(3):411-429, 2008. URL: https://doi.org/10.1007/s00224-007-9078-6.
  27. J. Peter, M. De Chiara, A. Friedrich, J.-X. Yue, D. Pflieger, A. Bergström, A. Sigwalt, B. Barre, K. Freel, A. Llored, C. Cruaud, K. Labadie, J.-M. Aury, B. Istace, K. Lebrigand, P. Barbry, S. Engelen, A. Lemainque, P. Wincker, and J. Schacherer. Genome evolution across 1,011 saccharomyces cerevisiae isolates. Nature, 556, April 2018. URL: https://doi.org/10.1038/s41586-018-0030-5.
  28. N. Prezza, N. Pisanti, M. Sciortino, and G. Rosone. SNPs detection by eBWT positional clustering. Algorithms for Molecular Biology, 14(1):3, 2019. URL: https://doi.org/10.1186/s13015-019-0137-8.
  29. N. Prezza, N. Pisanti, M. Sciortino, and G. Rosone. Variable-order reference-free variant discovery with the Burrows-Wheeler transform. BMC Bioinformatics, 21, 2020. URL: https://doi.org/10.1186/s12859-020-03586-3.
  30. N. Prezza and G. Rosone. Space-efficient construction of compressed suffix trees. Theoretical Computer Science, 852:138-156, 2021. URL: https://doi.org/10.1016/j.tcs.2020.11.024.
  31. A. Rempel and R. Wittler. SANS serif: alignment-free, whole-genome-based phylogenetic reconstruction. Bioinformatics, 37(24):4868-4870, 2021. URL: https://doi.org/10.1093/bioinformatics/btab444.
  32. N. Saitou and M. Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution, 4(4):406-425, 1987. Google Scholar
  33. S. M Soucy, J. Huang, and J. P. Gogarten. Horizontal gene transfer: building the web of life. Nat Rev Genet, 16(8):472-82, August 2015. Google Scholar
  34. L. Tattini, N. Tellini, S. Mozzachiodi, M. D'Angiolo, S. Loeillet, A. Nicolas, and G. Liti. Accurate tracking of the mutational landscape of diploid hybrid genomes. Mol Biol Evol, August 2019. Google Scholar
  35. S. Vinga. Alignment-free methods in computational biology, 2014. Google Scholar
  36. S. Vinga and J. Almeida. Alignment-free sequence comparison—a review. Bioinformatics, 19(4):513-523, 2003. URL: https://doi.org/10.1093/bioinformatics/btg005.
  37. T. Warnow. Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation. Cambridge University Press, 2017. Google Scholar
  38. R. Wittler. Alignment- and Reference-Free Phylogenomics with Colored de Bruijn Graphs. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019), volume 143, pages 2:1-2:14, Dagstuhl, Germany, 2019. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. URL: https://doi.org/10.4230/LIPIcs.WABI.2019.2.
  39. L. Yang, X. Zhang, and T. Wang. The Burrows-Wheeler similarity distribution between biological sequences based on Burrows-Wheeler transform. Journal of Theoretical Biology, 262(4):742-749, 2010. URL: https://doi.org/10.1016/j.jtbi.2009.10.033.
  40. Z. Yang and B. Rannala. Molecular phylogenetics: Principles and practice. Nature reviews. Genetics, 13:303-14, March 2012. URL: https://doi.org/10.1038/nrg3186.
  41. J.-X. Yue, J. Li, L. Aigrain, J. Hallin, K. Persson, K. Oliver, A. Bergström, P. Coupland, J. Warringer, M. C. Lagomarsino, G. Fischer, R. Durbin, and G. Liti. Contrasting evolutionary genome dynamics between domesticated and wild yeasts. Nature Genetics, 49(6):913-924, 2017. URL: https://doi.org/10.1038/ng.3847.
  42. A. Zielezinski, S. Vinga, J. Almeida, and W. Karlowski. Alignment-free sequence comparison: Benefits, applications, and tools. Genome Biology, 18:186, October 2017. URL: https://doi.org/10.1186/s13059-017-1319-7.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail