WGSUniFrac: Applying UniFrac Metric to Whole Genome Shotgun Data

Authors: Wei Wei, David Koslicki

Author Details

Wei Wei
  • The Pennsylvania State University, University Park, PA, USA
David Koslicki
  • Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA
  • Department of Biology, The Pennsylvania State University, University Park, PA, USA
  • Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA

Cite As

Wei Wei and David Koslicki. WGSUniFrac: Applying UniFrac Metric to Whole Genome Shotgun Data. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 15:1-15:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)


The UniFrac metric has proven useful in revealing diversity across metagenomic communities. Because the measurement is phylogeny-based, UniFrac has historically been applied only to 16S rRNA data. Meanwhile, Whole Genome Shotgun (WGS) metagenomics has become increasingly widely used and has been shown to provide more information than 16S data, but a UniFrac-like diversity metric suitable for WGS data has not previously been developed. The main obstacle to applying UniFrac directly to WGS data is the absence of phylogenetic distances in the taxonomic relationships derived from WGS data. In this study, we demonstrate a method that overcomes this intrinsic difference and computes the UniFrac metric on WGS data by assigning branch lengths to the taxonomic tree obtained from input taxonomic profiles. We conduct a series of experiments to demonstrate that this WGSUniFrac method is comparably robust to traditional 16S UniFrac and is not highly sensitive to branch length assignments, be they data-derived or model-prescribed.
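
The idea of assigning branch lengths to a taxonomic tree and then computing UniFrac can be illustrated with a minimal sketch. This is not the authors' implementation: the input format (a dict mapping a lineage tuple to an abundance), the function names, and the default rule of giving every branch length 1 are all assumptions made for illustration. The distance is computed in the earth mover's form of weighted UniFrac, as the sum over branches of branch length times the absolute difference in the relative abundance sitting below that branch.

```python
from collections import defaultdict


def subtree_mass(profile):
    """Relative abundance below each node of the taxonomic tree.

    `profile` maps a lineage tuple, e.g. ("Bacteria", "Firmicutes"),
    to a non-negative abundance (hypothetical input format).
    A node is identified by its lineage prefix.
    """
    total = sum(profile.values())
    mass = defaultdict(float)
    for lineage, abundance in profile.items():
        # Every ancestor (prefix) of a taxon carries that taxon's abundance.
        for depth in range(1, len(lineage) + 1):
            mass[lineage[:depth]] += abundance / total
    return mass


def wgs_unifrac_sketch(profile_a, profile_b, branch_length=lambda depth: 1.0):
    """Weighted UniFrac-style distance on a taxonomic tree.

    `branch_length(depth)` is a model-prescribed length for the branch
    leading into a node at the given depth; the constant 1.0 default is
    an assumption of this sketch, not a value taken from the paper.
    """
    mass_a = subtree_mass(profile_a)
    mass_b = subtree_mass(profile_b)
    distance = 0.0
    for node in set(mass_a) | set(mass_b):
        diff = abs(mass_a.get(node, 0.0) - mass_b.get(node, 0.0))
        distance += branch_length(len(node)) * diff
    return distance


if __name__ == "__main__":
    sample_1 = {("Bacteria", "Firmicutes"): 0.5,
                ("Bacteria", "Proteobacteria"): 0.5}
    sample_2 = {("Bacteria", "Proteobacteria"): 1.0}
    print(wgs_unifrac_sketch(sample_1, sample_2))
```

Passing a different `branch_length` function (for example, one that shrinks with depth) is a simple way to probe how sensitive the resulting distances are to the choice of branch lengths.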

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
  • Applied computing → Bioinformatics
  • Applied computing → Computational genomics

Keywords
  • UniFrac
  • beta-diversity
  • Whole Genome Shotgun
  • microbial community similarity

