Reconstructing the Tree of Life (Fitting Distances by Tree Metrics) (Invited Paper)

Thorup, Mikkel

doi:10.4230/LIPIcs.SWAT.2022.3

File

LIPIcs.SWAT.2022.3.pdf

Filesize: 412 kB
2 pages

Document Identifiers

DOI: 10.4230/LIPIcs.SWAT.2022.3
URN: urn:nbn:de:0030-drops-161631

Author Details

Mikkel Thorup

BARC, Department of Computer Science, University of Copenhagen, Denmark

Cite AsGet BibTex

Mikkel Thorup. Reconstructing the Tree of Life (Fitting Distances by Tree Metrics) (Invited Paper). In 18th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 227, pp. 3:1-3:2, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.SWAT.2022.3

Abstract

We consider the numerical taxonomy problem of fitting an S× S distance matrix D with a tree metric T. Here T is a weighted tree spanning S where the path lengths in T induce a metric on S. If there is a tree metric matching D exactly, then it is easily found. If there is no exact match, then for some k, we want to minimize the L_k norm of the errors, that is, pick T so as to minimize ‖D-T‖_k = (∑_{i,j ∈ S} |D(i,j)-T(i,j)|^k) ^{1/k}. This problem was raised in biology in the 1960s for k = 1,2. The biological interpretation is that T represents a possible evolution behind the species in S matching some measured distances in D. Sometimes, it is required that T is an ultrametric, meaning that all species are at the same distance from the root. An evolutionary tree induces a hierarchical classification of species and this is not just tied to biology. Medicine, ecology and linguistics are just some of the fields where this concept appears, and it is even an integral part of machine learning and data science. Fundamentally, if we can approximate distances with a tree, then they are much easier to reason about: many questions that are NP-hard for general metrics can be answered in linear time on tree metrics. In fact, humans have appreciated hierarchical classifications at least since Plato and Aristotle (350 BC). The numerical taxonomy problem is important in practice and many heuristics have been proposed. In this talk we will review the basic algorithmic theory, results and techniques, for the problem, including the most recent result from FOCS'21 [Vincent Cohen-Addad et al., 2021]. They paint a varied landscape with big differences between different moments, and with some very nice open problems remaining. - At STOC'93, Farach, Kannan, and Warnow [Martin Farach et al., 1995] proved that under L_∞, we can find the optimal ultrametric. Almost all other variants of the problem are APX-hard. - At SODA'96, Agarwala, Bafna, Farach, Paterson, and Thorup [Richa Agarwala et al., 1999] showed that for any norm L_k, k ≥ 1, if the best ultrametric can be α-approximated, then the best tree metric can be 3α-approximated. In particular, this implied a 3-approximation for tree metrics under L_∞. - At FOCS'05, Ailon and Charikar [Nir Ailon and Moses Charikar, 2011] showed that for any L_k, k ≥ 1, we can get an approximation factor of O(((log n)(log log n))^{1/k}) for both tree and ultrametrics. Their paper was focused on the L₁ norm, and they wrote "Determining whether an O(1) approximation can be obtained is a fascinating question". - At FOCS'21, Cohen-Addad, Das, Kipouridis, Parotsidis, and Thorup [Vincent Cohen-Addad et al., 2021] showed that indeed a constant factor is possible for L₁ for both tree and ultrametrics. This uses the special structure of L₁ in relation to hierarchies. - The status of L_k is wide open for 1 < k < ∞. All we know is that the approximation factor is between Ω(1) and O((log n)(log log n)).

Subject Classification

ACM Subject Classification

Theory of computation → Facility location and clustering
Theory of computation → Approximation algorithms analysis

Keywords

Numerical taxonomy
computational phylogenetics
hierarchical clustering

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

Richa Agarwala, Vineet Bafna, Martin Farach, Mike Paterson, and Mikkel Thorup. On the approximability of numerical taxonomy (fitting distances by tree metrics). SIAM J. Comput., 28(3):1073-1085, 1999. Announced at SODA 1996.
Nir Ailon and Moses Charikar. Fitting tree metrics: Hierarchical clustering and phylogeny. SIAM J. Comput., 40(5):1275-1291, 2011. Announced at FOCS 2005.
Vincent Cohen-Addad, Debarati Das, Evangelos Kipouridis, Nikos Parotsidis, and Mikkel Thorup. Fitting distances by tree metrics minimizing the total error within a constant factor. In Proc. 62nd IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 468-479, 2021. URL: https://doi.org/10.1109/FOCS52979.2021.00054.
Martin Farach, Sampath Kannan, and Tandy J. Warnow. A robust model for finding optimal evolutionary trees. Algorithmica, 13(1/2):155-179, 1995. Announced at STOC 1993.