
Faster Algorithm for Bounded Tree Edit Distance in the Low-Distance Regime

Tomasz Kociumaka, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
Ali Shahali, Sharif University of Technology, Tehran, Iran
Abstract

The tree edit distance is a natural dissimilarity measure between rooted ordered trees whose nodes are labeled over an alphabet Σ. It is defined as the minimum number of node edits – insertions, deletions, and relabelings – required to transform one tree into the other. The weighted variant assigns costs to edits (based on the node labels involved) and minimizes the total cost rather than the edit count.

The unweighted tree edit distance between two trees of total size n can be computed in 𝒪(n^{2.6857}) time; in contrast, determining the weighted tree edit distance is fine-grained equivalent to the All-Pairs Shortest Paths (APSP) problem and requires n³/2^{Ω(√log n)} time [Nogler, Polak, Saha, Vassilevska Williams, Xu, Ye; STOC’25]. These super-quadratic running times are impractical for large yet similar trees, which motivates the bounded version of the problem: parameterizing the runtime by the distance k enables faster algorithms when k ≪ n.

Prior algorithms for bounded unweighted tree edit distance achieve 𝒪(nk² log n) time [Akmal & Jin; ICALP’21] and 𝒪(n + k⁷ log k) time [Das, Gilbert, Hajiaghayi, Kociumaka, Saha; STOC’23]. For the weighted variant, only an 𝒪(n + k¹⁵)-time algorithm is known [Das, Gilbert, Hajiaghayi, Kociumaka, Saha; STOC’23].

We present an 𝒪(n + k⁶ log k)-time algorithm for bounded tree edit distance in both the weighted and unweighted settings. First, we devise a simpler weighted 𝒪(nk² log n)-time algorithm. Next, we exploit periodic structures in the input trees via an optimized universal kernel: we modify the prior 𝒪(n)-time kernels, which output equivalent instances of size 𝒪(k⁵), so that they generate instances with such structure, enabling efficient analysis.

Keywords and phrases:
tree edit distance, edit distance, kernelization, dynamic programming
Funding:
Ali Shahali: This work was carried out mostly during a summer internship at the Max Planck Institute for Informatics.
Copyright and License:
© Tomasz Kociumaka and Ali Shahali; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Pattern matching
Related Version:
Full Version: https://arxiv.org/abs/2507.02701 [27]
Editors:
Anne Benoit, Haim Kaplan, Sebastian Wild, and Grzegorz Herman

1 Introduction

The edit distance between two strings – the minimum cost of insertions, deletions, and substitutions needed to transform one string into the other – is one of the most widely used string similarity measures with numerous algorithmic applications. It provides a robust model for comparing sequential data and underpins techniques in fields such as computational biology, natural language processing, and text correction. Nevertheless, many data types exhibit hierarchical rather than linear structure. For example, RNA secondary structures, syntactic parse trees of natural language sentences, and hierarchical representations of documents and code all store information in tree-like forms. One can linearize such data and still use string edit distance, but the resulting alignments disregard the original hierarchical structure, and thus more expressive similarity measures are needed in most scenarios.

First introduced by Selkow [38], tree edit distance generalizes the notion of string edit distance to rooted ordered trees and forests with nodes labeled over an alphabet Σ. It quantifies the dissimilarity between two forests as the minimal cost of a sequence of node edits – insertions, deletions, and relabelings – required to transform one forest into the other. This distance naturally captures both structural and label-based differences and serves as a core primitive in various algorithmic and applied domains, including computational biology [22, 39, 23, 45], analysis of structured data (such as XML and JSON files) [9, 13, 12, 20, 44], image processing [5, 26, 25, 36], and information extraction [16, 48]; see [6, 2] for surveys.

Tai [41] proposed the first polynomial-time algorithm for computing the tree edit distance: a dynamic programming procedure running in 𝒪(n⁶) time, where n is the total number of nodes in the input forests. In the following decades, a sequence of works progressively improved the runtime. Zhang and Shasha [49] introduced an 𝒪(n⁴)-time algorithm, and then Klein [24] brought the complexity down to 𝒪(n³ log n). Subsequently, Demaine, Mozes, Rossman, and Weimann [17] presented an 𝒪(n³)-time solution, whereas Bringmann, Gawrychowski, Mozes, and Weimann [8] proved that a hypothetical n^{3−Ω(1)}-time algorithm would violate the All-Pairs Shortest Paths (APSP) hypothesis in fine-grained complexity. Very recently, Nogler, Polak, Saha, Vassilevska Williams, Xu, and Ye [33] showed that computing the tree edit distance is, in fact, equivalent to the APSP problem, and the n³/2^{Ω(√log n)} running time can be inherited from the state of the art for the latter task [46].

The aforementioned conditional lower bounds apply to the weighted tree edit distance only, where the cost of each edit depends on the involved labels. In the unweighted version with unit costs, a series of recent results achieved truly subcubic running time using fast matrix multiplication. Mao [29] presented an 𝒪(n^{2.9546})-time solution, which was subsequently improved by Dürr [18] to 𝒪(n^{2.9148}) and by Nogler et al. [33] to 𝒪~(n^{(3+ω)/2}) = 𝒪(n^{2.6857}), where ω is the fast matrix multiplication exponent and 𝒪~(·) hides polylog n factors. The state-of-the-art conditional lower bound, inherited from string edit distance [4] and assuming the Orthogonal Vectors Hypothesis, prohibits n^{2−Ω(1)}-time algorithms already if |Σ| = 1.

Bounded Tree Edit Distance.

Even though the decades of research resulted in significantly faster tree edit distance algorithms, they also revealed fundamental challenges that explain the difficulty of this problem. A natural way of circumventing these barriers, paved by classical work for string edit distance [43, 31, 28], is to parameterize the running time not only in terms of the length n of the input forests but also the value k of the computed distance. This version of tree edit distance was originally studied in the unweighted setting only. Touzet [42] adapted the ideas of Zhang and Shasha [49] to derive an 𝒪(nk³)-time algorithm, whereas Akmal and Jin [1] improved the running time to 𝒪(nk² log n) based on the approach of Klein [24]. The latter running time remains the state of the art for medium distances. In the low-distance regime, Das, Gilbert, Hajiaghayi, Kociumaka, Saha, and Saleh [15] achieved an 𝒪~(n+k¹⁵)-time solution and subsequently improved the running time to 𝒪(n+k⁷ log k) [14]. They also studied the weighted version of the problem and presented an 𝒪(n+k¹⁵)-time algorithm assuming that the weight function is normalized (each edit costs at least one unit; this prohibits arbitrary scaling) and satisfies the triangle inequality.

Still, the 𝒪~(n+poly(k)) running times for tree edit distance are much larger than for string edit distance and the related Dyck edit distance problem, which asks for the minimum cost of edits needed to make a given string of parentheses balanced (so that it represents a node-labeled forest). In the unweighted setting, the state-of-the-art running times are 𝒪(n+k²) for string edit distance [28] and 𝒪(n+k^{5.442}) for Dyck edit distance [21]. In the presence of weights, these complexities increase to 𝒪~(n+√(nk³)) ⊆ 𝒪~(n+k³) [10] and 𝒪(n+k¹²) [14], respectively. Thus, tree edit distance is a natural candidate for improvements.

Our Results.

As the main contribution, we bring the time complexity of tree edit distance down to 𝒪~(n+k⁶). Even for unit weights, this improves upon the state of the art whenever n^{1/7} ≤ k ≤ n^{1/4}.

Theorem 1.1.

There exists a deterministic algorithm that, given two forests F and G with n nodes in total, each with a label from an alphabet Σ, and oracle access to a normalized weight function w : (Σ ∪ {ε})² → ℝ≥0 satisfying the triangle inequality, determines the tree edit distance k := 𝗍𝖾𝖽^w(F,G) in 𝒪(n + k⁶ log k) time.

We also prove that the 𝒪(nk² log n) running time of Akmal and Jin [1] remains valid in the presence of weights. This approach gives the best complexity when n^{1/4} ≤ k ≤ n.

Theorem 1.2.

There exists a deterministic algorithm that, given two forests F and G, with n nodes in total, each with a label from an alphabet Σ, and oracle access to a normalized weight function w : (Σ ∪ {ε})² → ℝ≥0, determines k := 𝗍𝖾𝖽^w(F,G) in 𝒪(nk² log n) time.

We believe that the original approach of Akmal and Jin [1] supports the weighted setting with minimal adjustments; nevertheless, we prove Theorem 1.2 using a slightly different algorithm that is better-suited for further optimizations. Both solutions are variants of Klein’s dynamic programming [24] with some states pruned. The main difference is that we view the input forests as balanced sequences of parentheses and cast the tree edit distance as the minimum cost of a string alignment satisfying a certain consistency property. In our interpretation (Section 3), it is almost trivial to see that the classic pruning rules, originally devised to compute string edit distance in 𝒪(nk) time [43, 31], can be reused for tree edit distance in 𝒪(nk² log n) time. In contrast, Akmal and Jin [1, Lemma 9] spend over four pages (on top of the analysis of [24]) to bound the number of states surviving their pruning rules.

Theorem 1.2 combined with the results of [14] lets us derive an 𝒪(n+k⁷ log k)-time algorithm for weighted tree edit distance, matching the previously best running time for unit weights. This is due to the universal kernel that transforms the input forests into equivalent forests of size 𝒪(k⁵) preserving the following capped tree edit distance value:

𝗍𝖾𝖽_{≤k}^w(F,G) = 𝗍𝖾𝖽^w(F,G) if 𝗍𝖾𝖽^w(F,G) ≤ k, and 𝗍𝖾𝖽_{≤k}^w(F,G) = ∞ otherwise.
Theorem 1.3 ([14, Corollary 3.20]).

There exists a linear-time algorithm that, given forests F, G and an integer k ∈ ℤ₊, constructs forests F′, G′ of size 𝒪(k⁵) such that 𝗍𝖾𝖽_{≤k}^w(F′,G′) = 𝗍𝖾𝖽_{≤k}^w(F,G) holds for every normalized weight function w satisfying the triangle inequality.

To bring the time complexity from 𝒪~(n+k⁷) down to 𝒪~(n+k⁶), we adapt both Theorems 1.2 and 1.3. The main novel insight is that the dynamic-programming procedure behind Theorem 1.2 behaves predictably while processing regions with a certain repetitive (periodic) structure. Upon an appropriate relaxation of the invariant that the dynamic-programming values satisfy, processing each repetition of the period can be interpreted as a single min-plus matrix-vector multiplication. When the period repeats many times, the same matrix is used each time, and thus we can raise the matrix to an appropriate power (with fast exponentiation) and then compute a single matrix-vector product; see Section 4 for details. This can be seen as a natural counterpart of an optimization used in [10] for string edit distance.¹

¹ The matrices arising in the string edit distance computation in [10] are 𝒪(k)×𝒪(k) Monge matrices, which allows for computing each matrix-vector product in 𝒪(k) time and each matrix-matrix product in 𝒪(k²) time. In the context of tree edit distance, the Monge property is no longer satisfied, so the complexities increase to 𝒪(k²) and 𝒪(k³), respectively; in fact, computing the (weighted) tree edit distance is, in general, as hard as computing the min-plus product of two n×n arbitrary matrices [33].

Intuitively, the worst-case instances for the kernelization algorithm of Theorem 1.3 are close to having the necessary structure for our optimization to improve the worst-case running time. Unfortunately, the implementation in [14] controls the forest sizes only, and it needs to be modified to keep track of the more subtle instance difficulty measure. The biggest challenge is that the original algorithm proceeds in logarithmically many steps, each shrinking the forests by a constant fraction. In every step, the forests are partitioned into 𝒪(k) pieces of size 𝒪(n/k), and a constant fraction of the pieces is shrunk to size 𝒪(k⁴) each. A partition into pieces of equal “difficulty” would be much more challenging, so we instead take a different approach that lets us implement the reduction in a single step; see Section 5 for details. Notably, the kernelization algorithms for weighted string and Dyck edit distance already have single-step implementations in [14], but they crucially rely on the fact that the unit-weight variants of these problems can be solved faster, unlike for tree edit distance.

Related Work

The classical definition of tree edit distance [38] involves labeled rooted trees (or forests) with node labels. Many other variants have also been studied allowing, among others, unordered [47, 40], unrooted [40], and edge-labeled [3] trees; see also the surveys [2, 6].

The tree edit distance problem becomes easier not only when the computed distance is small but also when the input forests are of small depth [49, 11] or when the approximate distance suffices [3, 7, 37]. There is also a body of work focusing on practical performance and empirical evaluation, including sequential [34, 35] and parallel [19] implementations.

2 Preliminaries

Consistently with [14], we identify forests with the underlying Euler tours, interpreted as balanced strings of parentheses. This representation is at the heart of space-efficient data structures on trees [30, 32], and it allows for a simple definition of tree edits and a seamless transfer of many tools from strings to forests. Notably, Klein’s algorithm [24] already uses substrings of the Euler tours of the input forests to identify the dynamic-programming states.

For an alphabet Σ, let 𝖯_Σ := ⋃_{a∈Σ} {(_a, )_a} denote the set of parentheses with labels in Σ. As formalized next, a forest with node labels over Σ is a balanced string of parentheses in 𝖯_Σ.

Definition 2.1.

The set ℱ_Σ of forests with labels over Σ (balanced strings over 𝖯_Σ) is the smallest subset of 𝖯_Σ^* satisfying the following conditions:

  • ε ∈ ℱ_Σ,

  • FG ∈ ℱ_Σ for every F, G ∈ ℱ_Σ,

  • (_a F )_a ∈ ℱ_Σ for every F ∈ ℱ_Σ and a ∈ Σ.

For a forest F, we define the set of nodes V_F as the set of intervals [i..j] ⊆ [0..|F|) such that F[i] is an opening parenthesis, F[j] is a closing parenthesis, and F(i..j) is balanced. A forest F is a tree if [0..|F|−1] ∈ V_F. For a node u = [i..j] ∈ V_F, we denote the positions of the opening and the closing parenthesis by o(u) := i and c(u) := j. The label of u, that is, the character a ∈ Σ such that F[o(u)] = (_a and F[c(u)] = )_a, is denoted by 𝗅𝖻𝗅(u). A simple inductive argument shows that V_F forms a laminar family and, for every i ∈ [0..|F|), there is a unique node u ∈ V_F such that i ∈ {o(u), c(u)}; we denote this node by 𝗇𝗈𝖽𝖾_F(i). We also write 𝗆𝖺𝗍𝖾_F(i) for the paired parenthesis, that is, 𝗆𝖺𝗍𝖾_F(i) = j if {i, j} = {o(u), c(u)}.
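The 𝗆𝖺𝗍𝖾 mapping and the node set can be computed in linear time with a single stack scan over the parenthesis string. A minimal Python sketch (the token encoding, with '(a' and ')a' standing for (_a and )_a, is our own illustration):

```python
def parse_forest(tokens):
    """Given a forest as a list of tokens like '(a' (opening parenthesis
    labeled 'a') and ')a' (the matching closing parenthesis), return
    mate[i] = position of the paired parenthesis, together with the list
    of nodes as pairs (o(u), c(u))."""
    mate = [0] * len(tokens)
    nodes = []
    stack = []
    for i, tok in enumerate(tokens):
        if tok.startswith('('):
            stack.append(i)
        else:
            o = stack.pop()
            assert tokens[o][1:] == tok[1:], "labels of paired parentheses must agree"
            mate[o], mate[i] = i, o
            nodes.append((o, i))
    assert not stack, "the input string must be balanced"
    return mate, nodes
```

For example, the forest (_a (_b )_b )_a (_c )_c yields mates [3, 2, 1, 0, 5, 4] and nodes [1..2], [0..3], [4..5].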

We say that a node u ∈ V_F is contained in a fragment F[i..j) of a forest F if i ≤ o(u) < c(u) < j; we denote by V_F[i..j) ⊆ V_F the set of nodes contained in F[i..j). Moreover, u enters F[i..j) if o(u) < i ≤ c(u) < j, and u exits F[i..j) if i ≤ o(u) < j ≤ c(u). In either of these two cases, we also say that u straddles F[i..j).

We denote by F⟨i..j⟩ the subforest of F induced by the fragment F[i..j), which we obtain from F by deleting (the parentheses corresponding to) all nodes except for those contained in F[i..j). Alternatively, one can obtain F⟨i..j⟩ from F[i..j) by deleting the opening parenthesis of every node that exits F[i..j) and the closing parenthesis of every node that enters F[i..j).
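The alternative characterization above translates into a one-pass extraction once mates are known; a small illustrative sketch (token encoding as before, function name our own):

```python
def induced_subforest(F, i, j):
    """Subforest induced by the fragment F[i..j): keep exactly the
    parentheses of nodes contained in F[i..j), i.e., drop the single
    parentheses of nodes that straddle the fragment."""
    mate, stack = [0] * len(F), []
    for p, tok in enumerate(F):          # stack scan pairing the parentheses
        if tok[0] == '(':
            stack.append(p)
        else:
            o = stack.pop()
            mate[o], mate[p] = p, o
    # a position survives iff its mate also lies within [i..j)
    return [F[p] for p in range(i, j) if i <= mate[p] < j]
```

On F = (_a (_b )_b )_a with i = 1, j = 4, the opening parenthesis of the entering node a is outside the fragment, so only (_b )_b survives.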

2.1 Tree Edits, Forest Alignments, and Tree Edit Distance

For an alphabet Σ, we define Σ̄ := Σ ∪ {ε}, where ε is the empty string. We say that a function w : Σ̄² → ℝ≥0 is normalized if w(a,a) = 0 and w(a,b) ≥ 1 hold for all distinct a, b ∈ Σ̄.

Tree edit distance is classically defined using elementary edits transforming a forest F ∈ ℱ_Σ:

Node insertion

produces F[0..i) · (_a · F[i..j) · )_a · F[j..|F|) for a balanced fragment F[i..j) and a label a ∈ Σ, at cost w(ε, a).

Node relabeling

produces F[0..o(u)) · (_a · F(o(u)..c(u)) · )_a · F(c(u)..|F|) for a node u ∈ V_F and a label a ∈ Σ, at cost w(𝗅𝖻𝗅(u), a).

Node deletion

produces F[0..o(u)) · F(o(u)..c(u)) · F(c(u)..|F|) for a node u ∈ V_F, at cost w(𝗅𝖻𝗅(u), ε).

The tree edit distance 𝗍𝖾𝖽^w(F,G) of two forests F, G ∈ ℱ_Σ is then defined as the minimum cost of a sequence of edits transforming F into G. In this context, without loss of generality, we can replace w by its transitive closure: if, say, w(a,ε) > w(a,b) + w(b,ε), then, instead of directly deleting a node with label a, it is more beneficial to first change its label to b and only then perform the deletion. When w satisfies the triangle inequality, we are guaranteed that an inserted or relabeled node is never modified (deleted or relabeled) again. Consistently with modern literature [8, 14, 33], we use a more general alignment-based definition of 𝗍𝖾𝖽^w(F,G) that enforces the latter condition even if w does not necessarily satisfy the triangle inequality.

Definition 2.2 (Alignment Graph [10]).

For strings X, Y ∈ Σ^* and a weight function w : Σ̄² → ℝ≥0, we define the alignment graph 𝖠𝖦^w(X,Y) as a grid graph with vertices [0..|X|] × [0..|Y|] and the following directed edges:

  • horizontal edges (x,y) → (x+1,y) of cost w(X[x], ε) for (x,y) ∈ [0..|X|) × [0..|Y|],

  • vertical edges (x,y) → (x,y+1) of cost w(ε, Y[y]) for (x,y) ∈ [0..|X|] × [0..|Y|), and

  • diagonal edges (x,y) → (x+1,y+1) of cost w(X[x], Y[y]) for (x,y) ∈ [0..|X|) × [0..|Y|).

The alignment graph allows for a concise definition of a string alignment.

Definition 2.3 (Alignment).

For strings X, Y ∈ Σ^* and a weight function w : Σ̄² → ℝ≥0, an alignment of X[x..x′) onto Y[y..y′), denoted by 𝒜 : X[x..x′) ↝ Y[y..y′), is a path from (x,y) to (x′,y′) in 𝖠𝖦^w(X,Y), interpreted as a sequence of vertices. The cost 𝖾𝖽_𝒜^w(X[x..x′), Y[y..y′)) of the alignment 𝒜 is the total cost of the edges that belong to 𝒜.

We write 𝐀(X[x..x′), Y[y..y′)) for the set of all alignments of X[x..x′) onto Y[y..y′).

The edges of an alignment 𝒜 ∈ 𝐀(X[x..x′), Y[y..y′)) can be interpreted as follows:

  • If 𝒜 includes an edge (x̂,ŷ) → (x̂+1,ŷ) for some x̂ ∈ [x..x′) and ŷ ∈ [y..y′], then 𝒜 deletes X[x̂], denoted by X[x̂] ↝_𝒜 ε.

  • If 𝒜 includes an edge (x̂,ŷ) → (x̂,ŷ+1) for some x̂ ∈ [x..x′] and ŷ ∈ [y..y′), then 𝒜 inserts Y[ŷ], denoted by ε ↝_𝒜 Y[ŷ].

  • If 𝒜 includes an edge (x̂,ŷ) → (x̂+1,ŷ+1) for some x̂ ∈ [x..x′) and ŷ ∈ [y..y′), then 𝒜 aligns X[x̂] to Y[ŷ], denoted by X[x̂] ↝_𝒜 Y[ŷ]. If X[x̂] ≠ Y[ŷ], then 𝒜 substitutes X[x̂] for Y[ŷ]. If X[x̂] = Y[ŷ], then 𝒜 matches X[x̂] with Y[ŷ].

Insertions, deletions, and substitutions are jointly called character edits.

The weighted edit distance of fragments X[x..x′) and Y[y..y′) with respect to a weight function w : Σ̄² → ℝ≥0 is defined as the minimum cost of an alignment:

𝖾𝖽^w(X[x..x′), Y[y..y′)) = min_{𝒜∈𝐀(X[x..x′),Y[y..y′))} 𝖾𝖽_𝒜^w(X[x..x′), Y[y..y′)).

We often consider alignments of the entire string X onto the entire string Y; we then use the simplified notation 𝐀(X,Y), 𝖾𝖽_𝒜^w(X,Y), and 𝖾𝖽^w(X,Y).
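The minimum-cost path in the alignment graph can be computed by the textbook dynamic program over its vertices; a self-contained sketch (the weight function is passed as a callable, and the function name is our own):

```python
def edit_distance(X, Y, w):
    """Minimum cost of a path from (0,0) to (|X|,|Y|) in the alignment
    graph AG^w(X,Y) of Definition 2.2, computed by dynamic programming."""
    EPS = ''  # stands for the empty string ε
    d = [[0.0] * (len(Y) + 1) for _ in range(len(X) + 1)]
    for x in range(1, len(X) + 1):
        d[x][0] = d[x - 1][0] + w(X[x - 1], EPS)      # horizontal edges
    for y in range(1, len(Y) + 1):
        d[0][y] = d[0][y - 1] + w(EPS, Y[y - 1])      # vertical edges
    for x in range(1, len(X) + 1):
        for y in range(1, len(Y) + 1):
            d[x][y] = min(d[x - 1][y] + w(X[x - 1], EPS),        # delete X[x-1]
                          d[x][y - 1] + w(EPS, Y[y - 1]),        # insert Y[y-1]
                          d[x - 1][y - 1] + w(X[x - 1], Y[y - 1]))  # align
    return d[len(X)][len(Y)]
```

With the unit (Levenshtein) weights w(a,b) = 0 if a = b and 1 otherwise, this recovers the classic string edit distance.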

Definition 2.4 (Forest alignment).

Consider forests F, G ∈ ℱ_Σ. An alignment 𝒜 ∈ 𝐀(F[f..f′), G[g..g′)) is a forest alignment if it satisfies the following consistency condition:

For every two aligned characters F[f̂] ↝_𝒜 G[ĝ], also F[𝗆𝖺𝗍𝖾_F(f̂)] ↝_𝒜 G[𝗆𝖺𝗍𝖾_G(ĝ)].

We write 𝐅𝐀(F[f..f′), G[g..g′)) ⊆ 𝐀(F[f..f′), G[g..g′)) for the set of forest alignments.

 Remark 2.5.

Consider a forest alignment 𝒜 ∈ 𝐅𝐀(F[f..f′), G[g..g′)). Then,

  • 𝒜 deletes every character F[f̂] with f̂ ∈ [f..f′) and 𝗆𝖺𝗍𝖾_F(f̂) ∉ [f..f′),

  • 𝒜 inserts every character G[ĝ] with ĝ ∈ [g..g′) and 𝗆𝖺𝗍𝖾_G(ĝ) ∉ [g..g′).

Define 𝖯̄_Σ = 𝖯_Σ ∪ {ε} and a mapping λ : 𝖯̄_Σ → Σ̄ such that λ((_a) = λ()_a) = a for each a ∈ Σ, and λ(ε) = ε. For a weight function w : Σ̄² → ℝ≥0, we define a corresponding weight function w̃ : 𝖯̄_Σ² → ℝ≥0 so that w̃(p,q) = ½ · w(λ(p), λ(q)) for all p, q ∈ 𝖯̄_Σ. The cost of a forest alignment 𝒜 ∈ 𝐅𝐀(F[f..f′), G[g..g′)) with respect to a weight function w is defined as 𝗍𝖾𝖽_𝒜^w(F[f..f′), G[g..g′)) := 𝖾𝖽_𝒜^w̃(F[f..f′), G[g..g′)). Moreover, we define

𝗍𝖾𝖽^w(F[f..f′), G[g..g′)) = min_{𝒜∈𝐅𝐀(F[f..f′),G[g..g′))} 𝗍𝖾𝖽_𝒜^w(F[f..f′), G[g..g′)).

For a threshold k ≥ 0, we set

𝗍𝖾𝖽_{≤k}^w(F[f..f′), G[g..g′)) = 𝗍𝖾𝖽^w(F[f..f′), G[g..g′)) if this value is at most k, and 𝗍𝖾𝖽_{≤k}^w(F[f..f′), G[g..g′)) = ∞ otherwise.

By Remark 2.5, the following value is non-negative for every 𝒜 ∈ 𝐅𝐀(F[f..f′), G[g..g′)):

𝗍𝖾𝖽~_𝒜^w(F[f..f′), G[g..g′)) := 𝗍𝖾𝖽_𝒜^w(F[f..f′), G[g..g′)) − Σ_{f̂∈[f..f′) : 𝗆𝖺𝗍𝖾_F(f̂)∉[f..f′)} w̃(F[f̂], ε) − Σ_{ĝ∈[g..g′) : 𝗆𝖺𝗍𝖾_G(ĝ)∉[g..g′)} w̃(ε, G[ĝ]).

We naturally generalize this value to 𝗍𝖾𝖽~^w(F[f..f′), G[g..g′)).

Observation 2.6.

For all forests F, G ∈ ℱ_Σ, fragments F[f..f′) and G[g..g′), and weight functions w : Σ̄² → ℝ≥0, we have 𝗍𝖾𝖽~^w(F[f..f′), G[g..g′)) = 𝗍𝖾𝖽^w(F⟨f..f′⟩, G⟨g..g′⟩).

3 𝒪(nk² log n)-Time Algorithm

In this section, we reinterpret Klein’s algorithm [24] and develop its 𝒪(nk² log n)-time variant.

3.1 Klein’s Algorithm

Klein’s algorithm [24] uses dynamic programming to compute 𝗍𝖾𝖽(F⟨lF..rF⟩, G⟨lG..rG⟩) = 𝗍𝖾𝖽~(F[lF..rF), G[lG..rG)) for 𝒪(n log n) selected fragments F[lF..rF) of F and all 𝒪(n²) fragments G[lG..rG) of G. We modify the algorithm slightly so that the computed value 𝗍𝖾𝖽(F[lF..rF), G[lG..rG)) includes the costs of deleting the single parentheses of nodes straddling F[lF..rF) and inserting the single parentheses of nodes straddling G[lG..rG).

Algorithm 1 Klein(lF,rF,lG,rG): Klein’s algorithm for computing the tree edit distance.

The algorithm’s implementation is provided in Algorithm 1. We use an operator x ←min y that assigns x ← y if y < x. We also remember which of these assignments were applied so that we can later trace back an optimal alignment based on this stored information.

In the corner case when F[lF..rF) or G[lG..rG) is empty, the unique (and thus optimal) forest alignment pays for inserting or deleting all characters in the other fragment. If both F[lF..rF) and G[lG..rG) are non-empty, the algorithm considers nodes uF = 𝗇𝗈𝖽𝖾_F(lF), uG = 𝗇𝗈𝖽𝖾_G(lG), vF = 𝗇𝗈𝖽𝖾_F(rF−1), and vG = 𝗇𝗈𝖽𝖾_G(rG−1).

Let us first suppose that uF and vF are contained in F[lF..rF). If the subtree of vF is at least as large as the subtree of uF, that is, 𝗌𝗂𝗓𝖾(vF) ≥ 𝗌𝗂𝗓𝖾(uF), where 𝗌𝗂𝗓𝖾(x) = c(x) − o(x) + 1, then the algorithm considers three possibilities: F[lF] is deleted, G[lG] is inserted, or F[lF] is aligned with G[lG]. In the first two possibilities, we can pay for the deleted or inserted opening parenthesis and align the remaining fragments; see Lines 6–7. In the third possibility, the consistency condition in the definition of forest alignments requires that F[𝗆𝖺𝗍𝖾_F(lF)] is also aligned with G[𝗆𝖺𝗍𝖾_G(lG)]. In particular, the node uG needs to be contained in G[lG..rG) so that 𝗆𝖺𝗍𝖾_G(lG) = c(uG) and 𝗆𝖺𝗍𝖾_F(lF) = c(uF). Thus, we align F(lF..c(uF)) with G(lG..c(uG)), align F(c(uF)..rF) with G(c(uG)..rG), and pay for aligning uF with uG, that is, F[lF] with G[lG] and F[c(uF)] with G[c(uG)]; see Line 9. If 𝗌𝗂𝗓𝖾(vF) < 𝗌𝗂𝗓𝖾(uF), the algorithm handles vF and vG in a symmetric way; see Lines 10–14.

If uF ∉ V_F[lF..rF), we follow Lines 5–9 and effectively consider deleting F[lF] and inserting G[lG]. Else, if vF ∉ V_F[lF..rF), we follow the symmetric Lines 10–14.

Zhang and Shasha [49] use a very similar dynamic programming; in a baseline version, their algorithm always constructs the alignment from left to right instead of picking the side depending on uF and vF. Klein’s optimization allows for an improved running time of 𝒪(n³ log n) instead of 𝒪(n⁴). In the full version [27], we recall the proof of the following lemma.
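To make the recursion concrete, here is a memoized Python sketch for unit weights, with every parenthesis whose mate falls outside the current fragment charged ½ as above. The token encoding and function names are our own, and the sketch only illustrates the recursion; it makes no attempt at Klein's 𝒪(n³ log n) efficiency:

```python
from functools import lru_cache

def tree_edit_distance(F, G):
    """Unit-cost tree edit distance between forests given as balanced token
    lists ('(a' opens a node labeled 'a', ')a' closes it)."""
    def mates(T):
        mate, stack = [0] * len(T), []
        for i, tok in enumerate(T):
            if tok[0] == '(':
                stack.append(i)
            else:
                o = stack.pop()
                mate[o], mate[i] = i, o
        return mate

    mF, mG = mates(F), mates(G)

    @lru_cache(maxsize=None)
    def dp(lF, rF, lG, rG):
        if lF == rF:
            return 0.5 * (rG - lG)      # insert all remaining parentheses of G
        if lG == rG:
            return 0.5 * (rF - lF)      # delete all remaining parentheses of F
        uF_in = F[lF][0] == '(' and mF[lF] < rF           # leftmost node contained?
        vF_in = F[rF - 1][0] == ')' and mF[rF - 1] >= lF  # rightmost node contained?
        if not uF_in:
            go_left = True
        elif not vF_in:
            go_left = False
        else:  # both contained: process the end whose node has the smaller subtree
            go_left = (rF - 1) - mF[rF - 1] >= mF[lF] - lF
        if go_left:
            best = min(0.5 + dp(lF + 1, rF, lG, rG),      # delete F[lF]
                       0.5 + dp(lF, rF, lG + 1, rG))      # insert G[lG]
            if uF_in and G[lG][0] == '(' and mG[lG] < rG:
                # align the node opened at lF with the node opened at lG;
                # consistency forces their closing parentheses to be aligned too
                sub = 0 if F[lF][1:] == G[lG][1:] else 1
                best = min(best, sub + dp(lF + 1, mF[lF], lG + 1, mG[lG])
                                     + dp(mF[lF] + 1, rF, mG[lG] + 1, rG))
        else:
            best = min(0.5 + dp(lF, rF - 1, lG, rG),      # delete F[rF-1]
                       0.5 + dp(lF, rF, lG, rG - 1))      # insert G[rG-1]
            if vF_in and G[rG - 1][0] == ')' and mG[rG - 1] >= lG:
                sub = 0 if F[rF - 1][1:] == G[rG - 1][1:] else 1
                best = min(best, sub + dp(lF, mF[rF - 1], lG, mG[rG - 1])
                                     + dp(mF[rF - 1] + 1, rF - 1,
                                          mG[rG - 1] + 1, rG - 1))
        return best

    return dp(0, len(F), 0, len(G))
```

Note that either side choice yields a correct value (each branch is a complete case analysis at one end of the fragment); the size comparison only affects the number of visited states.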

Lemma 3.1.

The recursive implementation of Algorithm 1 visits 𝒪(n log n) fragments of F.

Akmal and Jin’s Algorithm.

Akmal and Jin [1] reduce the time complexity of Klein’s algorithm to 𝒪(nk² log n) when computing 𝗍𝖾𝖽_{≤k}(F,G). Their solution prunes some dynamic-programming states so that, in the surviving states, the difference between the sizes of the subforests F[lF..rF) and G[lG..rG) is at most k. This condition also holds for the difference between the sizes of F[0..rF) ∖ F[lF..rF) and G[0..rG) ∖ G[lG..rG), as well as between F[lF..|F|) ∖ F[lF..rF) and G[lG..|G|) ∖ G[lG..rG); see [1, Lemma 12]. As shown in [1, Lemma 13], for each subforest F[lF..rF), there are 𝒪(k²) subforests G[lG..rG) satisfying these three conditions. With a careful implementation, the total number of states becomes 𝒪(nk² log n).

3.2 Our Algorithm

Our variant of Klein’s algorithm simply prunes all states with |lF − lG| > 2k or |rF − rG| > 2k. In other words, we modify Algorithm 1 so that 𝖽𝗉[lF, rF, lG, rG] is set to ∞ if either condition holds, and the existing instructions are executed otherwise. This does not increase the number 𝒪(n log n) of visited fragments F[lF..rF); see Lemma 3.1. For each of these fragments, trivially, at most 𝒪(k²) fragments of G survive pruning, so the total number of states is 𝒪(nk² log n). The correctness of the pruning rules follows from the fact that, for all forests F, G, the width of any alignment in 𝐅𝐀(F,G) does not exceed twice its cost, where the width of an alignment 𝒜 ∈ 𝐀(X[x..x′), Y[y..y′)) is defined as 𝗐𝗂𝖽𝗍𝗁(𝒜) = max{|x̂ − ŷ| : (x̂, ŷ) ∈ 𝒜}.

In the full version [27], we formalize this intuition using the notion of bounded forest alignments.

Definition 3.2.

Let us fix a threshold k ∈ ℤ₊ and consider fragments F[f..f′) and G[g..g′) of forests F, G ∈ ℱ_Σ. We call a forest alignment 𝒜 ∈ 𝐅𝐀(F[f..f′), G[g..g′)) bounded if 𝗐𝗂𝖽𝗍𝗁(𝒜) ≤ 2k. The family of bounded forest alignments is 𝐁𝐅𝐀_k(F[f..f′), G[g..g′)) ⊆ 𝐅𝐀(F[f..f′), G[g..g′)). For a weight function w : Σ̄² → ℝ≥0, we denote

𝖻𝗍𝖾𝖽_k^w(F[f..f′), G[g..g′)) = min_{𝒜∈𝐁𝐅𝐀_k(F[f..f′),G[g..g′))} 𝗍𝖾𝖽_𝒜^w(F[f..f′), G[g..g′)).
Lemma 3.3.

The 𝖽𝗉 values computed using the pruned version of Algorithm 1 satisfy

𝗍𝖾𝖽^w(F[lF..rF), G[lG..rG)) ≤ 𝖽𝗉[lF, rF, lG, rG] ≤ 𝖻𝗍𝖾𝖽_k^w(F[lF..rF), G[lG..rG)).

In particular, the 𝖽𝗉[0, |F|, 0, |G|] entry stores a value between 𝗍𝖾𝖽^w(F,G) and 𝖻𝗍𝖾𝖽_k^w(F,G). The following observation implies that this is enough to retrieve 𝗍𝖾𝖽_{≤k}^w(F,G).

Observation 3.4.

If 𝗍𝖾𝖽^w(F,G) ≤ k, then 𝗍𝖾𝖽^w(F,G) = 𝖻𝗍𝖾𝖽_k^w(F,G).

Proof.

Consider an optimal forest alignment 𝒜 ∈ 𝐅𝐀(F,G) with 𝗍𝖾𝖽_𝒜^w(F,G) ≤ k. For each edge (f,g) → (f′,g′), either f − g = f′ − g′ (if the edge is diagonal) or |(f−g) − (f′−g′)| = 1 and the edge cost is at least ½ (if the edge is vertical or horizontal). Since (0,0) ∈ 𝒜 and the total cost of the edges in 𝒜 is at most k, every (f,g) ∈ 𝒜 satisfies |f − g| ≤ 2k, i.e., 𝗐𝗂𝖽𝗍𝗁(𝒜) ≤ 2k and 𝒜 ∈ 𝐁𝐅𝐀_k(F,G). Hence, 𝖻𝗍𝖾𝖽_k^w(F,G) ≤ 𝗍𝖾𝖽_𝒜^w(F,G) = 𝗍𝖾𝖽^w(F,G). The converse inequality 𝗍𝖾𝖽^w(F,G) ≤ 𝖻𝗍𝖾𝖽_k^w(F,G) holds trivially due to 𝐁𝐅𝐀_k(F,G) ⊆ 𝐅𝐀(F,G).

With minor implementation details needed to avoid logarithmic overheads for memoization using a sparse table 𝖽𝗉 (Akmal and Jin [1] ignore this issue), we achieve the following result.

Theorem 1.2. [Restated, see original statement.]

There exists a deterministic algorithm that, given two forests F and G, with n nodes in total, each with a label from an alphabet Σ, and oracle access to a normalized weight function w : (Σ ∪ {ε})² → ℝ≥0, determines k := 𝗍𝖾𝖽^w(F,G) in 𝒪(nk² log n) time.

4 Faster Algorithm for Repetitive Inputs

In this section, we present an optimized version of our 𝒪(nk2logn)-time algorithm capable of exploiting certain repetitive structures within the input forests F and G. The following notion of free pairs captures the structure that our algorithm is able to utilize.

Definition 4.1 (Free pair, free block).

Consider forests F, G ∈ ℱ_Σ and a fixed threshold k ∈ ℤ₊. We call a pair of fragments F[pF..qF) and G[pG..qG) a free pair if

  • |pF − pG| ≤ 2k, and

  • there exists a balanced string R ∈ ℱ_Σ with 4k ≤ |R| < 8k such that F[pF−|R|..qF+|R|) = R^{e+2} = G[pG−|R|..qG+|R|) holds for some integer exponent e ∈ ℤ₊.

For that free pair, we call the fragment F[pF..qF) a free block with period R and exponent e.

Our improved algorithm assumes that the input forests F and G are augmented with a collection 𝐅 of disjoint free blocks F[pF..qF), each associated with the underlying period R, exponent e, and the corresponding fragment G[pG..qG). The speed-up compared to the algorithm of Section 3 is noticeable if the free blocks in 𝐅 jointly cover most of the characters of F, that is, the number of remaining non-free characters is asymptotically smaller than |F|.
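The conditions of Definition 4.1 are straightforward to verify for a given candidate free pair; a small sketch over flat character strings (the checker itself and the flat-string encoding are our own illustration, and the caller is assumed to guarantee that R is balanced):

```python
def is_free_pair(F, G, pF, qF, pG, qG, R, e, k):
    """Check the conditions of Definition 4.1 for a candidate free pair:
    |pF - pG| <= 2k, 4k <= |R| < 8k, and the padded occurrences of R^(e+2)
    around F[pF..qF) and G[pG..qG) coincide."""
    if abs(pF - pG) > 2 * k:
        return False
    if not 4 * k <= len(R) < 8 * k:
        return False
    # the padded occurrence R^{e+2} must fit inside each forest
    if pF - len(R) < 0 or qF + len(R) > len(F):
        return False
    if pG - len(R) < 0 or qG + len(R) > len(G):
        return False
    padded = R * (e + 2)
    return (F[pF - len(R): qF + len(R)] == padded ==
            G[pG - len(R): qG + len(R)])
```

For instance, with k = 2 and a balanced period R of length 8, a block of three copies of R inside a run of five copies forms a free pair, whereas the same block fails for k = 1 because |R| < 8k is violated.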

Theorem 4.2.

There exists a deterministic algorithm that, given forests F, G ∈ ℱ_Σ of total length n, oracle access to a normalized weight function w : Σ̄² → ℝ≥0, a threshold k ∈ ℤ₊, and a collection 𝐅 of t disjoint free blocks in F such that m characters of F are not contained in any free block, computes 𝗍𝖾𝖽_{≤k}^w(F,G) in 𝒪(n log n + mk² log n + tk³ log n) time.

Intuitively, the algorithm skips all the free blocks, which reduces the 𝒪(nk² log n) running time of Theorem 1.2 to 𝒪(mk² log n): the algorithm needs to pay for non-free characters only. Nevertheless, processing each free block takes 𝒪(k³ log n) extra time, and 𝒪(n log n) preprocessing is still needed to avoid overheads for memoization.

Processing Free Blocks.

To understand our speed-up, let us consider a free pair with period R and exponent e, that is, F[pF..qF) = R^e = G[pG..qG). We picked the parameters so that |R| ≥ 4k ≥ 𝗐𝗂𝖽𝗍𝗁(𝒜) + |pF − pG|, and thus every alignment 𝒜 ∈ 𝐁𝐅𝐀_k(F,G) aligns F[pF..qF) = R^e onto a fragment G[p′G..q′G) contained in G[pG−|R|..qG+|R|) = R^{e+2}. By the same argument, the image of every copy of R within F[pF..qF) = R^e is contained within the corresponding copy of R³ within G[pG−|R|..qG+|R|) = R^{e+2}. Moreover, when we align F[pF..qF) to G[p′G..q′G) ⊆ G[pG−|R|..qG+|R|), it suffices to partition G[p′G..q′G) into e fragments and independently optimally align each copy of R in F[pF..qF) with the corresponding fragment of G[p′G..q′G). This is because R is balanced, so every node of R^e is contained within a single copy of R, and thus the consistency condition in Definition 2.4 does not impose any constraints affecting multiple copies of R.

The optimal costs of aligning R with the relevant fragments of R³ can be encoded in the following matrix, constructible in 𝒪(|R|³ log|R|) time using Algorithm 1 (Klein’s algorithm).

Definition 4.3.

For a balanced string R ∈ ℱ_Σ, we define a matrix M_R of size (2|R|+1) × (2|R|+1), with indices i, j in the range [−|R|..|R|], as follows:

M_R[i,j] = 𝗍𝖾𝖽^w(R, R³[|R|+i..2|R|+j)) if |R|+i ≤ 2|R|+j, and M_R[i,j] = ∞ otherwise.

In order to derive the optimal costs of aligning R^e = F[pF..qF) with the relevant fragments of R^{e+2} = G[pG−|R|..qG+|R|), we simply compute the e-th power of M_R with respect to the min-plus product. Formally, the min-plus product of matrices A and B, with entries in ℝ≥0 ∪ {∞} and indexed by I×J and J×K, respectively, is a matrix C indexed by I×K such that C[i,k] = min_{j∈J} (A[i,j] + B[j,k]) for (i,k) ∈ I×K.

For each free block F[pF..qF) = R^e ∈ 𝐅, we construct the matrix M_R^e in 𝒪(k³ log n) time using Algorithm 1 followed by fast exponentiation, which reduces to computing 𝒪(log e) min-plus products. We apply the matrix whenever we are tasked with filling a 𝖽𝗉[lF, rF, lG, rG] entry such that F[pF..qF) is a prefix or a suffix of F[lF..rF). Formally, if lF ≤ pF < qF = rF, then we use the following formula instead of following Algorithm 1:

𝖽𝗉[lF, rF, lG, rG] ←min min_{p′G∈[pF−|R|..pF+|R|]} (𝖽𝗉[lF, pF, lG, p′G] + M_R^e[p′G − pG, rG − qG]). (1)

In the symmetric case of lF = pF < qF ≤ rF, we apply

𝖽𝗉[lF, rF, lG, rG] ←min min_{q′G∈[qF−|R|..qF+|R|]} (𝖽𝗉[qF, rF, q′G, rG] + M_R^e[lG − pG, q′G − qG]). (2)

In the full version [27], we formalize the intuition above to prove that Equations (1) and (2) preserve the invariant of Lemma 3.3, that is,

𝗍𝖾𝖽^w(F[lF..rF), G[lG..rG)) ≤ 𝖽𝗉[lF, rF, lG, rG] ≤ 𝖻𝗍𝖾𝖽_k^w(F[lF..rF), G[lG..rG)).

For (1), this boils down to the following lemma; the case of (2) is symmetric.

Lemma 4.4.

Consider fragments F[lF..rF), G[lG..rG) of forests F, G ∈ ℱ_Σ, a normalized weight function w : Σ̄² → ℝ≥0, and a threshold k ∈ ℤ₊ such that |rF − rG| ≤ 2k. If there is a free pair F[pF..qF) = G[pG..qG) = R^e such that F[pF..qF) is a suffix of F[lF..rF), then

𝖻𝗍𝖾𝖽_k^w(F[lF..rF), G[lG..rG))
≥ min_{p′G∈[pF−|R|..pF+|R|]} (𝖻𝗍𝖾𝖽_k^w(F[lF..pF), G[lG..p′G)) + M_R^e[p′G − pG, rG − qG])
≥ min_{p′G∈[pF−|R|..pF+|R|]} (𝗍𝖾𝖽^w(F[lF..pF), G[lG..p′G)) + M_R^e[p′G − pG, rG − qG])
≥ 𝗍𝖾𝖽^w(F[lF..rF), G[lG..rG)).

We further argue in the full version [27] that the recursive implementation of Algorithm 1 augmented with the optimizations of (1) and (2) visits 𝒪(m log n) fragments F[lF..rF), where m is the number of non-free characters, including 𝒪(t log n) fragments F[lF..rF) for which (1) or (2) applies. Specifically, our optimized algorithm visits the fragments F[lF..rF) that are visited by the original Algorithm 1 and satisfy the following additional property: every free block F[pF..qF) ∈ 𝐅 is either disjoint from F[lF..rF) or contained in F[lF..rF). Our implementation takes 𝒪(n log n) extra preprocessing time to list the fragments visited by Algorithm 1 and filter those satisfying the aforementioned property.

5 Universal Kernel with Improved Repetitiveness Guarantees

In this section, we outline our approach to strengthen Theorem 1.3 into the following result:

Theorem 5.1.

There exists a linear-time algorithm that, given forests F, G and an integer k ∈ ℤ_+, constructs forests F′, G′ of size 𝒪(k^5) such that 𝗍𝖾𝖽^w_{≤k}(F, G) = 𝗍𝖾𝖽^w_{≤k}(F′, G′) holds for every normalized quasimetric w (weight function satisfying the triangle inequality), and a collection of 𝒪(k^3) disjoint free blocks in F′ with 𝒪(k^4) non-free characters.

Compared to Theorem 1.3, we additionally require the presence of 𝒪(k^3) free blocks that jointly capture all but 𝒪(k^4) characters of the output forest F′; consult Definition 4.1. At a very high level, the proofs of both Theorems 1.3 and 5.1 consist of three steps: decomposing the input forests F and G into several pieces, identifying pieces that can be matched exactly, and replacing (pairs of) identical pieces with smaller equivalent counterparts.

Forest Decompositions and Piece Matchings.

Following [14], we say that a piece of a forest F is either a subforest – a balanced fragment F[i..j) – or a context – a pair of fragments F[i..i′); F[j′..j) such that F[i..j) is a tree and F[i′..j′) is balanced. We denote the set of pieces contained in a fragment F[i..j) of F by 𝒫(F[i..j)); we set 𝒫(F) = 𝒫(F[0..|F|)).

In isolation from F, a context can be interpreted as a pair of non-empty strings C = C_L; C_R ∈ 𝖯Σ^+ × 𝖯Σ^+ such that C_L C_R is a tree. The composition of contexts C, D results in a context C ∘ D := C_L D_L; D_R C_R. Moreover, the composition of a context C and a forest H results in a tree C ∘ H := C_L H C_R. For any decomposition of a forest F into disjoint pieces, one can recover F using the concatenation and composition operations.

We define the depth of a context C = C_L; C_R to be the number of nodes of C_L C_R with the opening parenthesis in C_L and the closing parenthesis in C_R. Note that the depth of the composition C ∘ D is equal to the sum of the depths of C and D.
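The concatenation and composition operations above admit a direct toy implementation once forests are represented as balanced-parenthesis strings; the helpers below are an illustrative sketch that ignores node labels (unlabeled `(` and `)` stand in for labeled parentheses):

```python
# A forest is a balanced-parenthesis string; a context is a pair
# (C_L, C_R) such that C_L + C_R is a tree.

def compose_contexts(C, D):
    """C ∘ D = C_L D_L ; D_R C_R."""
    CL, CR = C
    DL, DR = D
    return (CL + DL, DR + CR)

def apply_context(C, H):
    """C ∘ H = C_L H C_R, a tree containing the forest H."""
    CL, CR = C
    return CL + H + CR

def depth(C):
    """Depth of a context: nodes opened in C_L and closed in C_R,
    i.e., the unmatched opening parentheses of C_L."""
    d = 0
    for ch in C[0]:
        d += 1 if ch == '(' else -1
    return d
```

With this representation, the additivity of depth under composition is immediate: the unmatched opens of C_L D_L are exactly those of C_L plus those of D_L.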

The following notion formalizes the concept of a matching between pieces of F and G.

Definition 5.2.

For two forests F and G and a fixed threshold k ∈ ℤ_+, a piece matching between F and G is a set of pairs ℳ ⊆ 𝒫(F) × 𝒫(G) such that:

  • across all pairs (f, g) ∈ ℳ, the pieces f ∈ 𝒫(F) are pairwise disjoint, and

  • there exists a forest alignment 𝒜 ∈ 𝐅𝐀(F, G) of width at most 2k (i.e., 𝒜 ∈ 𝐁𝐅𝐀_k(F, G)) that matches f to g perfectly for every (f, g) ∈ ℳ.

The kernelization algorithm behind Theorem 1.3 repeatedly identifies a piece matching ℳ of size |ℳ| = 𝒪(k) covering Ω(n) vertices of F and replaces each pair of matched pieces (f, g) ∈ ℳ with a pair of “equivalent” pieces (f′, g′) of size 𝒪(k^4). After 𝒪(log n) steps, this yields forests of size 𝒪(k^5). Our strategy relies on the following new result:

Theorem 5.3.

There exists a linear-time algorithm that, given forests F, G ∈ 𝖯Σ^* and a threshold k ∈ ℤ_{≥0}, either certifies that 𝗍𝖾𝖽(F, G) > k or constructs a size-𝒪(k) piece matching between F and G that leaves 𝒪(k^4) unmatched characters.

The proof of Theorem 5.3, presented in the full version [27], reuses a subroutine of [14] to construct a decomposition 𝒟 ⊆ 𝒫(F) of F into 𝒪(n/k^3) pieces of size 𝒪(k^3) each. The next step is to build a piece matching ℳ ⊆ 𝒟 × 𝒫(G) that leaves at most k pieces of 𝒟 unmatched. We modify a dynamic-programming procedure from [14] so that, additionally, the unmatched characters of G form 𝒪(k) fragments. As a result, even though the obtained matching ℳ is of size |ℳ| = 𝒪(n/k^3), it is possible to reduce its size to 𝒪(k). For this, it suffices to repeatedly identify pairs of adjacent pieces f, f′ ∈ 𝒫(F) matched to adjacent pieces g, g′ ∈ 𝒫(G), and then replace these pieces with their unions f ∪ f′ and g ∪ g′ (as formalized in the full version [27], two disjoint pieces are adjacent if their union, consisting of the characters contained in at least one of these pieces, can be interpreted as a piece).
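The merging step can be illustrated on the simplified case where every piece is a subforest, represented as a half-open fragment (start, end); this is only a sketch — the actual procedure must additionally verify that each union is interpretable as a piece, which this interval view glosses over:

```python
def reduce_matching(pairs):
    """Greedily merge matched pairs whose F-pieces and G-pieces are both
    adjacent.  Pieces are simplified to half-open intervals (start, end);
    two intervals are adjacent when one ends where the other begins."""
    pairs = sorted(pairs)  # order by the F-interval
    out = []
    for (f, g) in pairs:
        if out:
            pf, pg = out[-1]
            if pf[1] == f[0] and pg[1] == g[0]:  # adjacent in both forests
                out[-1] = ((pf[0], f[1]), (pg[0], g[1]))
                continue
        out.append((f, g))
    return out
```

A maximal run of pairwise-adjacent matched pieces collapses into a single pair, which is what drives the size of the matching from 𝒪(n/k^3) down toward 𝒪(k) when the unmatched characters form few fragments.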

Periodic Blocks and Red Characters.

The main ingredient of our kernelization algorithm is a procedure that replaces a pair of matched pieces (f, g) with a pair of “equivalent” smaller pieces (f′, g′). Unlike in [14], where the goal was to reduce the piece size to 𝒪(k^4), we aim to identify 𝒪(k^2) disjoint free blocks with 𝒪(k^3) non-free nodes within the replacement piece f′. Unfortunately, free blocks lack a canonical construction, and it would be tedious to maintain a specific selection while the forests change. Instead, we use the following notions:

Definition 5.4.

For a fixed threshold k ∈ ℤ_+, we say that a string S ∈ 𝖯Σ^* forms a periodic block if it satisfies the following properties:

  • |S| ≥ 42k, that is, the fragment is of length at least 42k, and

  • S has a string period of length at most 4k with equally many opening and closing parentheses.

For a string T ∈ 𝖯Σ^*, we denote by 𝐁_k(T) the set of fragments of T that are periodic blocks. For a context C = C_L; C_R, we denote by 𝐁_k(C) the disjoint union of 𝐁_k(C_L) and 𝐁_k(C_R).
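Definition 5.4 can be tested directly for a concrete string. The sketch below (hypothetical helper names, unlabeled parentheses) computes the shortest period via the KMP failure function and then scans candidate period lengths up to 4k for one that is balanced:

```python
def smallest_period(S):
    """Shortest p with S[i] == S[i + p] for all valid i,
    via the KMP failure function."""
    n = len(S)
    fail = [0] * n
    for i in range(1, n):
        j = fail[i - 1]
        while j and S[i] != S[j]:
            j = fail[j - 1]
        fail[i] = j + (S[i] == S[j])
    return n - fail[-1]

def is_periodic_block(S, k):
    """Definition 5.4: |S| >= 42k and S has a string period of length
    at most 4k with equally many opening and closing parentheses."""
    if len(S) < 42 * k:
        return False
    p = smallest_period(S)
    for q in range(p, 4 * k + 1):
        # q is a period of S iff shifting by q preserves every character
        if all(S[i] == S[i + q] for i in range(len(S) - q)):
            if S[:q].count('(') == S[:q].count(')'):
                return True
    return False
```

The scan over candidate lengths is quadratic in the worst case, which suffices for illustration; the actual algorithm locates periodic blocks in linear time.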

We partition the characters of T ∈ 𝖯Σ^* into black and red based on the family 𝐁_k(T).

Definition 5.5.

A character T[i] of a string T ∈ 𝖯Σ^* is black if there exists a periodic block T[l..r) ∈ 𝐁_k(T) such that i ∈ [l + 5k .. r − 5k). The remaining characters are red. We denote by 𝖻𝗅𝖺𝖼𝗄_k(T) and 𝗋𝖾𝖽_k(T) the sets of black and red characters of T.

For a context C = C_L; C_R, the sets 𝖻𝗅𝖺𝖼𝗄_k(C) = 𝖻𝗅𝖺𝖼𝗄_k(C_L) ⊔ 𝖻𝗅𝖺𝖼𝗄_k(C_R) and 𝗋𝖾𝖽_k(C) = 𝗋𝖾𝖽_k(C_L) ⊔ 𝗋𝖾𝖽_k(C_R) are defined as disjoint unions.
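Given the periodic blocks as intervals, the black/red partition of Definition 5.5 is a simple marking pass; the following sketch assumes the blocks are supplied as half-open intervals [l, r):

```python
def black_characters(T, k, blocks):
    """Mark position i black if some periodic block [l, r) satisfies
    l + 5k <= i < r - 5k (Definition 5.5); the rest stay red."""
    black = [False] * len(T)
    for (l, r) in blocks:
        for i in range(l + 5 * k, r - 5 * k):
            black[i] = True
    return black
```

Note that a block of length at least 42k always contributes black characters, since 42k > 10k: only the 5k-character margins on each side remain red.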

Piece Reduction.

The following definitions from [14] formalize the concept of equivalent pieces.

Definition 5.6 ([14, Definition 3.4]).

For a threshold k ∈ ℤ_{≥0} and a weight function w, forests P, P′ are called 𝗍𝖾𝖽^w_{≤k}-equivalent if

𝗍𝖾𝖽^w_{≤k}(F, G) = 𝗍𝖾𝖽^w_{≤k}(F[0..l_F) P′ F[r_F..|F|), G[0..l_G) P′ G[r_G..|G|))

holds for all forests F and G with matching pieces F[l_F..r_F) = P = G[l_G..r_G) satisfying |l_F − l_G| ≤ 2k.

Definition 5.7 ([14, Definition 3.9]).

For a threshold k ∈ ℤ_{≥0} and a weight function w, contexts P = P_L; P_R and P′ = P′_L; P′_R are called 𝗍𝖾𝖽^w_{≤k}-equivalent if

𝗍𝖾𝖽^w_{≤k}(F, G) = 𝗍𝖾𝖽^w_{≤k}(F[0..l_F) P′_L F[l′_F..r′_F) P′_R F[r_F..|F|), G[0..l_G) P′_L G[l′_G..r′_G) P′_R G[r_G..|G|))

holds for all forests F and G with matching pieces F[l_F..l′_F); F[r′_F..r_F) = P = G[l_G..l′_G); G[r′_G..r_G) satisfying |l_F − l_G| ≤ 2k and |r_F − r_G| ≤ 2k.

In the full version [27], we slightly modify the arguments of [14] to prove the following results:

Lemma 5.8 (see [14, Lemma 3.17]).

There is a linear-time algorithm that, given a forest P and a threshold k ∈ ℤ_+, computes a forest P′ with |P′| ≤ |P| and |𝗋𝖾𝖽_k(P′)| ≤ 158k^2 such that P and P′ are 𝗍𝖾𝖽^w_{≤k}-equivalent for every normalized quasimetric weight function w.

Lemma 5.9 (see [14, Lemma 3.18]).

There is a linear-time algorithm that, given a context P and a threshold k ∈ ℤ_+, computes a context P′ with |P′| ≤ |P| and |𝗋𝖾𝖽_k(P′)| ≤ 1152k^3 such that P and P′ are 𝗍𝖾𝖽^w_{≤k}-equivalent for every normalized quasimetric weight function w.

Complete Kernelization Algorithm.

In the full version [27], we combine the above ingredients to formally prove Theorem 5.1. Our procedure first applies Theorem 5.3. Then, for every pair of matched pieces (f, g), we use Lemma 5.8 or 5.9 (depending on the piece type) to obtain an equivalent piece P′ with 𝒪(k^3) red characters. The remaining (black) characters in P′ can be traced back to 𝒪(k^2) periodic blocks, and each periodic block within P′ yields a free block in the output forest F′ that covers the underlying black characters.

6 Summary

Theorem 1.1. [Restated, see original statement.]

There exists a deterministic algorithm that, given two forests F and G with n nodes in total, each with a label from an alphabet Σ, and oracle access to a normalized weight function w : (Σ ∪ {ε})^2 → ℝ_{≥0} satisfying the triangle inequality, determines the tree edit distance k := 𝗍𝖾𝖽^w(F, G) in 𝒪(n + k^6 log k) time.

Proof.

First, suppose that the task is to compute 𝗍𝖾𝖽^w_{≤k}(F, G) for a given threshold k. In this case, we use the algorithm of Theorem 5.1, resulting in a pair of forests F′, G′ of size 𝒪(k^5) such that 𝗍𝖾𝖽^w_{≤k}(F′, G′) = 𝗍𝖾𝖽^w_{≤k}(F, G), as well as a collection of 𝒪(k^3) free blocks in F′ with 𝒪(k^4) non-free characters. Based on this, the algorithm of Theorem 4.2 computes 𝗍𝖾𝖽^w_{≤k}(F′, G′) = 𝗍𝖾𝖽^w_{≤k}(F, G) in 𝒪(k^5 log k^5 + k^4 · k^2 log k^5 + k^3 · k^3 log k^5) = 𝒪(k^6 log k) time. Including the 𝒪(n) running time of Theorem 5.1, we get 𝒪(n + k^6 log k) total time.

In the absence of a given threshold, we consider a geometric sequence of thresholds (d_i)_{i≥0}, with d_i = 2^i · (n / log n)^{1/6}, and we compute 𝗍𝖾𝖽^w_{≤d_i}(F, G) for subsequent i ≥ 0 until 𝗍𝖾𝖽^w_{≤d_j}(F, G) ≤ d_j holds for some j ≥ 0, which indicates 𝗍𝖾𝖽^w(F, G) = 𝗍𝖾𝖽^w_{≤d_j}(F, G).

Since d_0 = 𝒪((n / log n)^{1/6}), the initial iteration costs 𝒪(n + d_0^6 log d_0) = 𝒪(n) time. Consequently, if k ≤ d_0, then the whole algorithm runs in 𝒪(n) time.

Due to d_i ≥ d_0 = (n / log n)^{1/6}, the running time of the i-th iteration is 𝒪(d_i^6 log d_i). This sequence grows geometrically, so the total running time of the algorithm is dominated by the running time of the last iteration, which is 𝒪(d_j^6 log d_j). If k > d_0, then j > 0 and, since the algorithm did not terminate one iteration earlier, d_j = 2d_{j−1} < 2k. Consequently, the overall running time is 𝒪(k^6 log k) when k > d_0.
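The threshold-doubling scheme in the proof can be sketched as a generic wrapper around a bounded solver. Here `bounded_ted` is a hypothetical oracle that returns the exact distance whenever it is at most the threshold (and some larger value otherwise):

```python
import math

def ted_via_doubling(bounded_ted, n):
    """Exponential search over thresholds d_i = 2^i * (n / log n)^(1/6):
    repeatedly call the bounded solver, doubling the threshold until the
    reported value is at most it, at which point the value is exact."""
    d = max(1, round((n / max(1.0, math.log(n))) ** (1 / 6)))
    while True:
        val = bounded_ted(d)
        if val <= d:
            return val
        d *= 2
```

Since the per-call cost grows geometrically in d, the last call dominates, mirroring the analysis above.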

References

  • [1] Shyan Akmal and Ce Jin. Faster algorithms for bounded tree edit distance. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 12:1–12:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.ICALP.2021.12.
  • [2] Tatsuya Akutsu. Tree edit distance problems: Algorithms and applications to bioinformatics. IEICE Trans. Inf. Syst., 93-D(2):208–218, 2010. doi:10.1587/TRANSINF.E93.D.208.
  • [3] Tatsuya Akutsu, Daiji Fukagawa, and Atsuhiro Takasu. Approximating tree edit distance through string edit distance. Algorithmica, 57(2):325–348, 2010. doi:10.1007/s00453-008-9213-z.
  • [4] Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). SIAM J. Comput., 47(3):1087–1097, 2018. doi:10.1137/15M1053128.
  • [5] John Bellando and Ravi Kothari. Region-based modeling and tree edit distance as a basis for gesture recognition. In 10th International Conference on Image Analysis and Processing (ICIAP 1999), 27-29 September 1999, Venice, Italy, pages 698–703. IEEE Computer Society, 1999. doi:10.1109/ICIAP.1999.797676.
  • [6] Philip Bille. A survey on tree edit distance and related problems. Theor. Comput. Sci., 337(1–3):217–239, June 2005. doi:10.1016/j.tcs.2004.12.030.
  • [7] Mahdi Boroujeni, Mohammad Ghodsi, MohammadTaghi Hajiaghayi, and Saeed Seddighin. 1+ϵ approximation of tree edit distance in quadratic time. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 709–720. ACM, 2019. doi:10.1145/3313276.3316388.
  • [8] Karl Bringmann, Paweł Gawrychowski, Shay Mozes, and Oren Weimann. Tree edit distance cannot be computed in strongly subcubic time (unless APSP can). ACM Trans. Algorithms, 16(4), July 2020. doi:10.1145/3381878.
  • [9] Peter Buneman, Martin Grohe, and Christoph Koch. Path queries on compressed XML. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, VLDB ’03, pages 141–152. VLDB Endowment, 2003. doi:10.1016/b978-012722442-8/50021-5.
  • [10] Alejandro Cassis, Tomasz Kociumaka, and Philip Wellnitz. Optimal algorithms for bounded weighted edit distance. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 2177–2187. IEEE, 2023. doi:10.1109/FOCS57990.2023.00135.
  • [11] Panagiotis Charalampopoulos, Pawel Gawrychowski, Shay Mozes, and Oren Weimann. On the hardness of computing the edit distance of shallow trees. In Diego Arroyuelo and Barbara Poblete, editors, String Processing and Information Retrieval - 29th International Symposium, SPIRE 2022, Concepción, Chile, November 8-10, 2022, Proceedings, volume 13617 of Lecture Notes in Computer Science, pages 290–302. Springer, 2022. doi:10.1007/978-3-031-20643-6_21.
  • [12] Sudarshan S. Chawathe. Comparing hierarchical data in external memory. In Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, and Michael L. Brodie, editors, VLDB’99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, pages 90–101. Morgan Kaufmann, 1999. URL: http://www.vldb.org/conf/1999/P8.pdf.
  • [13] Gregory Cobena, Serge Abiteboul, and Amélie Marian. Detecting changes in XML documents. In Rakesh Agrawal and Klaus R. Dittrich, editors, Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, USA, February 26 - March 1, 2002, pages 41–52. IEEE Computer Society, 2002. doi:10.1109/ICDE.2002.994696.
  • [14] Debarati Das, Jacob Gilbert, MohammadTaghi Hajiaghayi, Tomasz Kociumaka, and Barna Saha. Weighted edit distance computation: Strings, trees, and dyck. In Barna Saha and Rocco A. Servedio, editors, Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, Orlando, FL, USA, June 20-23, 2023, pages 377–390. ACM, 2023. doi:10.1145/3564246.3585178.
  • [15] Debarati Das, Jacob Gilbert, MohammadTaghi Hajiaghayi, Tomasz Kociumaka, Barna Saha, and Hamed Saleh. O~(n+poly(k))-time algorithm for bounded tree edit distance. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2022. IEEE, 2022. doi:10.1109/FOCS54457.2022.00071.
  • [16] Davi de Castro Reis, Paulo Braz Golgher, Altigran Soares da Silva, and Alberto H. F. Laender. Automatic web news extraction using tree edit distance. In Stuart I. Feldman, Mike Uretsky, Marc Najork, and Craig E. Wills, editors, Proceedings of the 13th international conference on World Wide Web, WWW 2004, New York, NY, USA, May 17-20, 2004, pages 502–511. ACM, 2004. doi:10.1145/988672.988740.
  • [17] Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann. An optimal decomposition algorithm for tree edit distance. ACM Trans. Algorithms, 6(1), December 2010. doi:10.1145/1644015.1644017.
  • [18] Anita Dürr. Improved bounds for rectangular monotone min-plus product and applications. Inf. Process. Lett., 181:106358, 2023. doi:10.1016/J.IPL.2023.106358.
  • [19] Dayi Fan, Rubao Lee, and Xiaodong Zhang. X-TED: massive parallelization of tree edit distance. Proc. VLDB Endow., 17(7):1683–1696, 2024. doi:10.14778/3654621.3654634.
  • [20] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. J. ACM, 57(1), November 2009. doi:10.1145/1613676.1613680.
  • [21] Dvir Fried, Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat, and Tatiana Starikovskaya. An improved algorithm for the k-dyck edit distance problem. ACM Trans. Algorithms, 20(3):26, 2024. doi:10.1145/3627539.
  • [22] Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, USA, 1997. doi:10.1017/cbo9780511574931.
  • [23] Matthias Höchsmann, Thomas Töller, Robert Giegerich, and Stefan Kurtz. Local similarity in RNA secondary structures. In 2nd IEEE Computer Society Bioinformatics Conference, CSB 2003, Stanford, CA, USA, August 11-14, 2003, pages 159–168. IEEE Computer Society, 2003. doi:10.1109/CSB.2003.1227315.
  • [24] Philip N. Klein. Computing the edit-distance between unrooted ordered trees. In Proceedings of the 6th Annual European Symposium on Algorithms, ESA ’98, pages 91–102, Berlin, Heidelberg, 1998. Springer-Verlag. doi:10.1007/3-540-68530-8_8.
  • [25] Philip N. Klein, Thomas B. Sebastian, and Benjamin B. Kimia. Shape matching using edit-distance: an implementation. In S. Rao Kosaraju, editor, Proceedings of the Twelfth Annual Symposium on Discrete Algorithms, January 7-9, 2001, Washington, DC, USA, pages 781–790. ACM/SIAM, 2001. URL: http://dl.acm.org/citation.cfm?id=365411.365779.
  • [26] Philip N. Klein, Srikanta Tirthapura, Daniel Sharvit, and Benjamin B. Kimia. A tree-edit-distance algorithm for comparing simple, closed shapes. In David B. Shmoys, editor, Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, January 9-11, 2000, San Francisco, CA, USA, pages 696–704. ACM/SIAM, 2000. URL: http://dl.acm.org/citation.cfm?id=338219.338628.
  • [27] Tomasz Kociumaka and Ali Shahali. Faster algorithm for bounded tree edit distance in the low-distance regime, 2025. doi:10.48550/arXiv.2507.02701.
  • [28] Gad M. Landau and Uzi Vishkin. Fast string matching with k differences. Journal of Computer and System Sciences, 37(1):63–78, 1988. doi:10.1016/0022-0000(88)90045-1.
  • [29] Xiao Mao. Breaking the cubic barrier for (unweighted) tree edit distance. In 62nd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2021, Denver, CO, USA, February 7-10, 2022, pages 792–803. IEEE, 2021. doi:10.1109/FOCS52979.2021.00082.
  • [30] J. Ian Munro and Venkatesh Raman. Succinct representation of balanced parentheses and static trees. SIAM J. Comput., 31(3):762–776, 2001. doi:10.1137/S0097539799364092.
  • [31] Eugene W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1(2):251–266, 1986. doi:10.1007/BF01840446.
  • [32] Gonzalo Navarro and Kunihiko Sadakane. Fully functional static and dynamic succinct trees. ACM Trans. Algorithms, 10(3):16:1–16:39, 2014. doi:10.1145/2601073.
  • [33] Jakob Nogler, Adam Polak, Barna Saha, Virginia Vassilevska Williams, Yinzhan Xu, and Christopher Ye. Faster weighted and unweighted tree edit distance and APSP equivalence. In 57th Annual ACM Symposium on Theory of Computing, STOC 2025, pages 2167–2178. ACM, 2025. doi:10.1145/3717823.3718116.
  • [34] Mateusz Pawlik and Nikolaus Augsten. Efficient computation of the tree edit distance. ACM Trans. Database Syst., 40(1):3:1–3:40, 2015. doi:10.1145/2699485.
  • [35] Mateusz Pawlik and Nikolaus Augsten. Tree edit distance: Robust and memory-efficient. Inf. Syst., 56:157–173, 2016. doi:10.1016/J.IS.2015.08.004.
  • [36] Thomas B. Sebastian, Philip N. Klein, and Benjamin B. Kimia. Recognition of shapes by editing their shock graphs. IEEE Trans. Pattern Anal. Mach. Intell., 26(5):550–571, 2004. doi:10.1109/TPAMI.2004.1273924.
  • [37] Masoud Seddighin and Saeed Seddighin. 3+ϵ approximation of tree edit distance in truly subquadratic time. In 13th Innovations in Theoretical Computer Science Conference, ITCS 2022, volume 215, pages 115:1–115:22, 2022. doi:10.4230/LIPIcs.ITCS.2022.115.
  • [38] Stanley M. Selkow. The tree-to-tree editing problem. Information Processing Letters, 6(6):184–186, 1977. doi:10.1016/0020-0190(77)90064-3.
  • [39] Bruce A. Shapiro and Kaizhong Zhang. Comparing multiple RNA secondary structures using tree comparisons. Comput. Appl. Biosci., 6(4):309–318, 1990. doi:10.1093/bioinformatics/6.4.309.
  • [40] Raghavendra Sridharamurthy, Talha Bin Masood, Adhitya Kamakshidasan, and Vijay Natarajan. Edit distance between merge trees. IEEE Trans. Vis. Comput. Graph., 26(3):1518–1531, 2020. doi:10.1109/TVCG.2018.2873612.
  • [41] Kuo-Chung Tai. The tree-to-tree correction problem. J. ACM, 26(3):422–433, July 1979. doi:10.1145/322139.322143.
  • [42] Hélène Touzet. A linear tree edit distance algorithm for similar ordered trees. In Proceedings of the 16th Annual Conference on Combinatorial Pattern Matching, CPM’05, pages 334–345, Berlin, Heidelberg, 2005. Springer-Verlag. doi:10.1007/11496656_29.
  • [43] Esko Ukkonen. Algorithms for approximate string matching. Information and Control, 64(1):100–118, 1985. International Conference on Foundations of Computation Theory. doi:10.1016/S0019-9958(85)80046-2.
  • [44] Yuan Wang, David J. DeWitt, and Jin-yi Cai. X-diff: An effective change detection algorithm for XML documents. In Umeshwar Dayal, Krithi Ramamritham, and T. M. Vijayaraman, editors, Proceedings of the 19th International Conference on Data Engineering, March 5-8, 2003, Bangalore, India, pages 519–530. IEEE Computer Society, 2003. doi:10.1109/ICDE.2003.1260818.
  • [45] Michael S. Waterman. Introduction to computational biology - maps, sequences, and genomes: interdisciplinary statistics. CRC Press, 1995.
  • [46] R. Ryan Williams. Faster all-pairs shortest paths via circuit complexity. SIAM J. Comput., 47(5):1965–1985, 2018. doi:10.1137/15M1024524.
  • [47] Yoshiyuki Yamamoto, Kouichi Hirata, and Tetsuji Kuboyama. Tractable and intractable variations of unordered tree edit distance. Int. J. Found. Comput. Sci., 25(3):307–330, 2014. doi:10.1142/S0129054114500154.
  • [48] Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, and Peter Clark. Answer extraction as sequence tagging with tree edit distance. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff, editors, Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, pages 858–867. The Association for Computational Linguistics, 2013. URL: https://aclanthology.org/N13-1106/.
  • [49] Kaizhong Zhang and Dennis E. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput., 18(6):1245–1262, 1989. doi:10.1137/0218082.