Faster Algorithm for Bounded Tree Edit Distance in the Low-Distance Regime
Abstract
The tree edit distance is a natural dissimilarity measure between rooted ordered trees whose nodes are labeled over an alphabet $\Sigma$. It is defined as the minimum number of node edits – insertions, deletions, and relabelings – required to transform one tree into the other. The weighted variant assigns costs to edits (based on node labels), minimizing total cost rather than edit count.
The unweighted tree edit distance between two trees of total size $n$ can be computed in $\tilde{O}(n^{(3+\omega)/2}) \le O(n^{2.687})$ time; in contrast, determining the weighted tree edit distance is fine-grained equivalent to the All-Pairs Shortest Paths (APSP) problem and requires $n^{3-o(1)}$ time barring a breakthrough for APSP [Nogler, Polak, Saha, Vassilevska Williams, Xu, Ye; STOC’25]. These impractical super-quadratic times for large, similar trees motivate the bounded version, parameterizing the runtime by the distance $k$ to enable faster algorithms for $k \ll n$.
Prior algorithms for bounded unweighted tree edit distance achieve $O(nk^2\log n)$ time [Akmal & Jin; ICALP’21] and $O(n + k^7\log k)$ time [Das, Gilbert, Hajiaghayi, Kociumaka, Saha; STOC’23]. For the weighted variant, only an $O(n + k^{15})$-time algorithm is known [Das, Gilbert, Hajiaghayi, Kociumaka, Saha; STOC’23].
We present an $O(n + k^6\log k)$-time algorithm for bounded tree edit distance in both the weighted and unweighted settings. First, we devise a simpler weighted $O(nk^2\log n)$-time algorithm. Next, we exploit periodic structures in the input trees via an optimized universal kernel: we modify the prior $O(n)$-time kernels of size $O(k^5)$ to generate such structured instances, enabling efficient analysis.
Keywords and phrases: tree edit distance, edit distance, kernelization, dynamic programming
Funding: Ali Shahali: The work was carried out mostly during a summer internship at the Max Planck Institute for Informatics.
2012 ACM Subject Classification: Theory of computation → Pattern matching
Editors: Anne Benoit, Haim Kaplan, Sebastian Wild, and Grzegorz Herman
Series and Publisher: Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
The edit distance between two strings – the minimum cost of insertions, deletions, and substitutions needed to transform one string into the other – is one of the most widely used string similarity measures with numerous algorithmic applications. It provides a robust model for comparing sequential data and underpins techniques in fields such as computational biology, natural language processing, and text correction. Nevertheless, many data types exhibit hierarchical rather than linear structure. For example, RNA secondary structures, syntactic parse trees of natural language sentences, and hierarchical representations of documents and code all store information in tree-like forms. One can linearize such data and still use string edit distance, but the resulting alignments disregard the original hierarchical structure, and thus more expressive similarity measures are needed in most scenarios.
First introduced by Selkow [38], tree edit distance generalizes the notion of string edit distance to rooted ordered trees and forests with nodes labeled over an alphabet $\Sigma$. It quantifies the dissimilarity between two forests as the minimal cost of a sequence of node edits – insertions, deletions, and relabelings – required to transform one forest into the other. This distance naturally captures both structural and label-based differences and serves as a core primitive in various algorithmic and applied domains, including computational biology [22, 39, 23, 45], analysis of structured data (such as XML and JSON files) [9, 13, 12, 20, 44], image processing [5, 26, 25, 36], and information extraction [16, 48]; see [6, 2] for surveys.
Tai [41] proposed the first polynomial-time algorithm for computing the tree edit distance: a dynamic programming procedure running in $O(n^6)$ time, where $n$ is the total number of nodes in the input forests. In the following decades, a sequence of works progressively improved the runtime. Zhang and Shasha [49] introduced an $O(n^4)$-time algorithm, and then Klein [24] brought the complexity down to $O(n^3\log n)$. Subsequently, Demaine, Mozes, Rossman, and Weimann [17] presented an $O(n^3)$-time solution, whereas Bringmann, Gawrychowski, Mozes, and Weimann [8] proved that a hypothetical $O(n^{3-\epsilon})$-time algorithm (for any constant $\epsilon > 0$) would violate the All-Pairs Shortest Paths (APSP) hypothesis in fine-grained complexity. Very recently, Nogler, Polak, Saha, Vassilevska Williams, Xu, and Ye [33] showed that computing the tree edit distance is, in fact, equivalent to the APSP problem, and the $n^3/2^{\Omega(\sqrt{\log n})}$ running time can be inherited from the state of the art for the latter task [46].
The aforementioned conditional lower bounds apply to the weighted tree edit distance only, where the cost of each edit depends on the involved labels. In the unweighted version with unit costs, a series of recent results achieved truly subcubic running time using fast matrix multiplication. Mao [29] presented an $O(n^{2.9546})$-time solution, which was subsequently improved by Dürr [18] to $O(n^{2.9149})$ and by Nogler et al. [33] to $\tilde{O}(n^{(3+\omega)/2})$, where $\omega < 2.3714$ is the fast matrix multiplication exponent and $\tilde{O}(\cdot)$ hides subpolynomial factors. The state-of-the-art conditional lower bound, inherited from string edit distance [4] and assuming the Orthogonal Vectors Hypothesis, prohibits $O(n^{2-\epsilon})$-time algorithms already if the alphabet is of constant size.
Bounded Tree Edit Distance.
Even though the decades of research resulted in significantly faster tree edit distance algorithms, they also revealed fundamental challenges that explain the difficulty of this problem. A natural way of circumventing these barriers, paved by classical work for string edit distance [43, 31, 28], is to parameterize the running time not only in terms of the length $n$ of the input forests but also the value $k$ of the computed distance. This version of tree edit distance has originally been studied in the unweighted setting only. Touzet [42] adapted the ideas of Zhang and Shasha [49] to derive an $O(nk^3)$-time algorithm, whereas Akmal and Jin [1] improved the running time to $O(nk^2\log n)$ based on the approach of Klein [24]. The latter running time remains the state of the art for medium distances. In the low-distance regime, Das, Gilbert, Hajiaghayi, Kociumaka, Saha, and Saleh [15] achieved the first solution with running time of the form $O(n + \mathrm{poly}(k))$ and subsequently improved the running time to $O(n + k^7\log k)$ [14]. They also studied the weighted version of the problem and presented an $O(n + k^{15})$-time algorithm assuming that the weight function is normalized (each edit costs at least one unit; this prohibits arbitrary scaling) and satisfies the triangle inequality.
Still, the running times for tree edit distance are much larger than for string edit distance and the related Dyck edit distance problem, which asks for the minimum cost of edits needed to make a given string of parentheses balanced (so that it represents a node-labeled forest). In the unweighted setting, the state-of-the-art running times are $O(n + k^2)$ for string edit distance [28] and $O(n + k^{4.782})$ for Dyck edit distance [21]. In the presence of weights, these complexities increase to $\tilde{O}(n + \sqrt{nk^3})$ [10] and $O(n + k^{12})$ [14], respectively. Thus, tree edit distance is a natural candidate for improvements.
Our Results.
As the main contribution, we bring the time complexity of tree edit distance to $O(n + k^6\log k)$. Even for unit weights, this improves upon the state of the art whenever $n^{1/7} \ll k \ll n^{1/4}$.
Theorem 1.1.
There exists a deterministic algorithm that, given two forests $F$ and $G$ with $n$ nodes in total, each with a label from an alphabet $\Sigma$, and oracle access to a normalized weight function $w : \overline{\Sigma} \times \overline{\Sigma} \to \mathbb{R}_{\ge 0}$ satisfying the triangle inequality, determines the tree edit distance $k = \mathsf{ted}^w(F, G)$ in $O(n + k^6\log k)$ time.
We also prove that the $O(nk^2\log n)$ running time by Akmal and Jin [1] remains valid in the presence of weights. This approach gives the best complexity when $k \gtrsim n^{1/4}$.
Theorem 1.2.
There exists a deterministic algorithm that, given two forests $F$ and $G$ with $n$ nodes in total, each with a label from an alphabet $\Sigma$, and oracle access to a normalized weight function $w : \overline{\Sigma} \times \overline{\Sigma} \to \mathbb{R}_{\ge 0}$, determines $k = \mathsf{ted}^w(F, G)$ in $O(nk^2\log n)$ time.
We believe that the original approach of Akmal and Jin [1] supports the weighted setting with minimal adjustments; nevertheless, we prove Theorem 1.2 using a slightly different algorithm that is better suited for further optimizations. Both solutions are variants of Klein’s dynamic programming [24] with some states pruned. The main difference is that we view the input forests as balanced sequences of parentheses and cast the tree edit distance as the minimum cost of a string alignment satisfying a certain consistency property. In our interpretation (Section 3), it is almost trivial to see that the classic pruning rules, originally devised to compute string edit distance in $O(nk)$ time [43, 31], can be reused to compute tree edit distance in $O(nk^2\log n)$ time. In contrast, Akmal and Jin [1, Lemma 9] spend over four pages (on top of the analysis of [24]) to bound the number of states surviving their pruning rules.
Theorem 1.2 combined with the results of [14] lets us derive an $O(n + k^7\log k)$-time algorithm for weighted tree edit distance, matching the previously best running time for unit weights. This is due to the universal kernel that transforms the input forests to equivalent forests of size $O(k^5)$ preserving the capped tree edit distance value $\mathsf{ted}^w_{\le k}$:
Theorem 1.3 ([14, Corollary 3.20]).
There exists a linear-time algorithm that, given forests $F, G$ and an integer $k \ge 1$, constructs forests $F', G'$ of size $O(k^5)$ such that $\mathsf{ted}^w_{\le k}(F, G) = \mathsf{ted}^w_{\le k}(F', G')$ holds for every normalized weight function $w$ satisfying the triangle inequality.
To bring the time complexity from $O(n + k^7\log k)$ down to $O(n + k^6\log k)$, we adapt both Theorems 1.2 and 1.3. The main novel insight is that the dynamic-programming procedure behind Theorem 1.2 behaves predictably while processing regions with certain repetitive (periodic) structure. Upon an appropriate relaxation of the invariant that the dynamic-programming values satisfy, processing each repetition of the period can be interpreted as a single min-plus matrix-vector multiplication. When the period repeats many times, the same matrix is used each time, and thus we can raise the matrix to an appropriate power (with fast exponentiation) and then compute a single matrix-vector product; see Section 4 for details. This can be seen as a natural counterpart of an optimization used in [10] for string edit distance.¹
¹ The matrices arising in the string edit distance computation in [10] are Monge matrices, which allows for computing each matrix-vector product in $O(k)$ time and each matrix-matrix product in $O(k^2)$ time. In the context of tree edit distance, the Monge property is no longer satisfied, so these complexities increase to $O(k^2)$ and $O(k^3)$, respectively; in fact, computing the (weighted) tree edit distance is, in general, as hard as computing the min-plus product of two arbitrary matrices [33].
Intuitively, the worst-case instances for the kernelization algorithm of Theorem 1.3 are close to having the necessary structure for our optimization to improve the worst-case running time. Unfortunately, the implementation in [14] controls the forest sizes only, and it needs to be modified to keep track of a more subtle instance difficulty measure. The biggest challenge is that the original algorithm proceeds in logarithmically many steps, each shrinking the forests by a constant fraction. In every step, the forests are partitioned into $O(k)$ pieces of size $O(n/k)$, and a constant fraction of the pieces is shrunk to size $O(k^4)$ each. A partition into pieces of equal “difficulty” would be much more challenging, so we instead take a different approach that lets us implement the reduction in a single step; see Section 5 for details. Notably, the kernelization algorithms for weighted string and Dyck edit distance already have single-step implementations in [14], but they crucially rely on the fact that the unit-weight variants of these problems can be solved faster, unlike for tree edit distance.
Related Work
The classical definition of tree edit distance [38] involves labeled rooted trees (or forests) with node labels. Many other variants have also been studied allowing, among others, unordered [47, 40], unrooted [40], and edge-labeled [3] trees; see also the surveys [2, 6].
The tree edit distance problem becomes easier not only when the computed distance is small but also when the input forests are of small depth [49, 11] or when the approximate distance suffices [3, 7, 37]. There is also a body of work focusing on practical performance and empirical evaluation, including sequential [34, 35] and parallel [19] implementations.
2 Preliminaries
Consistently with [14], we identify forests with the underlying Euler tours, interpreted as balanced strings of parentheses. This representation is at the heart of space-efficient data structures on trees [30, 32], and it allows for a simple definition of tree edits and a seamless transfer of many tools from strings to forests. Notably, Klein’s algorithm [24] already uses substrings of the Euler tours of the input forests to identify the dynamic-programming states.
For an alphabet $\Sigma$, let $P_\Sigma = \{(_a,\, )_a : a \in \Sigma\}$ denote the set of parentheses with labels in $\Sigma$. As formalized next, a forest with node labels over $\Sigma$ is a balanced string of parentheses in $P_\Sigma$.
Definition 2.1.
The set $\mathcal{F}_\Sigma$ of forests with labels over $\Sigma$ (balanced strings over $P_\Sigma$) is the smallest subset of $P_\Sigma^*$ satisfying the following conditions:
-
$\varepsilon \in \mathcal{F}_\Sigma$,
-
$F G \in \mathcal{F}_\Sigma$ for every $F, G \in \mathcal{F}_\Sigma$,
-
$(_a\, F\, )_a \in \mathcal{F}_\Sigma$ for every $F \in \mathcal{F}_\Sigma$ and $a \in \Sigma$.
For a forest $F \in \mathcal{F}_\Sigma$, we define the set of nodes $V(F)$ as the set of intervals $[i..j] \subseteq [1..|F|]$ such that $F[i]$ is an opening parenthesis, $F[j]$ is the matching closing parenthesis, and $F[i..j]$ is balanced. A forest $F$ is a tree if $[1..|F|] \in V(F)$. For $0 \le i \le j \le |F|$, we write $F(i..j]$ for the fragment $F[i{+}1] \cdots F[j]$. For a node $v \in V(F)$, we denote the positions of the opening and the closing parenthesis by $o(v)$ and $c(v)$. The label of $v$, that is, the character $a \in \Sigma$ such that $F[o(v)] = (_a$ and $F[c(v)] = )_a$, is denoted by $\mathsf{lbl}(v)$. A simple inductive argument shows that $V(F)$ forms a laminar family and, for every position $i \in [1..|F|]$, there is a unique node $v$ such that $i \in \{o(v), c(v)\}$; we denote this node by $\mathsf{node}_F(i)$. We also write $\bar{i}$ for the paired parenthesis, that is, $\{i, \bar{i}\} = \{o(v), c(v)\}$ for $v = \mathsf{node}_F(i)$.
We say that a node $v$ is contained in a fragment $F(i..j]$ of a forest $F$ if $i < o(v) \le c(v) \le j$; we denote by $V(F(i..j])$ the set of nodes contained in $F(i..j]$. Moreover, $v$ enters $F(i..j]$ if $o(v) \le i < c(v) \le j$, and $v$ exits $F(i..j]$ if $i < o(v) \le j < c(v)$. In either of these two cases, we also say that $v$ straddles $F(i..j]$.
We denote by $F\langle i..j\rangle$ the subforest of $F$ induced by $F(i..j]$, which we obtain from $F$ by deleting (the parentheses corresponding to) all nodes except for those contained in $F(i..j]$. Alternatively, one can obtain $F\langle i..j\rangle$ from $F(i..j]$ by deleting the opening parenthesis of every node that exits $F(i..j]$ and the closing parenthesis of every node that enters $F(i..j]$.
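The parenthesis representation above is easy to manipulate directly. The following minimal sketch (an illustration, not code from the paper) recovers the pairing $i \mapsto \bar{i}$ and the node set $V(F)$ from a forest given as a list of labeled parentheses; the tuple encoding `('(', 'a')` is an assumption made for this example.

```python
# Sketch (not from the paper): a forest as a balanced string of labeled
# parentheses. A character is a pair (kind, label) with kind '(' or ')'.
# match_parentheses realizes the map i -> i-bar; each node is [o(v)..c(v)].

def match_parentheses(forest):
    """Return pair[i] = position of the parenthesis matched with forest[i]."""
    pair = [None] * len(forest)
    stack = []
    for i, (kind, label) in enumerate(forest):
        if kind == '(':
            stack.append(i)
        else:
            j = stack.pop()
            assert forest[j][1] == label, "matched parentheses share a label"
            pair[i], pair[j] = j, i
    assert not stack, "the string must be balanced"
    return pair

def nodes(forest):
    """List the nodes (o(v), c(v), label) in left-to-right order of o(v)."""
    pair = match_parentheses(forest)
    return [(i, pair[i], forest[i][1])
            for i in range(len(forest)) if forest[i][0] == '(']

# The forest (a (b )b )a (c )c has three nodes.
F = [('(', 'a'), ('(', 'b'), (')', 'b'), (')', 'a'), ('(', 'c'), (')', 'c')]
```

Here positions are 0-based for convenience, whereas the paper indexes strings from 1.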
2.1 Tree Edits, Forest Alignments, and Tree Edit Distance
For an alphabet $\Sigma$, we define $\overline{\Sigma} = \Sigma \cup \{\varepsilon\}$, where $\varepsilon$ is the empty string. We say that a function $w : \overline{\Sigma} \times \overline{\Sigma} \to \mathbb{R}_{\ge 0}$ is normalized if $w(a, a) = 0$ and $w(a, b) \ge 1$ hold for all distinct $a, b \in \overline{\Sigma}$.
Tree edit distance is classically defined using elementary edits transforming a forest $F \in \mathcal{F}_\Sigma$:
- Node insertion
-
produces $F(0..i] \cdot (_a \cdot F(i..j] \cdot )_a \cdot F(j..|F|]$ for a balanced fragment $F(i..j]$ and a label $a \in \Sigma$, at cost $w(\varepsilon, a)$.
- Node relabeling
-
produces $F'$ with $F'[o(v)] = (_a$, $F'[c(v)] = )_a$, and $F'[x] = F[x]$ elsewhere, for a node $v \in V(F)$ and a label $a \in \Sigma$, at cost $w(\mathsf{lbl}(v), a)$.
- Node deletion
-
produces $F(0..o(v){-}1] \cdot F(o(v)..c(v){-}1] \cdot F(c(v)..|F|]$ for a node $v \in V(F)$, at cost $w(\mathsf{lbl}(v), \varepsilon)$.
The tree edit distance $\mathsf{ted}^w(F, G)$ of two forests $F, G \in \mathcal{F}_\Sigma$ is then defined as the minimum cost of a sequence of edits transforming $F$ to $G$. In this context, without loss of generality, we can replace $w$ by its transitive closure. If, say, $w(a, \varepsilon) > w(a, b) + w(b, \varepsilon)$, then instead of directly deleting a node with label $a$, it is more beneficial to first change its label to $b$ and only then perform the deletion. When $w$ satisfies the triangle inequality, we are guaranteed that an inserted or relabeled node is never modified (deleted or relabeled) again. Consistently with modern literature [8, 14, 33], we use a more general alignment-based definition of $\mathsf{ted}^w$ that enforces the latter condition even if $w$ does not necessarily satisfy the triangle inequality.
Definition 2.2 (Alignment Graph [10]).
For strings $X, Y \in \Sigma^*$ and a weight function $w : \overline{\Sigma} \times \overline{\Sigma} \to \mathbb{R}_{\ge 0}$, we define the alignment graph $\mathsf{AG}^w(X, Y)$ as a grid graph with vertices $(x, y)$ for $x \in [0..|X|]$ and $y \in [0..|Y|]$ and the following directed edges:
-
horizontal edges $(x, y{-}1) \to (x, y)$ of cost $w(\varepsilon, Y[y])$ for $x \in [0..|X|]$ and $y \in [1..|Y|]$,
-
vertical edges $(x{-}1, y) \to (x, y)$ of cost $w(X[x], \varepsilon)$ for $x \in [1..|X|]$ and $y \in [0..|Y|]$, and
-
diagonal edges $(x{-}1, y{-}1) \to (x, y)$ of cost $w(X[x], Y[y])$ for $x \in [1..|X|]$ and $y \in [1..|Y|]$.
The alignment graph allows for a concise definition of a string alignment.
Definition 2.3 (Alignment).
For strings $X, Y \in \Sigma^*$ and a weight function $w$, an alignment of $X(x..x']$ onto $Y(y..y']$, denoted by $\mathcal{A} : X(x..x'] \twoheadrightarrow Y(y..y']$, is a path from $(x, y)$ to $(x', y')$ in $\mathsf{AG}^w(X, Y)$, interpreted as a sequence of vertices. The cost $\mathsf{cost}^w(\mathcal{A})$ of the alignment is the total cost of the edges that belong to $\mathcal{A}$.
We write $\mathsf{A}(X(x..x'], Y(y..y'])$ for the set of all alignments of $X(x..x']$ onto $Y(y..y']$.
The edges of an alignment can be interpreted as follows:
-
If $\mathcal{A}$ includes an edge $(x{-}1, y) \to (x, y)$ for some $x \in [1..|X|]$ and $y \in [0..|Y|]$, then $\mathcal{A}$ deletes $X[x]$, denoted by $X[x] \to \varepsilon$.
-
If $\mathcal{A}$ includes an edge $(x, y{-}1) \to (x, y)$ for some $x \in [0..|X|]$ and $y \in [1..|Y|]$, then $\mathcal{A}$ inserts $Y[y]$, denoted by $\varepsilon \to Y[y]$.
-
If $\mathcal{A}$ includes an edge $(x{-}1, y{-}1) \to (x, y)$ for some $x \in [1..|X|]$ and $y \in [1..|Y|]$, then $\mathcal{A}$ aligns $X[x]$ to $Y[y]$, denoted by $X[x] \to Y[y]$. If $X[x] \ne Y[y]$, then $\mathcal{A}$ substitutes $Y[y]$ for $X[x]$. If $X[x] = Y[y]$, then $\mathcal{A}$ matches $X[x]$ with $Y[y]$.
Insertions, deletions, and substitutions are jointly called character edits.
The weighted edit distance of fragments $X(x..x']$ and $Y(y..y']$ with respect to a weight function $w$ is defined as the minimum cost of an alignment: $\mathsf{ed}^w(X(x..x'], Y(y..y']) = \min\{\mathsf{cost}^w(\mathcal{A}) : \mathcal{A} \in \mathsf{A}(X(x..x'], Y(y..y'])\}$.
We often consider alignments of the entire string $X$ onto the entire string $Y$; we then use simplified notation including $\mathsf{A}(X, Y)$, $\mathcal{A} : X \twoheadrightarrow Y$, and $\mathsf{ed}^w(X, Y)$.
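The cheapest path through the alignment graph is exactly the textbook edit-distance dynamic program. The following sketch (an illustration, not code from the paper) computes $\mathsf{ed}^w(X, Y)$; representing $\varepsilon$ by `None` is an assumption made for this example.

```python
# Sketch (not from the paper): D[x][y] is the cheapest cost of a path from
# (0, 0) to (x, y) in AG^w(X, Y), so D[|X|][|Y|] = ed^w(X, Y).

def edit_distance(X, Y, w):
    n, m = len(X), len(Y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for x in range(1, n + 1):
        D[x][0] = D[x - 1][0] + w(X[x - 1], None)       # vertical: delete X[x]
    for y in range(1, m + 1):
        D[0][y] = D[0][y - 1] + w(None, Y[y - 1])       # horizontal: insert Y[y]
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            D[x][y] = min(D[x - 1][y] + w(X[x - 1], None),          # delete
                          D[x][y - 1] + w(None, Y[y - 1]),          # insert
                          D[x - 1][y - 1] + w(X[x - 1], Y[y - 1]))  # align
    return D[n][m]

# A normalized weight function: matches are free, every edit costs one unit.
unit = lambda a, b: 0 if a == b else 1
```

With the `unit` weight function, this recovers the classic Levenshtein distance.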
Definition 2.4 (Forest alignment).
Consider forests $F, G \in \mathcal{F}_\Sigma$. An alignment $\mathcal{A} \in \mathsf{A}(F(i..j], G(i'..j'])$ is a forest alignment if it satisfies the following consistency condition:
For every two aligned characters $F[x] \to G[y]$, also $F[\bar{x}] \to G[\bar{y}]$ (in particular, $\bar{x} \in (i..j]$ and $\bar{y} \in (i'..j']$).
We write $\mathsf{TA}(F(i..j], G(i'..j'])$ for the set of forest alignments.
Remark 2.5.
Consider a forest alignment $\mathcal{A} \in \mathsf{TA}(F(i..j], G(i'..j'])$. Then,
-
$\mathcal{A}$ deletes every character $F[x]$ with $x \in (i..j]$ and $\bar{x} \notin (i..j]$,
-
$\mathcal{A}$ inserts every character $G[y]$ with $y \in (i'..j']$ and $\bar{y} \notin (i'..j']$.
Define $\overline{P_\Sigma} = P_\Sigma \cup \{\varepsilon\}$ and a mapping $\mathsf{lbl} : \overline{P_\Sigma} \to \overline{\Sigma}$ such that for each $a \in \Sigma$, $\mathsf{lbl}((_a) = a$ and $\mathsf{lbl}()_a) = a$, and $\mathsf{lbl}(\varepsilon) = \varepsilon$. For a weight function $w : \overline{\Sigma} \times \overline{\Sigma} \to \mathbb{R}_{\ge 0}$, we define a corresponding weight function $\hat{w} : \overline{P_\Sigma} \times \overline{P_\Sigma} \to \mathbb{R}_{\ge 0}$ so that $\hat{w}(x, y) = \frac{1}{2} w(\mathsf{lbl}(x), \mathsf{lbl}(y))$ for all $x, y \in \overline{P_\Sigma}$. The cost of a forest alignment $\mathcal{A}$ with respect to a weight function $w$ is defined as $\mathsf{cost}^w(\mathcal{A}) = \mathsf{cost}^{\hat{w}}(\mathcal{A})$. Moreover, we define $\mathsf{ted}^w(F, G) = \min\{\mathsf{cost}^w(\mathcal{A}) : \mathcal{A} \in \mathsf{TA}(F, G)\}$.
For a threshold $k \ge 0$, we set $\mathsf{ted}^w_{\le k}(F, G) = \mathsf{ted}^w(F, G)$ if $\mathsf{ted}^w(F, G) \le k$, and $\mathsf{ted}^w_{\le k}(F, G) = \infty$ otherwise.
By Remark 2.5, the following value is non-negative for every $\mathcal{A} \in \mathsf{TA}(F(i..j], G(i'..j'])$:
$\mathsf{cost}^w(\mathcal{A}) \;-\; \sum_{x \in (i..j] :\, \bar{x} \notin (i..j]} \hat{w}(F[x], \varepsilon) \;-\; \sum_{y \in (i'..j'] :\, \bar{y} \notin (i'..j']} \hat{w}(\varepsilon, G[y]).$
We naturally generalize $\mathsf{ted}^w$ to fragments, setting $\mathsf{ted}^w(F(i..j], G(i'..j']) = \min\{\mathsf{cost}^w(\mathcal{A}) : \mathcal{A} \in \mathsf{TA}(F(i..j], G(i'..j'])\}$.
Observation 2.6.
For all forests $F, G \in \mathcal{F}_\Sigma$, fragments $F(i..j]$ and $G(i'..j']$, and weight functions $w$, we have $\mathsf{ted}^w(F(i..j], G(i'..j']) \ge \mathsf{ted}^w(F\langle i..j\rangle, G\langle i'..j'\rangle)$.
3 $O(nk^2\log n)$-Time Algorithm
In this section, we reinterpret Klein’s algorithm [24] and develop its $O(nk^2\log n)$-time variant.
3.1 Klein’s Algorithm
Klein’s algorithm [24] uses dynamic programming to compute $\mathsf{ted}^w(F(i..j], G(i'..j'])$ for selected fragments $F(i..j]$ of $F$ and all fragments $G(i'..j']$ of $G$. We modify the algorithm slightly so that the computed value includes the costs of deleting the single parentheses of nodes straddling $F(i..j]$ and inserting the single parentheses of nodes straddling $G(i'..j']$.
The algorithm’s implementation is provided in Algorithm 1. We use an operator $x \xleftarrow{\min} y$ that assigns $x := y$ if $y < x$. We also remember which of these assignments were applied so that we can later trace back the optimal alignment based on this extra information.
In the corner case when $F(i..j]$ or $G(i'..j']$ is empty, the unique (and thus optimal) forest alignment pays for inserting or deleting all characters of the other fragment. If both $F(i..j]$ and $G(i'..j']$ are non-empty, the algorithm considers the nodes $u = \mathsf{node}_F(i{+}1)$, $v = \mathsf{node}_F(j)$, $u' = \mathsf{node}_G(i'{+}1)$, and $v' = \mathsf{node}_G(j')$.
Let us first suppose that $u$, $v$, $u'$, and $v'$ are contained in the respective fragments. If the subtree rooted at $v$ is at least as large as the subtree rooted at $u$, that is, $c(v) - o(v) \ge c(u) - o(u)$, then the algorithm considers three possibilities: $F[i{+}1]$ is deleted, $G[i'{+}1]$ is inserted, or $u$ is aligned with $u'$. In the first two possibilities, we can pay for the deleted or inserted opening parenthesis and align the remaining fragments; see Lines 6–7. In the third possibility, the consistency condition in the definition of the forest alignments requires that $F[c(u)]$ is also aligned with $G[c(u')]$. In particular, the node $u'$ needs to be contained in $G(i'..j']$ so that $o(u') = i'{+}1$ and $c(u') \le j'$. Thus, we align $F(i{+}1..c(u){-}1]$ with $G(i'{+}1..c(u'){-}1]$, align $F(c(u)..j]$ with $G(c(u')..j']$, and pay for aligning $u$ with $u'$, that is, $F[i{+}1]$ with $G[i'{+}1]$ and $F[c(u)]$ with $G[c(u')]$; see Line 9. If $c(v) - o(v) < c(u) - o(u)$, the algorithm handles $v$ and $v'$ in a symmetric way; see Lines 10–14.
If $u \notin V(F(i..j])$ or $u' \notin V(G(i'..j'])$, we follow Lines 5–9 and effectively consider only deleting $F[i{+}1]$ and inserting $G[i'{+}1]$. Else, if $v \notin V(F(i..j])$ or $v' \notin V(G(i'..j'])$, we follow the symmetric Lines 10–14.
Zhang and Shasha [49] use a very similar dynamic programming; in a baseline version, their algorithm always constructs the alignment from left to right instead of picking the side depending on $u$ and $v$. Klein’s optimization allows for an improved running time of $O(n^3\log n)$ instead of $O(n^4)$. In the full version [27], we recall the proof of the following lemma.
Lemma 3.1.
The recursive implementation of Algorithm 1 visits $O(n\log n)$ fragments of $F$.
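To make the recursion concrete, the following sketch implements the underlying forest-edit recurrence for unit costs on a tree-structured representation (a forest is a tuple of `(label, children)` pairs). This is an illustration of the recurrence shape only — not the paper's Algorithm 1, which works on parenthesis fragments and applies Klein's side-picking rule to bound the number of visited fragments.

```python
from functools import lru_cache

# Sketch (unit costs, not the paper's Algorithm 1): the classic forest edit
# distance recurrence memoized over subforest pairs. A forest is a tuple of
# trees; a tree is a pair (label, children).

@lru_cache(maxsize=None)
def ted(F, G):
    if not F and not G:
        return 0
    if not F:
        b, gc = G[-1]
        return 1 + ted(F, G[:-1] + gc)          # insert the rightmost root of G
    if not G:
        a, fc = F[-1]
        return 1 + ted(F[:-1] + fc, G)          # delete the rightmost root of F
    (a, fc), (b, gc) = F[-1], G[-1]
    return min(
        1 + ted(F[:-1] + fc, G),                # delete a
        1 + ted(F, G[:-1] + gc),                # insert b
        (a != b) + ted(fc, gc) + ted(F[:-1], G[:-1]),  # align a with b
    )

leaf = lambda a: (a, ())
F = (('a', (leaf('b'), leaf('c'))),)            # a(b, c)
G = (('a', (leaf('b'), leaf('d'))),)            # a(b, d)
```

Zhang–Shasha and Klein differ only in which subproblems this recurrence is allowed to generate; the recurrence itself is the same.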
Akmal and Jin’s Algorithm.
Akmal and Jin [1] reduce the time complexity of Klein’s algorithm to $O(nk^2\log n)$ when computing $\mathsf{ted}^w_{\le k}(F, G)$. Their solution prunes some dynamic-programming states so that, in the surviving states, the difference between the sizes of the subforests $F\langle i..j\rangle$ and $G\langle i'..j'\rangle$ is at most $2k$. This condition also holds for the difference between the sizes of $F(i..j]$ and $G(i'..j']$, as well as between $F(0..i]$ and $G(0..i']$; see [1, Lemma 12]. As shown in [1, Lemma 13], for each subforest of $F$, there are $O(k^2)$ subforests of $G$ satisfying these three conditions. With a careful implementation, the total number of states becomes $O(nk^2\log n)$.
3.2 Our Algorithm
Our variant of Klein’s algorithm simply prunes all states with $|i - i'| > 2k$ or $|j - j'| > 2k$. In other words, we modify Algorithm 1 so that $\mathsf{D}[i, j, i', j']$ is set to $\infty$ if either condition holds, and the existing instructions are executed otherwise. This does not increase the number of visited fragments of $F$; see Lemma 3.1. For each of these fragments, trivially, at most $O(k^2)$ fragments of $G$ survive pruning, so the total number of states is $O(nk^2\log n)$. The correctness of the pruning rules follows from the fact that, for all forests $F, G \in \mathcal{F}_\Sigma$ and normalized weight functions, the width of any alignment in $\mathsf{TA}(F, G)$ does not exceed twice its cost, where the width of an alignment $\mathcal{A}$ is defined as $\mathsf{width}(\mathcal{A}) = \max_{(x, y) \in \mathcal{A}} |x - y|$.
In the full version [27], we formalize this intuition using the notion of bounded forest alignments.
Definition 3.2.
Let us fix a threshold $k$ and consider fragments $F(i..j]$ and $G(i'..j']$ of forests $F, G \in \mathcal{F}_\Sigma$. We call a forest alignment $\mathcal{A} \in \mathsf{TA}(F(i..j], G(i'..j'])$ bounded if $\mathsf{width}(\mathcal{A}) \le 2k$. The family of bounded forest alignments is denoted by $\mathsf{TA}_k(F(i..j], G(i'..j'])$. For a weight function $w$, we denote $\mathsf{ted}^w_k(F(i..j], G(i'..j']) = \min\{\mathsf{cost}^w(\mathcal{A}) : \mathcal{A} \in \mathsf{TA}_k(F(i..j], G(i'..j'])\}$.
Lemma 3.3.
The values computed using the pruned version of Algorithm 1 satisfy $\mathsf{ted}^w(F(i..j], G(i'..j']) \le \mathsf{D}[i, j, i', j'] \le \mathsf{ted}^w_k(F(i..j], G(i'..j'])$.
In particular, the entry $\mathsf{D}[0, |F|, 0, |G|]$ stores a value between $\mathsf{ted}^w(F, G)$ and $\mathsf{ted}^w_k(F, G)$. The following observation implies that this is enough to retrieve $\mathsf{ted}^w_{\le k}(F, G)$.
Observation 3.4.
If $\mathsf{ted}^w(F, G) \le k$, then $\mathsf{ted}^w_k(F, G) = \mathsf{ted}^w(F, G)$.
Proof.
Consider an optimal forest alignment $\mathcal{A} \in \mathsf{TA}(F, G)$ with $\mathsf{cost}^w(\mathcal{A}) = \mathsf{ted}^w(F, G)$. For each edge of $\mathcal{A}$, either the value of $x - y$ stays intact (if the edge is diagonal) or it changes by one and the edge cost is at least $\frac{1}{2}$ (if the edge is vertical or horizontal). Since $\mathcal{A}$ starts at $(0, 0)$ and the total cost of its edges is at most $k$, every vertex $(x, y) \in \mathcal{A}$ satisfies $|x - y| \le 2k$, i.e., $\mathsf{width}(\mathcal{A}) \le 2k$ and $\mathcal{A} \in \mathsf{TA}_k(F, G)$. Hence, $\mathsf{ted}^w_k(F, G) \le \mathsf{ted}^w(F, G)$. The converse inequality holds trivially due to $\mathsf{TA}_k(F, G) \subseteq \mathsf{TA}(F, G)$.
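The string-edit-distance analogue of this pruning rule is easy to state in code. The following sketch (an illustration, not from the paper) discards all states with $|x - y| > t$; by the same width argument, with unit costs the result is still exact whenever the true distance is at most $t$.

```python
# Sketch (not from the paper): banded edit distance. States outside the
# diagonal band |x - y| <= t are pruned; surviving states number O(n * t).
INF = float('inf')

def banded_edit_distance(X, Y, t):
    n, m = len(X), len(Y)
    D = {(0, 0): 0}
    for x in range(n + 1):
        for y in range(m + 1):
            if abs(x - y) > t or (x, y) == (0, 0):
                continue
            best = INF
            if x > 0:
                best = min(best, D.get((x - 1, y), INF) + 1)        # delete X[x]
            if y > 0:
                best = min(best, D.get((x, y - 1), INF) + 1)        # insert Y[y]
            if x > 0 and y > 0:
                best = min(best, D.get((x - 1, y - 1), INF) + (X[x - 1] != Y[y - 1]))
            D[(x, y)] = best
    return D.get((n, m), INF)
```

In the tree setting, the paper applies the analogous rule to both fragment endpoints, which is why the band $2k$ appears twice (for $|i - i'|$ and $|j - j'|$).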
With minor implementation details needed to avoid logarithmic overheads for memoization using a sparse table (Akmal and Jin [1] ignore this issue), we achieve the following result.
Theorem 1.2. [Restated, see original statement.]
There exists a deterministic algorithm that, given two forests $F$ and $G$ with $n$ nodes in total, each with a label from an alphabet $\Sigma$, and oracle access to a normalized weight function $w : \overline{\Sigma} \times \overline{\Sigma} \to \mathbb{R}_{\ge 0}$, determines $k = \mathsf{ted}^w(F, G)$ in $O(nk^2\log n)$ time.
4 Faster Algorithm for Repetitive Inputs
In this section, we present an optimized version of our $O(nk^2\log n)$-time algorithm capable of exploiting certain repetitive structures within the input forests $F$ and $G$. The following notion of free pairs captures the structure that our algorithm is able to utilize.
Definition 4.1 (Free pair, free block).
Consider forests $F, G \in \mathcal{F}_\Sigma$ and a fixed threshold $k$. We call a pair of fragments $F(i..j]$ and $G(i'..j']$ a free pair if
-
$i - i' \ge 2k$ and $j' - j \ge 2k$, and
-
there exists a balanced string $Q$ with $0 < |Q| \le k$ such that $F(i..j] = Q^e$ and $G(i'..j'] = Q^{e'}$ hold for some integer exponents $e \le e'$, where $|Q|$ divides $i - i'$.
For that free pair, we call the fragment $F(i..j]$ a free block with period $Q$ and exponent $e$.
Our improved algorithm assumes that the input forests $F$ and $G$ are augmented with a collection $\mathcal{B}$ of disjoint free blocks, each associated with the underlying period $Q$, exponent $e$, and the corresponding fragment of $G$. The speed-up compared to the algorithm of Section 3 is noticeable if the free blocks in $\mathcal{B}$ jointly cover most of the characters of $F$, that is, the number $n_\circ$ of the remaining non-free characters is asymptotically smaller than $n$.
Theorem 4.2.
There exists a deterministic algorithm that, given forests $F, G \in \mathcal{F}_\Sigma$ of total length $n$, oracle access to a normalized weight function $w$, a threshold $k \ge 1$, and a collection $\mathcal{B}$ of disjoint free blocks in $F$ such that $n_\circ$ characters of $F$ are not contained in any free block, computes $\mathsf{ted}^w_{\le k}(F, G)$ in $O((n + n_\circ k^2 + |\mathcal{B}| \cdot k^3)\log n)$ time.
Intuitively, the algorithm skips all the free blocks, which allows reducing the $O(nk^2\log n)$ running time of Theorem 1.2 to $O(n_\circ k^2\log n)$, i.e., the algorithm needs to pay for the non-free characters only. Nevertheless, processing each free block takes $O(k^3\log n)$ extra time, and $O(n\log n)$ preprocessing time is still needed to avoid overheads for memoization.
Processing Free Blocks.
To understand our speed-up, let us consider a free pair of fragments $F(i..j]$ and $G(i'..j']$ with period $Q$ and exponent $e$, that is, $F(i..j] = Q^e$. We picked the parameters so that $i - i' \ge 2k$ and $j' - j \ge 2k$, and thus every alignment $\mathcal{A} \in \mathsf{TA}_k(F, G)$ aligns $F(i..j]$ with a fragment contained in $G(i'..j']$. By the same argument, the image of every copy of $Q$ within $F(i..j]$ is contained within the $2k$-neighborhood of the corresponding copy of $Q$ within $G(i'..j']$. Moreover, when we align $F(i..j]$ to a fragment of $G(i'..j']$, it suffices to partition that fragment into pieces and independently optimally align each copy of $Q$ in $F(i..j]$ with the corresponding piece of $G(i'..j']$. This is because $Q$ is balanced, so every node of $F$ contained in $F(i..j]$ is contained within a single copy of $Q$, and thus the consistency condition in Definition 2.4 does not impose any constraints affecting multiple copies of $Q$.
The optimal costs of aligning $Q$ with the relevant fragments of $G$ can be encoded in the following matrix, constructible in $O(k^3\log k)$ time using Algorithm 1 (Klein’s algorithm).
Definition 4.3.
For a balanced string $Q$, we define a matrix $M_Q$ of size $(4k{+}1) \times (4k{+}1)$ with indices in the range $[-2k..2k]$ as follows: for $m = 2\lceil 2k/|Q|\rceil + 1$ and $\ell = \frac{m-1}{2}|Q|$, we set $M_Q[\delta][\delta'] = \mathsf{ted}^w_k\big(Q^m(\ell..\ell{+}|Q|],\; Q^m(\ell{+}\delta..\ell{+}|Q|{+}\delta']\big)$, that is, the optimal bounded cost of aligning the middle copy of $Q$ within $Q^m$ onto the fragment whose endpoints are shifted by $\delta$ and $\delta'$, respectively.
In order to derive the optimal costs of aligning $Q^e$ with the relevant fragments of $G$, we simply compute the $e$-th power of $M_Q$ with respect to the min-plus product. Formally, the min-plus product of matrices $A$ and $B$ is a matrix $A \otimes B$ such that $(A \otimes B)[\delta][\delta'] = \min_{\delta''}\big(A[\delta][\delta''] + B[\delta''][\delta']\big)$ for all valid indices $\delta, \delta'$.
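Min-plus powers are computed exactly like ordinary matrix powers, with repeated squaring replacing multiplication. The following sketch (an illustration, not from the paper) shows the two building blocks; for $(4k{+}1)$-dimensional matrices each product costs $O(k^3)$ and the $e$-th power costs $O(k^3\log e)$.

```python
# Sketch (not from the paper): min-plus products and fast exponentiation,
# the tool used to process a free block with exponent e in O(log e) products.
INF = float('inf')

def min_plus(A, B):
    """Min-plus (tropical) product of two square matrices of equal size."""
    n = len(A)
    return [[min(A[x][y] + B[y][z] for y in range(n)) for z in range(n)]
            for x in range(n)]

def min_plus_power(M, e):
    """M raised to the e-th min-plus power (e >= 1) by repeated squaring."""
    result = None
    while e:
        if e & 1:
            result = M if result is None else min_plus(result, M)
        M = min_plus(M, M)
        e >>= 1
    return result
```

Since min-plus multiplication is associative, repeated squaring is valid here just as for ordinary matrix powers.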
For each free block $F(h..j] \in \mathcal{B}$ with period $Q$ and exponent $e$, we construct the matrix $M_Q^{\otimes e}$ in $O(k^3\log n)$ time using Algorithm 1 followed by fast exponentiation, which reduces to computing $O(\log e)$ min-plus products. We apply the matrix whenever we are tasked with filling an entry $\mathsf{D}[i, j, i', j']$ such that the free block is a prefix or a suffix of $F(i..j]$. Formally, if the free block $F(h..j]$ is a suffix of $F(i..j]$, then we use the following formula instead of following Algorithm 1:
$\mathsf{D}[i, j, i', j'] = \min_{\delta \in [-2k..2k]} \big(\mathsf{D}[i, h, i', h{+}\delta] + M_Q^{\otimes e}[\delta][j' - j]\big)$.  (1)
In the symmetric case, when the free block $F(i..h]$ is a prefix of $F(i..j]$, we apply
$\mathsf{D}[i, j, i', j'] = \min_{\delta \in [-2k..2k]} \big(M_Q^{\otimes e}[i' - i][\delta] + \mathsf{D}[h, j, h{+}\delta, j']\big)$.  (2)
In the full version [27], we formalize the intuition above to prove that Equations (1) and (2) preserve the invariant of Lemma 3.3, that is, $\mathsf{ted}^w(F(i..j], G(i'..j']) \le \mathsf{D}[i, j, i', j'] \le \mathsf{ted}^w_k(F(i..j], G(i'..j'])$.
For (1), this boils down to the following lemma; the case of (2) is symmetric.
Lemma 4.4.
Consider fragments $F(i..j]$, $G(i'..j']$ of forests $F, G \in \mathcal{F}_\Sigma$, a normalized weight function $w$, and a threshold $k$ such that $|i - i'| \le 2k$ and $|j - j'| \le 2k$. If there is a free pair with period $Q$ and exponent $e$ whose free block $F(h..j]$ is a suffix of $F(i..j]$, then $\mathsf{ted}^w_k(F(i..j], G(i'..j']) = \min_{\delta \in [-2k..2k]} \big(\mathsf{ted}^w_k(F(i..h], G(i'..h{+}\delta]) + M_Q^{\otimes e}[\delta][j' - j]\big)$.
We further argue in the full version [27] that the recursive implementation of Algorithm 1 augmented with the optimizations of (1) and (2) visits $O((n_\circ + |\mathcal{B}|)\log n)$ fragments of $F$, where $n_\circ$ is the number of non-free characters, including the fragments for which (1) or (2) applies. Specifically, our optimized algorithm visits the fragments visited by the original Algorithm 1 that satisfy the following additional property: every free block is either disjoint with or contained in the fragment. Our implementation takes $O(n\log n)$ extra preprocessing time to list the fragments visited by Algorithm 1 and filter those satisfying the aforementioned property.
5 Universal Kernel with Improved Repetitiveness Guarantees
In this section, we outline our approach to strengthen Theorem 1.3 into the following result:
Theorem 5.1.
There exists a linear-time algorithm that, given forests $F, G \in \mathcal{F}_\Sigma$ and an integer $k \ge 1$, constructs forests $F', G'$ of size $O(k^5)$ such that $\mathsf{ted}^w_{\le k}(F, G) = \mathsf{ted}^w_{\le k}(F', G')$ holds for every normalized quasimetric $w$ (weight function satisfying the triangle inequality), and a collection $\mathcal{B}$ of $O(k^3)$ disjoint free blocks in $F'$ with $O(k^4)$ non-free characters.
Compared to Theorem 1.3, we additionally require the presence of free blocks that jointly capture all but $O(k^4)$ characters of the output forest $F'$; consult Definition 4.1. At a very high level, the proofs of both Theorems 1.3 and 5.1 consist of three steps: decomposing the input forests $F$ and $G$ into several pieces, identifying pieces that can be matched exactly, and replacing (pairs of) identical pieces with smaller equivalent counterparts.
Forest Decompositions and Piece Matchings.
Following [14], we say that a piece of a forest $F$ is a subforest – a balanced fragment $F(i..j]$ – or a context – a pair of fragments $F(i..h]$ and $F(h'..j]$ such that $F(i..h] \cdot F(h'..j]$ is a tree and $F(h..h']$ is balanced. We denote the set of pieces contained in a fragment $F(i..j]$ of $F$ by $\mathsf{P}(F(i..j])$; we set $\mathsf{P}(F) = \mathsf{P}(F(0..|F|])$.
In isolation from $F$, a context can be interpreted as a pair $C = (L, R)$ of non-empty strings such that $L \cdot R$ is a tree. The composition of contexts $C = (L, R)$ and $C' = (L', R')$ results in a context $C \circ C' = (L \cdot L', R' \cdot R)$. Moreover, the composition of a context $C = (L, R)$ and a forest $D$ results in a tree $C \circ D = L \cdot D \cdot R$. For any decomposition of a forest $F$ into disjoint pieces, one can recover $F$ using the concatenation and composition operations.
We define the depth of a context $C = (L, R)$ to be the number of nodes with the opening parenthesis in $L$ and the closing parenthesis in $R$. Note that the depth of the context $C \circ C'$ is equal to the sum of the depths of $C$ and $C'$.
The following notion formalizes the concept of a matching between pieces of $F$ and $G$.
Definition 5.2.
For two forests $F$ and $G$ and a fixed threshold $k$, a piece matching between $F$ and $G$ is a set $\mathcal{M} \subseteq \mathsf{P}(F) \times \mathsf{P}(G)$ of pairs such that:
-
across all pairs $(P, P') \in \mathcal{M}$, the pieces are pairwise disjoint, and
-
there exists a forest alignment of width at most $2k$ (i.e., $\mathcal{A} \in \mathsf{TA}_k(F, G)$) that matches $P$ to $P'$ perfectly for every $(P, P') \in \mathcal{M}$.
The kernelization algorithm behind Theorem 1.3 repeatedly identifies a piece matching of size $O(k)$ covering a constant fraction of the vertices of $F$ and $G$ and replaces each pair of matched pieces with a pair of “equivalent” pieces of size $O(k^4)$. After $O(\log n)$ steps, this yields forests of size $O(k^5)$. Our strategy relies on the following new result:
Theorem 5.3.
There exists a linear-time algorithm that, given forests $F, G \in \mathcal{F}_\Sigma$ and a threshold $k \ge 1$, either certifies that $\mathsf{ted}(F, G) > k$ or constructs a size-$O(k)$ piece matching between $F$ and $G$ that leaves $O(k^4)$ unmatched characters.
The proof of Theorem 5.3, presented in the full version [27], reuses a subroutine of [14] to construct a decomposition of $F$ into pieces of size $O(k^3)$ each. The next step is to build a piece matching that leaves at most $O(k)$ pieces of $F$ unmatched. We modify a dynamic-programming procedure from [14] so that, additionally, the unmatched characters of $G$ form $O(k)$ fragments. As a result, even though the obtained matching is of size $O(n/k^3)$, it is possible to reduce its size to $O(k)$. For this, it suffices to repeatedly identify pairs of adjacent pieces matched to adjacent pieces, and then replace these pieces with their unions (as formalized in the full version [27], two disjoint pieces are adjacent if their union, consisting of the characters contained in at least one of these pieces, can be interpreted as a piece).
Periodic Blocks and Red Characters.
The main ingredient of our kernelization algorithm is a procedure that replaces a pair of matched pieces with a pair of “equivalent” smaller pieces. Unlike [14], where the goal was to reduce the piece size to $O(k^4)$, we aim to identify disjoint free blocks with $O(k^3)$ non-free nodes within each replacement piece. Unfortunately, free blocks lack a canonical construction, and it would be tedious to maintain a specific selection while the forests change. Instead, we use the following notions:
Definition 5.4.
For a fixed threshold $k$, we say that a fragment $F(i..j]$ of a string $F$ forms a periodic block if it satisfies the following properties:
-
$j - i \ge 2k$, that is, the fragment is of length at least $2k$, and
-
$F(i..j]$ has a string period of length at most $k$ with equally many opening and closing parentheses.
For a string $F$, we denote by $\mathsf{PB}(F)$ the set of fragments of $F$ that are periodic blocks. For a context $C = (L, R)$, we denote by $\mathsf{PB}(C)$ the disjoint union of $\mathsf{PB}(L)$ and $\mathsf{PB}(R)$.
We partition the characters of $F$ into black and red based on the family $\mathsf{PB}(F)$.
Definition 5.5.
A character $F[x]$ of a string $F$ is black if there exists a periodic block $F(i..j] \in \mathsf{PB}(F)$ such that $x \in (i..j]$. The remaining characters are red. We denote by $\mathsf{B}(F)$ and $\mathsf{R}(F)$ the sets of black and red characters of $F$.
For a context $C = (L, R)$, the sets $\mathsf{B}(C)$ and $\mathsf{R}(C)$ are defined as the disjoint unions $\mathsf{B}(L) \cup \mathsf{B}(R)$ and $\mathsf{R}(L) \cup \mathsf{R}(R)$.
Piece Reduction.
The following definitions in [14] formalize the concept of equivalent pieces.
Definition 5.6 ([14, Definition 3.4]).
For a threshold $k$ and a weight function $w$, forests $P$ and $P'$ are called $(k, w)$-equivalent if
$\mathsf{ted}^w_{\le k}(F, G) = \mathsf{ted}^w_{\le k}(F', G)$
holds for all forests $F$, $F'$, and $G$ with matching pieces satisfying the following: $F'$ is obtained from $F$ by replacing an occurrence of the piece $P$ with $P'$, and some piece matching between $F$ and $G$ matches this occurrence of $P$ to a subforest of $G$.
Definition 5.7 ([14, Definition 3.9]).
For a threshold $k$ and a weight function $w$, contexts $C$ and $C'$ are called $(k, w)$-equivalent if
$\mathsf{ted}^w_{\le k}(F, G) = \mathsf{ted}^w_{\le k}(F', G)$
holds for all forests $F$, $F'$, and $G$ with matching pieces satisfying the following: $F'$ is obtained from $F$ by replacing an occurrence of the piece $C$ with $C'$, some piece matching between $F$ and $G$ matches this occurrence of $C$ to a context of $G$, and the depths of $C$ and $C'$ coincide.
Lemma 5.8 (see [14, Lemma 3.17]).
There is a linear-time algorithm that, given a forest $P$ and a threshold $k$, computes a forest $P'$ with $|P'| = O(k^4)$ and $|\mathsf{R}(P')| = O(k^3)$ such that $P$ and $P'$ are $(k, w)$-equivalent for every normalized quasimetric weight function $w$.
Lemma 5.9 (see [14, Lemma 3.18]).
There is a linear-time algorithm that, given a context $C$ and a threshold $k$, computes a context $C'$ with $|C'| = O(k^4)$ and $|\mathsf{R}(C')| = O(k^3)$ such that $C$ and $C'$ are $(k, w)$-equivalent for every normalized quasimetric weight function $w$.
Complete Kernelization Algorithm.
In the full version [27], we combine the above ingredients to formally prove Theorem 5.1. Our procedure first applies Theorem 5.3. Then, for every pair of matched pieces, we use Lemma 5.8 or 5.9 (depending on the piece type) to obtain an equivalent piece with $O(k^3)$ red characters. The remaining (black) characters can be traced back to periodic blocks, and each periodic block within the replacement piece yields a free block in the output forest $F'$ that covers the underlying black characters.
6 Summary
Theorem 1.1. [Restated, see original statement.]
There exists a deterministic algorithm that, given two forests and with nodes in total, each with a label from an alphabet , and oracle access to a normalized weight function satisfying the triangle inequality, determines the tree edit distance in time.
Proof.
First, suppose that the task is to compute for a given threshold . In this case, we use the algorithm of Theorem 5.1, resulting in a pair of forests of size such that , as well as a collection of free blocks in with non-free characters. Based on this, the algorithm of Theorem 4.2 computes in time. Including the running time of Theorem 5.1, we get total time.
In the absence of a given threshold, we consider a geometric sequence of thresholds , with , and we compute for subsequent until holds for some , which indicates .
Since , the initial iteration costs time. Consequently, if , then the whole algorithm runs in time.
Due to , the running time of the th iteration is . This sequence grows geometrically, so the total running time of the algorithm is dominated by the running time of the last iteration, which is . If , then and, since the algorithm has not terminated one iteration earlier, . Consequently, the overall running time is when .
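The exponential search over thresholds can be sketched as follows. Here `bounded_ted(k)` is a hypothetical oracle for the capped computation of the first paragraph, returning the exact distance when it is at most the threshold `k` and `None` otherwise, and `k0` stands in for the initial threshold:

```python
def ted_via_doubling(bounded_ted, k0=1):
    # Geometric sequence of thresholds: call the bounded-distance
    # oracle with k, 2k, 4k, ... until it certifies that the true
    # distance does not exceed the current threshold. Since the cost
    # per call grows geometrically in k, the total running time is
    # dominated by the final call, as argued above.
    k = k0
    while True:
        d = bounded_ted(k)
        if d is not None:
            return d
        k *= 2
```

Because the algorithm did not terminate one iteration earlier, the final threshold exceeds the true distance by at most a constant factor, which is what makes the last call's cost match the claimed bound.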
References
- [1] Shyan Akmal and Ce Jin. Faster algorithms for bounded tree edit distance. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 12:1–12:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.ICALP.2021.12.
- [2] Tatsuya Akutsu. Tree edit distance problems: Algorithms and applications to bioinformatics. IEICE Trans. Inf. Syst., 93-D(2):208–218, 2010. doi:10.1587/TRANSINF.E93.D.208.
- [3] Tatsuya Akutsu, Daiji Fukagawa, and Atsuhiro Takasu. Approximating tree edit distance through string edit distance. Algorithmica, 57(2):325–348, 2010. doi:10.1007/s00453-008-9213-z.
- [4] Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). SIAM J. Comput., 47(3):1087–1097, 2018. doi:10.1137/15M1053128.
- [5] John Bellando and Ravi Kothari. Region-based modeling and tree edit distance as a basis for gesture recognition. In 10th International Conference on Image Analysis and Processing (ICIAP 1999), 27-29 September 1999, Venice, Italy, pages 698–703. IEEE Computer Society, 1999. doi:10.1109/ICIAP.1999.797676.
- [6] Philip Bille. A survey on tree edit distance and related problems. Theor. Comput. Sci., 337(1–3):217–239, June 2005. doi:10.1016/j.tcs.2004.12.030.
- [7] Mahdi Boroujeni, Mohammad Ghodsi, MohammadTaghi Hajiaghayi, and Saeed Seddighin. 1+ε approximation of tree edit distance in quadratic time. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 709–720. ACM, 2019. doi:10.1145/3313276.3316388.
- [8] Karl Bringmann, Paweł Gawrychowski, Shay Mozes, and Oren Weimann. Tree edit distance cannot be computed in strongly subcubic time (unless APSP can). ACM Trans. Algorithms, 16(4), July 2020. doi:10.1145/3381878.
- [9] Peter Buneman, Martin Grohe, and Christoph Koch. Path queries on compressed XML. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, VLDB ’03, pages 141–152. VLDB Endowment, 2003. doi:10.1016/b978-012722442-8/50021-5.
- [10] Alejandro Cassis, Tomasz Kociumaka, and Philip Wellnitz. Optimal algorithms for bounded weighted edit distance. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 2177–2187. IEEE, 2023. doi:10.1109/FOCS57990.2023.00135.
- [11] Panagiotis Charalampopoulos, Pawel Gawrychowski, Shay Mozes, and Oren Weimann. On the hardness of computing the edit distance of shallow trees. In Diego Arroyuelo and Barbara Poblete, editors, String Processing and Information Retrieval - 29th International Symposium, SPIRE 2022, Concepción, Chile, November 8-10, 2022, Proceedings, volume 13617 of Lecture Notes in Computer Science, pages 290–302. Springer, 2022. doi:10.1007/978-3-031-20643-6_21.
- [12] Sudarshan S. Chawathe. Comparing hierarchical data in external memory. In Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, and Michael L. Brodie, editors, VLDB’99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, pages 90–101. Morgan Kaufmann, 1999. URL: http://www.vldb.org/conf/1999/P8.pdf.
- [13] Gregory Cobena, Serge Abiteboul, and Amélie Marian. Detecting changes in XML documents. In Rakesh Agrawal and Klaus R. Dittrich, editors, Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, USA, February 26 - March 1, 2002, pages 41–52. IEEE Computer Society, 2002. doi:10.1109/ICDE.2002.994696.
- [14] Debarati Das, Jacob Gilbert, MohammadTaghi Hajiaghayi, Tomasz Kociumaka, and Barna Saha. Weighted edit distance computation: Strings, trees, and Dyck. In Barna Saha and Rocco A. Servedio, editors, Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, Orlando, FL, USA, June 20-23, 2023, pages 377–390. ACM, 2023. doi:10.1145/3564246.3585178.
- [15] Debarati Das, Jacob Gilbert, MohammadTaghi Hajiaghayi, Tomasz Kociumaka, Barna Saha, and Hamed Saleh. Õ(n + poly(k))-time algorithm for bounded tree edit distance. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2022. IEEE, 2022. doi:10.1109/FOCS54457.2022.00071.
- [16] Davi de Castro Reis, Paulo Braz Golgher, Altigran Soares da Silva, and Alberto H. F. Laender. Automatic web news extraction using tree edit distance. In Stuart I. Feldman, Mike Uretsky, Marc Najork, and Craig E. Wills, editors, Proceedings of the 13th international conference on World Wide Web, WWW 2004, New York, NY, USA, May 17-20, 2004, pages 502–511. ACM, 2004. doi:10.1145/988672.988740.
- [17] Erik D. Demaine, Shay Mozes, Benjamin Rossman, and Oren Weimann. An optimal decomposition algorithm for tree edit distance. ACM Trans. Algorithms, 6(1), December 2010. doi:10.1145/1644015.1644017.
- [18] Anita Dürr. Improved bounds for rectangular monotone min-plus product and applications. Inf. Process. Lett., 181:106358, 2023. doi:10.1016/J.IPL.2023.106358.
- [19] Dayi Fan, Rubao Lee, and Xiaodong Zhang. X-TED: massive parallelization of tree edit distance. Proc. VLDB Endow., 17(7):1683–1696, 2024. doi:10.14778/3654621.3654634.
- [20] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. J. ACM, 57(1), November 2009. doi:10.1145/1613676.1613680.
- [21] Dvir Fried, Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat, and Tatiana Starikovskaya. An improved algorithm for the k-dyck edit distance problem. ACM Trans. Algorithms, 20(3):26, 2024. doi:10.1145/3627539.
- [22] Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, USA, 1997. doi:10.1017/cbo9780511574931.
- [23] Matthias Höchsmann, Thomas Töller, Robert Giegerich, and Stefan Kurtz. Local similarity in RNA secondary structures. In 2nd IEEE Computer Society Bioinformatics Conference, CSB 2003, Stanford, CA, USA, August 11-14, 2003, pages 159–168. IEEE Computer Society, 2003. doi:10.1109/CSB.2003.1227315.
- [24] Philip N. Klein. Computing the edit-distance between unrooted ordered trees. In Proceedings of the 6th Annual European Symposium on Algorithms, ESA ’98, pages 91–102, Berlin, Heidelberg, 1998. Springer-Verlag. doi:10.1007/3-540-68530-8_8.
- [25] Philip N. Klein, Thomas B. Sebastian, and Benjamin B. Kimia. Shape matching using edit-distance: an implementation. In S. Rao Kosaraju, editor, Proceedings of the Twelfth Annual Symposium on Discrete Algorithms, January 7-9, 2001, Washington, DC, USA, pages 781–790. ACM/SIAM, 2001. URL: http://dl.acm.org/citation.cfm?id=365411.365779.
- [26] Philip N. Klein, Srikanta Tirthapura, Daniel Sharvit, and Benjamin B. Kimia. A tree-edit-distance algorithm for comparing simple, closed shapes. In David B. Shmoys, editor, Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, January 9-11, 2000, San Francisco, CA, USA, pages 696–704. ACM/SIAM, 2000. URL: http://dl.acm.org/citation.cfm?id=338219.338628.
- [27] Tomasz Kociumaka and Ali Shahali. Faster algorithm for bounded tree edit distance in the low-distance regime, 2025. doi:10.48550/arXiv.2507.02701.
- [28] Gad M. Landau and Uzi Vishkin. Fast string matching with differences. Journal of Computer and System Sciences, 37(1):63–78, 1988. doi:10.1016/0022-0000(88)90045-1.
- [29] Xiao Mao. Breaking the cubic barrier for (unweighted) tree edit distance. In 62nd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2021, Denver, CO, USA, February 7-10, 2022, pages 792–803. IEEE, 2021. doi:10.1109/FOCS52979.2021.00082.
- [30] J. Ian Munro and Venkatesh Raman. Succinct representation of balanced parentheses and static trees. SIAM J. Comput., 31(3):762–776, 2001. doi:10.1137/S0097539799364092.
- [31] Eugene W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1(2):251–266, 1986. doi:10.1007/BF01840446.
- [32] Gonzalo Navarro and Kunihiko Sadakane. Fully functional static and dynamic succinct trees. ACM Trans. Algorithms, 10(3):16:1–16:39, 2014. doi:10.1145/2601073.
- [33] Jakob Nogler, Adam Polak, Barna Saha, Virginia Vassilevska Williams, Yinzhan Xu, and Christopher Ye. Faster weighted and unweighted tree edit distance and APSP equivalence. In 57th Annual ACM Symposium on Theory of Computing, STOC 2025, pages 2167–2178. ACM, 2025. doi:10.1145/3717823.3718116.
- [34] Mateusz Pawlik and Nikolaus Augsten. Efficient computation of the tree edit distance. ACM Trans. Database Syst., 40(1):3:1–3:40, 2015. doi:10.1145/2699485.
- [35] Mateusz Pawlik and Nikolaus Augsten. Tree edit distance: Robust and memory-efficient. Inf. Syst., 56:157–173, 2016. doi:10.1016/J.IS.2015.08.004.
- [36] Thomas B. Sebastian, Philip N. Klein, and Benjamin B. Kimia. Recognition of shapes by editing their shock graphs. IEEE Trans. Pattern Anal. Mach. Intell., 26(5):550–571, 2004. doi:10.1109/TPAMI.2004.1273924.
- [37] Masoud Seddighin and Saeed Seddighin. 3+ε approximation of tree edit distance in truly subquadratic time. In 13th Innovations in Theoretical Computer Science Conference, ITCS 2022, volume 215, pages 115:1–115:22, 2022. doi:10.4230/LIPIcs.ITCS.2022.115.
- [38] Stanley M. Selkow. The tree-to-tree editing problem. Information Processing Letters, 6(6):184–186, 1977. doi:10.1016/0020-0190(77)90064-3.
- [39] Bruce A. Shapiro and Kaizhong Zhang. Comparing multiple RNA secondary structures using tree comparisons. Comput. Appl. Biosci., 6(4):309–318, 1990. doi:10.1093/bioinformatics/6.4.309.
- [40] Raghavendra Sridharamurthy, Talha Bin Masood, Adhitya Kamakshidasan, and Vijay Natarajan. Edit distance between merge trees. IEEE Trans. Vis. Comput. Graph., 26(3):1518–1531, 2020. doi:10.1109/TVCG.2018.2873612.
- [41] Kuo-Chung Tai. The tree-to-tree correction problem. J. ACM, 26(3):422–433, July 1979. doi:10.1145/322139.322143.
- [42] Hélène Touzet. A linear tree edit distance algorithm for similar ordered trees. In Proceedings of the 16th Annual Conference on Combinatorial Pattern Matching, CPM’05, pages 334–345, Berlin, Heidelberg, 2005. Springer-Verlag. doi:10.1007/11496656_29.
- [43] Esko Ukkonen. Algorithms for approximate string matching. Information and Control, 64(1):100–118, 1985. Special issue: International Conference on Foundations of Computation Theory. doi:10.1016/S0019-9958(85)80046-2.
- [44] Yuan Wang, David J. DeWitt, and Jin-yi Cai. X-diff: An effective change detection algorithm for XML documents. In Umeshwar Dayal, Krithi Ramamritham, and T. M. Vijayaraman, editors, Proceedings of the 19th International Conference on Data Engineering, March 5-8, 2003, Bangalore, India, pages 519–530. IEEE Computer Society, 2003. doi:10.1109/ICDE.2003.1260818.
- [45] Michael S. Waterman. Introduction to computational biology - maps, sequences, and genomes: interdisciplinary statistics. CRC Press, 1995.
- [46] R. Ryan Williams. Faster all-pairs shortest paths via circuit complexity. SIAM J. Comput., 47(5):1965–1985, 2018. doi:10.1137/15M1024524.
- [47] Yoshiyuki Yamamoto, Kouichi Hirata, and Tetsuji Kuboyama. Tractable and intractable variations of unordered tree edit distance. Int. J. Found. Comput. Sci., 25(3):307–330, 2014. doi:10.1142/S0129054114500154.
- [48] Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, and Peter Clark. Answer extraction as sequence tagging with tree edit distance. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff, editors, Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, pages 858–867. The Association for Computational Linguistics, 2013. URL: https://aclanthology.org/N13-1106/.
- [49] Kaizhong Zhang and Dennis E. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput., 18(6):1245–1262, 1989. doi:10.1137/0218082.
