Bounded Weighted Edit Distance: Dynamic Algorithms and Matching Lower Bounds
Itai Boneh, Egor Gorbachev, and Tomasz Kociumaka
Abstract
The edit distance $\mathsf{ed}(X,Y)$ of two strings $X$ and $Y$ is the minimum number of character edits (insertions, deletions, and substitutions) needed to transform $X$ into $Y$. Its weighted counterpart $\mathsf{ed}^w(X,Y)$ minimizes the total cost of edits, where the costs of individual edits, depending on the edit type and the characters involved, are specified using a function $w$, normalized so that each edit costs at least one. The textbook dynamic-programming procedure, given strings $X,Y\in\Sigma^{\leq n}$ and oracle access to $w$, computes $\mathsf{ed}^w(X,Y)$ in $O(n^2)$ time. Nevertheless, one can achieve better running times if the computed distance, denoted $k$, is small: $O(n+k^2)$ for unit weights [Landau and Vishkin; JCSS'88] and $\tilde{O}(n+\sqrt{nk^3})$ for arbitrary weights [Cassis, Kociumaka, Wellnitz; FOCS'23]. (Henceforth, the $\tilde{O}(\cdot)$ notation hides factors poly-logarithmic in $n$.)
In this paper, we study the dynamic version of the weighted edit distance problem, where the goal is to maintain $\mathsf{ed}^w(X,Y)$ for strings $X,Y\in\Sigma^{\leq n}$ that change over time, with each update specified as an edit in $X$ or $Y$. Very recently, Gorbachev and Kociumaka [STOC'25] showed that the unweighted distance $\mathsf{ed}(X,Y)$ can be maintained in $\tilde{O}(k)$ time per update after $\tilde{O}(n)$-time preprocessing; here, $k$ denotes the current value of $\mathsf{ed}(X,Y)$. Their algorithm generalizes to small integer weights, but the underlying approach is incompatible with large weights.
Our main result is a dynamic algorithm that maintains $\mathsf{ed}^w_{\leq W}(X,Y)$ in $\tilde{O}(W^3/x)$ time per update after $\tilde{O}(nx)$-time preprocessing. Here, $x\in[1,W]$ is a real trade-off parameter and $W\geq 1$ is an integer threshold fixed at preprocessing time, with $\infty$ returned whenever $\mathsf{ed}^w(X,Y)>W$. We complement our algorithm with conditional lower bounds showing fine-grained optimality of our trade-off for $\sqrt{n}\leq W\leq n$ and justifying our choice to fix $W$.
We also generalize our solution to a much more robust setting while preserving the fine-grained optimal trade-off. Our full algorithm maintains $X$ subject not only to character edits but also substring deletions and copy-pastes, each supported in $\tilde{O}(W^2)$ time. Instead of dynamically maintaining $\mathsf{ed}^w(X,Y)$, it answers queries that, given any string $Y$ specified through a sequence of $O(W)$ arbitrary edits transforming $X$ into $Y$, in $\tilde{O}(W^3/x)$ time compute $\mathsf{ed}^w(X,Y)$ or report that $\mathsf{ed}^w(X,Y)>W$.
Keywords and phrases:
Edit distance, dynamic algorithms, conditional lower bounds
Funding:
Itai Boneh: supported by Israel Science Foundation grant 810/21.
Copyright and License:
© Itai Boneh, Egor Gorbachev, and Tomasz Kociumaka; licensed under Creative Commons License CC-BY 4.0.
2012 ACM Subject Classification:
Theory of computation → Pattern matching
Editors:
Anne Benoit, Haim Kaplan, Sebastian Wild, and Grzegorz Herman
Series and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Among the most fundamental string processing problems is the task of deciding whether two strings are similar to each other. A classic measure of string (dis)similarity is the edit distance, also known as the Levenshtein distance [22]. The (unit-cost) edit distance $\mathsf{ed}(X,Y)$ of two strings $X$ and $Y$ is defined as the minimum number of character edits (insertions, deletions, and substitutions) needed to transform $X$ into $Y$. In typical use cases, edits model naturally occurring modifications, such as typographical errors in natural-language texts or mutations in biological sequences. Some of these changes occur more frequently than others, which motivated introducing weighted edit distance already in the 1970s [33]. In this setting, each edit has a cost that may depend on its type and the characters involved (but nothing else), and the goal is to minimize the total cost of edits rather than their quantity. The costs of individual edits can be specified through a weight function $w:\bar{\Sigma}\times\bar{\Sigma}\to[0,\infty)$, where $\bar{\Sigma}=\Sigma\cup\{\varepsilon\}$ denotes the alphabet $\Sigma$ extended with a symbol $\varepsilon$ representing the lack of a character. For $a,b\in\Sigma$, the cost of inserting $b$ is $w(\varepsilon,b)$, the cost of deleting $a$ is $w(a,\varepsilon)$, and the cost of substituting $a$ for $b$ is $w(a,b)$. (Note that the cost of an edit is independent of the position of the edited character in the string.) Consistently with previous works, we assume that $w$ is normalized: $w(a,a)=0$ holds for every $a\in\bar{\Sigma}$, and $w(a,b)\geq 1$ holds for every pair of distinct $a,b\in\bar{\Sigma}$. The unweighted edit distance constitutes the special case when $w(a,b)=1$ holds for all distinct $a,b\in\bar{\Sigma}$.
The textbook dynamic-programming algorithm [32, 25, 26, 33] takes $O(n^2)$ time to compute the edit distance of two strings of length at most $n$, and some of its original formulations incorporate weights [26, 33, 27]. Unfortunately, there is little hope for much faster solutions: for any constant $\delta>0$, an $O(n^{2-\delta})$-time algorithm would violate the Orthogonal Vectors Hypothesis [1, 7, 2, 5], and hence the Strong Exponential Time Hypothesis [16, 17]. The lower bound holds already for unit weights and many other fixed weight functions [7].
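To make the baseline concrete, the following minimal sketch implements the textbook recurrence; the function name and the convention of encoding $\varepsilon$ as `None` are our illustrative assumptions, not the paper's notation.

```python
def weighted_edit_distance(x: str, y: str, w) -> float:
    """Textbook O(|x|*|y|) DP; w(a, b) is the edit cost, with None standing
    in for the empty symbol (insertion cost w(None, b), deletion w(a, None))."""
    n, m = len(x), len(y)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + w(x[i - 1], None)        # delete x[i-1]
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + w(None, y[j - 1])        # insert y[j-1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + w(x[i - 1], None),            # deletion
                dist[i][j - 1] + w(None, y[j - 1]),            # insertion
                dist[i - 1][j - 1] + w(x[i - 1], y[j - 1]),    # subst./match
            )
    return dist[n][m]
```

For the unit weight function `lambda a, b: 0.0 if a == b else 1.0`, the call on `"kitten"` and `"sitting"` returns `3.0`; a normalized weight function makes the diagonal step free exactly on matches.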
Bounded Edit Distance.
On the positive side, the quadratic running time can be improved when the computed distance is small. For unweighted edit distance, the algorithm by Landau and Vishkin [21], building upon the ideas of [30, 24], computes $k=\mathsf{ed}(X,Y)$ in $O(n+k^2)$ time for any two strings $X,Y\in\Sigma^{\leq n}$. This running time is fine-grained optimal: for any $\delta>0$, a hypothetical $O(n+k^{2-\delta})$-time algorithm, even restricted to instances satisfying $k=\Theta(n^{\kappa})$ for some constant $\kappa\in(0,1]$, would violate the Orthogonal Vectors Hypothesis.
As far as the bounded weighted edit distance problem is concerned, a simple optimization by Ukkonen [30] improves the quadratic time complexity of the classic dynamic-programming algorithm to $O(nk)$, where $k=\mathsf{ed}^w(X,Y)$. Recently, Das, Gilbert, Hajiaghayi, Kociumaka, and Saha [12] developed an $O(n+k^5)$-time solution and, soon afterwards, Cassis, Kociumaka, and Wellnitz [8] presented an $\tilde{O}(n+\sqrt{nk^3})$-time algorithm. Due to the necessity to read the entire input, the latter running time is optimal for $k\leq n^{1/3}$. More surprisingly, there is a tight conditional lower bound for $k\geq\sqrt{n}$ [8], valid already for edit costs in the real interval $[1,2]$. For any constants $\delta>0$ and $\alpha\in[1/2,1]$, an $O((\sqrt{nk^3})^{1-\delta})$-time algorithm restricted to instances satisfying $k=\Theta(n^{\alpha})$ would violate the All-Pairs Shortest Paths Hypothesis [31]. The optimal complexity remains unknown for $n^{1/3}\ll k\ll\sqrt{n}$, where the upper bound is $\tilde{O}(\sqrt{nk^3})$, yet the lower-bound construction allows for an $\tilde{O}(n)$-time algorithm.
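The band optimization is easy to visualize in code: with a normalized weight function, any alignment of cost at most $k$ stays within $k$ cells of the main diagonal, so a band of $O(k)$ cells per row suffices. A hedged sketch (our naming; the threshold $k$ is assumed to be given, e.g., found by exponential search):

```python
def bounded_weighted_edit_distance(x, y, w, k):
    """Ukkonen-style band DP: returns ed^w(x, y) if it is at most k, else None.
    Runs in O((|x| + |y|) * k) time, since only cells with |i - j| <= k matter."""
    INF = float("inf")
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return None  # even unit-cost edits cannot bridge the length gap
    band = lambda i: range(max(0, i - k), min(m, i + k) + 1)
    prev, acc = {}, 0.0
    for j in band(0):                      # row 0: cost of inserting y[0..j)
        prev[j] = acc
        if j < m:
            acc += w(None, y[j])
    for i in range(1, n + 1):
        cur = {}
        for j in band(i):
            best = INF
            if j in prev:                  # deletion of x[i-1]
                best = min(best, prev[j] + w(x[i - 1], None))
            if j - 1 in cur:               # insertion of y[j-1]
                best = min(best, cur[j - 1] + w(None, y[j - 1]))
            if j - 1 in prev:              # substitution or free match
                best = min(best, prev[j - 1] + w(x[i - 1], y[j - 1]))
            cur[j] = best
        prev = cur
    return prev[m] if prev.get(m, INF) <= k else None
```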
Dynamic Edit Distance.
Over the last decades, the edit distance problem has been studied in numerous variants, including dynamic ones, where the input strings change over time. The most popular dynamic (weighted) edit distance formulation is the following one [9, 15].
Problem 1.1 (Dynamic Weighted Edit Distance).
Given an integer $n\geq 1$ and oracle access to a normalized weight function $w:\bar{\Sigma}\times\bar{\Sigma}\to[0,\infty)$, maintain strings $X,Y\in\Sigma^{\leq n}$ subject to updates (character edits in $X$ and $Y$) and report $\mathsf{ed}^w(X,Y)$ after each update. (For clarity, the introduction focuses on the version of the problem where the value $\mathsf{ed}^w(X,Y)$ is reported after every update. In Theorem 1.7, we address a more general setting where queries may occur less frequently, allowing us to distinguish between update time and query time.)
The time complexity of Problem 1.1 is already well understood if it is measured solely in terms of the string length $n$. In the case of unit weights, the algorithm of Charalampopoulos, Kociumaka, and Mozes [9] (building upon the techniques of Tiskin [28]) is characterized by $\tilde{O}(n)$ update time and $\tilde{O}(n^2)$ preprocessing time; any polynomial-factor improvement would violate the lower bound for the static edit distance problem. As far as arbitrary weights are concerned, the approach presented in [9, Section 4] achieves $\tilde{O}(n^{1.5})$ update time after $\tilde{O}(n^2)$-time preprocessing, and there is a matching fine-grained lower bound [8, Section 6] conditioned on the All-Pairs Shortest Paths Hypothesis.
In this work, we aim to understand Problem 1.1 for small distances:
How fast can one dynamically maintain $\mathsf{ed}^w(X,Y)$ when this value is small?
Very recently, Gorbachev and Kociumaka [15] settled the parameterized complexity of Problem 1.1 in the unweighted case. Their solution achieves $\tilde{O}(n)$ preprocessing time and $\tilde{O}(k)$ update time; improving either complexity by a polynomial factor, even for instances satisfying $k=\Theta(n^{\kappa})$ for some constant $\kappa\in(0,1)$, would violate the Orthogonal Vectors Hypothesis.
Interestingly, the approach of [15] builds upon the static procedure for bounded weighted edit distance [8] and, as a result, the dynamic algorithm seamlessly supports small integer weights, achieving $\tilde{O}(k)$ update time when the edit costs are integers bounded by a constant. Unfortunately, [15] does not provide any improvements for arbitrary weights; in that case, the results of [8] yield a solution to Problem 1.1 with $\tilde{O}(n)$ preprocessing time and $\tilde{O}(n+\sqrt{nk^3})$ update time. This is far from satisfactory: for $k\geq n^{1/3}$, the algorithm simply recomputes $\mathsf{ed}^w(X,Y)$ from scratch after every update, and for $k\leq n^{1/3}$, it does not store (and reuse) anything beyond a dynamic data structure for checking the equality between fragments of $X$ and $Y$.
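For concreteness, the recompute-from-scratch baseline can be sketched as follows, with the band DP from the previous snippet standing in for the $\tilde{O}(n+\sqrt{nk^3})$-time static algorithm of [8]:

```python
class NaiveDynamicEditDistance:
    """Recompute-from-scratch baseline for Problem 1.1 (illustration only)."""

    def __init__(self, x, y, w, cap):
        self.x, self.y, self.w, self.cap = list(x), list(y), w, cap

    def update(self, which, op, pos, ch=None):
        s = self.x if which == "x" else self.y
        if op == "insert":
            s.insert(pos, ch)
        elif op == "delete":
            del s[pos]
        else:                                   # substitution
            s[pos] = ch
        # Nothing is reused across updates beyond the strings themselves.
        return bounded_weighted_edit_distance(self.x, self.y, self.w, self.cap)
```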
Our Results: Lower Bounds
The reason why the approach of [15] is incompatible with large weights, e.g., weights of order $n^{\Theta(1)}$, is that their presence allows a single update to drastically change $\mathsf{ed}^w(X,Y)$. Our first result strengthens the lower bound of [8] and shows that the $\tilde{O}(n+\sqrt{nk^3})$-time static algorithm is fine-grained optimal already for instances that can be obtained from a pair of equal strings using four edits. The ideas behind the following theorem are presented in Section 5.
Theorem 1.2.
Let $\alpha\in(1/2,1]$ and $\delta>0$ be real parameters. Assuming the APSP Hypothesis, there is no algorithm that, given two strings $X,Y\in\Sigma^{\leq n}$ satisfying $\mathsf{ed}(X,Y)\leq 4$, a threshold $W=\Theta(n^{\alpha})$, and oracle access to a normalized weight function $w:\bar{\Sigma}\times\bar{\Sigma}\to[0,\infty)$, in $O((\sqrt{nW^3})^{1-\delta})$ time decides whether $\mathsf{ed}^w(X,Y)\leq W$.
As a simple corollary, we conclude that significantly improving upon the naïve update time of $\tilde{O}(n+\sqrt{nW^3})$ requires large preprocessing time.
Corollary 1.3.
Suppose that Problem 1.1 admits a solution with preprocessing time $P(n)$ and update time $U(n,k)$ for some non-decreasing functions $P$ and $U$. If $P(n)=O((\sqrt{nW^3})^{1-\delta})$ and $U(n,W)=O((\sqrt{nW^3})^{1-\delta})$ hold for some real parameters $W=\Theta(n^{\alpha})$ with $\alpha\in(1/2,1]$ and $\delta>0$, then the APSP Hypothesis fails.
Proof idea.
For an instance $(X,Y,W,w)$ of the static problem of Theorem 1.2, we initialize the dynamic algorithm with $(Y,Y)$ and transform one copy of $Y$ into $X$ using four updates. The total running time $P(n)+4\,U(n,W)$ then undercuts the static lower bound of Theorem 1.2.
As discussed in the full version of the paper [6], instead of initializing the dynamic algorithm with $(Y,Y)$, we can initialize it with a pair of empty strings and gradually transform this instance to $(X,Y)$ using $O(n)$ updates while keeping the weighted edit distance between the two maintained strings at $O(W)$ at all times. Hence, as long as the preprocessing of a pair of empty strings takes $O(1)$ time and the algorithm is not allowed to access the entire weight function (e.g., weights are revealed online when an update introduces a fresh character for the first time), we can replace $P(n)$ with $n\cdot U(n,W)$ in the statement of Corollary 1.3.
Overall, we conclude that, in order to support updates with relatively large distances faster than naively (i.e., faster than computing $\mathsf{ed}^w(X,Y)$ from scratch using a static algorithm), one needs to pay a lot for both preprocessing and updates while $\mathsf{ed}^w(X,Y)$ is still very small. In particular, we should decide in advance how large distances need to be supported efficiently. This justifies a simplified variant of Problem 1.1, where the threshold parameter $W$ is fixed at preprocessing time. Here, instead of $\mathsf{ed}^w(X,Y)$, the algorithm reports the capped distance $\mathsf{ed}^w_{\leq W}(X,Y)$, defined as $\mathsf{ed}^w(X,Y)$ if $\mathsf{ed}^w(X,Y)\leq W$ and as $\infty$ otherwise:
Problem 1.4 (Dynamic Weighted Edit Distance with Fixed Threshold).
Given integers $n\geq W\geq 1$ and oracle access to a normalized weight function $w:\bar{\Sigma}\times\bar{\Sigma}\to[0,\infty)$, maintain strings $X,Y\in\Sigma^{\leq n}$ subject to updates (character edits in $X$ and $Y$) and report $\mathsf{ed}^w_{\leq W}(X,Y)$ upon each update.
Corollary 1.3 can be rephrased in terms of Problem 1.4, but its major limitation is that it only applies to preprocessing time $P(n)=O((\sqrt{nW^3})^{1-\delta})$. To overcome this issue, we compose multiple hard instances from Theorem 1.2. This results in the following theorem, discussed in Section 5.
Theorem 1.5.
Suppose that Problem 1.4 admits a solution with preprocessing time $P(n,W)$ and update time $U(n,W)$ for some non-decreasing functions $P$ and $U$. If $P(n,W)=O((nx)^{1-\delta})$ and $U(n,W)=O((W^3/x)^{1-\delta})$ hold for some parameters $x\in[1,W]$, $W=\Theta(n^{\alpha})$ with $\alpha\in(1/2,1]$, and $\delta>0$, then the APSP Hypothesis fails.
Theorem 1.5 indicates that (conditioned on the APSP Hypothesis and up to subpolynomial-factor improvements), for a dynamic algorithm with preprocessing time $\tilde{O}(nx)$ for $x\in[1,W]$, the best possible update time is $\tilde{O}(W^3/x)$. The lower bound does not extend to $x\leq\sqrt{W^3/n}$; precisely then, the $\tilde{O}(n+\sqrt{nW^3})$-time static algorithm improves upon the $\tilde{O}(W^3/x)$ update time.
Our Results: Algorithms
Our main positive result is an algorithm that achieves the fine-grained-optimal trade-off.
Theorem 1.6.
There exists an algorithm that, initialized with an extra real parameter $x\in[1,W]$, solves Problem 1.4 with preprocessing time $\tilde{O}(n\cdot x)$ and update time $\tilde{O}(W^3/x)$.
As a warm-up, in Section 3, we prove Theorem 1.6 for $x=W$. Similarly to [15], the main challenge is that updates can make $X$ and $Y$ slide with respect to each other. If we interleave insertions at the end of $X$ and deletions at the beginning of $Y$, then, after $\Theta(k)$ updates, for every character $Y[j]$, the fragment of $X$ containing the plausible matches of $Y[j]$ changes completely. The approach of [15] was to restart the algorithm every $\Theta(k)$ updates.
Our preprocessing time is too large for this, so we resort to a trick originating from dynamic edit distance approximation [20]: we maintain a data structure for aligning $X$ with itself. Instead of storing $Y$, we show how to compute $\mathsf{ed}^w_{\leq W}(X,Y)$ for any string $Y$ specified using $O(W)$ edits (of arbitrary costs) transforming $X$ into $Y$. To find such edits in the original setting of Problem 1.4, one can use a dynamic unweighted edit distance algorithm. Already the folklore approach, based on [21, 23], with $\tilde{O}(W^2)$ time per update is sufficiently fast.
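The folklore routine mentioned above is sketched below: the classical Landau–Vishkin wavefronts with a naive longest-common-extension helper. With $O(1)$-time LCE queries (e.g., via a dynamic-strings data structure), the same procedure runs in $O(n+k^2)$ time; as written, it is a plain $O(nk)$ reference implementation.

```python
def lv_bounded(x, y, k):
    """Landau--Vishkin: return ed(x, y) if it is at most k, else None."""
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return None

    def lce(i, j):  # naive extension; O(1) with suffix-tree/hashing support
        s = 0
        while i + s < n and j + s < m and x[i + s] == y[j + s]:
            s += 1
        return s

    front = {0: lce(0, 0)}            # diagonal d = j - i  ->  furthest row i
    if front.get(m - n, -1) >= n:
        return 0
    for t in range(1, k + 1):
        new = {}
        for d in range(-t, t + 1):
            cand = []
            if d in front:
                cand.append(front[d])                  # carry over t-1 edits
                if front[d] < n and front[d] + d < m:
                    cand.append(front[d] + 1)          # substitution
            if d + 1 in front and front[d + 1] < n:
                cand.append(front[d + 1] + 1)          # deletion of x[i]
            if d - 1 in front and front[d - 1] + d <= m:
                cand.append(front[d - 1])              # insertion of y[j]
            if cand:
                i = max(cand)
                new[d] = i + lce(i, i + d)             # slide along matches
        front = new
        if front.get(m - n, -1) >= n:
            return t                                   # reached (n, m)
    return None
```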
In Section 4, we discuss how to generalize our dynamic algorithm to arbitrary $x\in[1,W]$. This is our most difficult result: it requires combining the techniques of [8] and their subsequent adaptations in [15] with several new ideas, including a novel shifted variant of self-edit distance. We derive Theorem 1.6 as a straightforward corollary of the following result, which demonstrates that the same fine-grained optimal trade-off can be achieved in a much more robust setting compared to the one of Problem 1.4.
Theorem 1.7.
There is a dynamic algorithm that, initialized with a real parameter $x\in[1,W]$, integers $n\geq W\geq 1$, and $O(1)$-time oracle access to a normalized weight function $w$, maintains a string $X\in\Sigma^{\leq n}$ and, after $\tilde{O}(nx)$-time preprocessing, supports the following operations (updates and queries):
- Apply a character edit, substring deletion, or copy-paste to $X$ in $\tilde{O}(W^2)$ time.
- Given $O(W)$ edits transforming $X$ into a string $Y$, compute $\mathsf{ed}^w_{\leq W}(X,Y)$ in $\tilde{O}(W^3/x)$ time.
We remark that our query algorithm for $\mathsf{ed}^w_{\leq W}(X,Y)$ extends a subroutine of [11], where the string $X$ is static. In [11], this result is used for approximate pattern matching with respect to weighted edit distance. If the pattern is far from periodic, which is arguably true in most natural instances, the trade-off of Theorem 1.7 yields a faster pattern matching algorithm.
Open Problems
Recall that Theorem 1.5 proves fine-grained optimality of Theorem 1.6 only for $W\geq\sqrt{n}$. The omission of smaller thresholds stems from a gap between algorithms and lower bounds for the underlying static problem. Hence, we reiterate the open question of [8] asking for the complexity of computing $\mathsf{ed}^w_{\leq k}(X,Y)$ when $n^{1/3}\ll k\ll\sqrt{n}$. Since our lower bounds arise from Theorem 1.2, it might be instructive to first try designing faster static algorithms for the case of $\mathsf{ed}(X,Y)=O(1)$. Before this succeeds, one cannot hope for faster dynamic algorithms.
Wider gaps in our understanding of Problem 1.4 remain in the regime of large preprocessing time. It is open to determine the optimal update time for preprocessing time exceeding $\tilde{O}(nW)$ and to decide whether such preprocessing can be beneficial at all. As far as the largest thresholds are concerned, we believe that [9, Section 4] can be used to obtain $\tilde{O}(n^{1.5})$ update time, which improves upon Theorem 1.6 for $W\geq n^{3/4}$. Moreover, by monotonicity, the lower bound of [8, Section 6] excludes update times of the form $O(n^{1.5-\delta})$.
Finally, we remark that all our lower bounds exploit the fact that $\mathsf{ed}^w(X,Y)$ can change a lot as a result of a single update. It would be interesting to see what can be achieved if the weights are small (but fractional) or when the update time is parameterized by the change in $\mathsf{ed}^w(X,Y)$. In these cases, we hope for non-trivial applications of the approach of [15].
2 Preliminaries
In this section, we partially follow the narration of [8, 15].
A string $S$ is a sequence of characters over an alphabet $\Sigma$. We denote the set of all strings over $\Sigma$ by $\Sigma^*$, we use $\varepsilon$ to represent the empty string, and we write $\Sigma^{\leq n}$ for the set of strings of length at most $n$. For a position $i\in[1..|S|]$, we say that $S[i]$ is the $i$-th character of $S$. We say that a string $P$ occurs as a substring of a string $T$ if there exist indices $i,j$ satisfying $1\leq i\leq j+1\leq|T|+1$ such that $P=T[i]\cdots T[j]$. The occurrence of $P$ in $T$ specified by $i$ and $j$ is a fragment of $T$ denoted by $T[i..j]$, $T[i..j+1)$, $T(i-1..j]$, or $T(i-1..j+1)$.
Weighted Edit Distances and Alignment Graphs
Given an alphabet $\Sigma$, we set $\bar{\Sigma}=\Sigma\cup\{\varepsilon\}$. We call $w:\bar{\Sigma}\times\bar{\Sigma}\to[0,\infty)$ a normalized weight function if $w(a,b)\geq 1$ holds for all distinct $a,b\in\bar{\Sigma}$ and, for $a,b\in\bar{\Sigma}$, we have $w(a,b)=0$ if and only if $a=b$. Note that $w$ does not need to satisfy the triangle inequality, nor does $w$ need to be symmetric.
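In code, checking normalization is a one-liner; `None` again encodes $\varepsilon$ (an assumption of this sketch, not the paper's notation):

```python
def is_normalized(w, alphabet):
    """Check w(a, a) == 0 and w(a, b) >= 1 for all distinct a, b (None = eps)."""
    symbols = list(alphabet) + [None]
    return all(
        (w(a, b) == 0 if a == b else w(a, b) >= 1)
        for a in symbols for b in symbols
    )
```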
We interpret weighted edit distances of strings as distances in alignment graphs.
Definition 2.1 (Alignment Graph, [8]).
For strings $X,Y\in\Sigma^*$ and a normalized weight function $w$, we define the alignment graph $G^w(X,Y)$ to be a weighted directed graph with vertices $[0..|X|]\times[0..|Y|]$ and the following edges:
- a horizontal edge $(x,y)\to(x+1,y)$ of cost $w(X[x+1],\varepsilon)$ representing a deletion of $X[x+1]$, for every $x\in[0..|X|)$ and $y\in[0..|Y|]$,
- a vertical edge $(x,y)\to(x,y+1)$ of cost $w(\varepsilon,Y[y+1])$ representing an insertion of $Y[y+1]$, for every $x\in[0..|X|]$ and $y\in[0..|Y|)$, and
- a diagonal edge $(x,y)\to(x+1,y+1)$ of cost $w(X[x+1],Y[y+1])$ representing a substitution of $X[x+1]$ into $Y[y+1]$ or, if $X[x+1]=Y[y+1]$, a cost-$0$ match of $X[x+1]$ with $Y[y+1]$, for every $x\in[0..|X|)$ and $y\in[0..|Y|)$.
We visualize the graph $G^w(X,Y)$ as a grid graph with $|X|+1$ columns and $|Y|+1$ rows, where $(0,0)$ and $(|X|,|Y|)$ are the top-left and bottom-right vertices, respectively.
Alignments between fragments of $X$ and $Y$ can be interpreted as paths in $G^w(X,Y)$.
Definition 2.2 (Alignment).
Consider strings $X,Y\in\Sigma^*$ and a normalized weight function $w$. An alignment of $X$ onto $Y$, denoted $\mathcal{A}:X\rightsquigarrow Y$, is a path from $(0,0)$ to $(|X|,|Y|)$ in $G^w(X,Y)$, often interpreted as a sequence of vertices. The cost of $\mathcal{A}$, denoted by $\mathsf{cost}^w(\mathcal{A})$, is the total cost of edges belonging to $\mathcal{A}$.
The weighted edit distance of strings $X,Y$ with respect to a weight function $w$ is $\mathsf{ed}^w(X,Y)=\min_{\mathcal{A}:X\rightsquigarrow Y}\mathsf{cost}^w(\mathcal{A})$, where the minimum is taken over all alignments of $X$ onto $Y$. An alignment $\mathcal{A}$ of $X$ onto $Y$ is $w$-optimal if $\mathsf{cost}^w(\mathcal{A})=\mathsf{ed}^w(X,Y)$.
We say that an alignment $\mathcal{A}$ of $X$ onto $Y$ aligns fragments $X[x..x')$ and $Y[y..y')$ if $(x,y),(x',y')\in\mathcal{A}$. In this case, we also denote the cost of the induced alignment (the subpath of $\mathcal{A}$ from $(x,y)$ to $(x',y')$) by $\mathsf{cost}^w_{\mathcal{A}}(X[x..x'),Y[y..y'))$.
A normalized weight function satisfying $w(a,b)=1$ for all distinct $a,b\in\bar{\Sigma}$ gives the unweighted edit distance (Levenshtein distance [22]). In this case, we drop the superscript $w$.
Definition 2.3 (Augmented Alignment Graph, [15]).
For strings $X,Y\in\Sigma^*$ and a weight function $w$, the augmented alignment graph $\bar{G}^w(X,Y)$ is obtained from $G^w(X,Y)$ by adding, for every edge $u\to v$ of $G^w(X,Y)$, a back edge $v\to u$ of weight $1$. (These back edges ensure that $\bar{G}^w(X,Y)$ is strongly connected, and thus the distance matrices used throughout this work are Monge matrices (see Definition 2.11) rather than so-called partial Monge matrices.)
The following fact shows several properties of $\bar{G}^w(X,Y)$. Importantly, $\mathsf{ed}^w(X,Y)$ is the distance from $(0,0)$ to $(|X|,|Y|)$ in $\bar{G}^w(X,Y)$.
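The following self-contained sketch materializes Definitions 2.1 and 2.3 and lets one verify this claim by running Dijkstra on the augmented graph; exposing the back-edge weight as a parameter `back` reflects that any fixed positive value preserves the forward distances.

```python
import heapq

def augmented_alignment_distance(x, y, w, back=1.0):
    """Dijkstra on the augmented alignment graph (Definitions 2.1 and 2.3);
    returns the distance from (0, 0) to (|x|, |y|), i.e., ed^w(x, y)."""
    n, m = len(x), len(y)
    adj = {}

    def add(u, v, cost):
        adj.setdefault(u, []).append((v, cost))
        adj.setdefault(v, []).append((u, back))   # back edge (Definition 2.3)

    for i in range(n + 1):
        for j in range(m + 1):
            if i < n:                                      # deletion of x[i]
                add((i, j), (i + 1, j), w(x[i], None))
            if j < m:                                      # insertion of y[j]
                add((i, j), (i, j + 1), w(None, y[j]))
            if i < n and j < m:                            # match/substitution
                add((i, j), (i + 1, j + 1), w(x[i], y[j]))

    source, target = (0, 0), (n, m)
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, c in adj.get(u, []):
            if d + c < dist.get(v, float("inf")):
                dist[v] = d + c
                heapq.heappush(heap, (d + c, v))
    return float("inf")
```

For instance, with unit weights, `augmented_alignment_distance("abc", "abd", lambda a, b: 0.0 if a == b else 1.0)` returns `1.0`, matching $\mathsf{ed}(X,Y)$.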
Fact 2.4 ([15, Lemma 5.2]).
Consider strings $X,Y\in\Sigma^*$ and a weight function $w$. Every two vertices $(x,y)$ and $(x',y')$ of the graph $\bar{G}^w(X,Y)$ satisfy the following properties:
- Monotonicity. Every shortest path from $(x,y)$ to $(x',y')$ in $\bar{G}^w(X,Y)$ is (weakly) monotone in both coordinates.
- Distance preservation. If $x\leq x'$ and $y\leq y'$, then $\mathrm{dist}_{\bar{G}^w(X,Y)}((x,y),(x',y'))=\mathrm{dist}_{G^w(X,Y)}((x,y),(x',y'))$.
- Path irrelevance. If ($x\leq x'$ and $y\geq y'$) or ($x\geq x'$ and $y\leq y'$), then every path from $(x,y)$ to $(x',y')$ in $\bar{G}^w(X,Y)$ monotone in both coordinates is a shortest path between these two vertices.
Consistently with many edit-distance algorithms, we will often compute not only the distance from $(0,0)$ to $(|X|,|Y|)$ in $\bar{G}^w(X,Y)$ but the entire boundary distance matrix $\mathsf{BM}^w(X,Y)$ that stores the distances from every input vertex on the top-left boundary to every output vertex on the bottom-right boundary of the graph $\bar{G}^w(X,Y)$.
Definition 2.5 (Input and Output Vertices of an Alignment Graph, [15, Definition 5.5]).
Let $X,Y\in\Sigma^*$ be two strings and $w$ be a weight function. Furthermore, for fragments $X[x..x')$ and $Y[y..y')$, let $G'$ be the corresponding rectangular subgraph of $\bar{G}^w(X,Y)$ induced by the vertices $[x..x']\times[y..y']$. We define the input vertices of $G'$ as the sequence of vertices of $G'$ on the left and top boundary of $G'$ in clockwise order. Analogously, we define the output vertices of $G'$ as the sequence of vertices of $G'$ on the bottom and right boundary of $G'$ in counterclockwise order. Furthermore, the input and output vertices of $\bar{G}^w(X,Y)$ are the corresponding boundary vertices of the whole graph $\bar{G}^w(X,Y)$.
Definition 2.6 (Boundary Distance Matrix, [15, Definition 5.6]).
Let $X,Y\in\Sigma^*$ be two strings and $w$ be a weight function. The boundary distance matrix of $X$ and $Y$ with respect to $w$ is a matrix $\mathsf{BM}^w(X,Y)$ of size $(|X|+|Y|+1)\times(|X|+|Y|+1)$, where $\mathsf{BM}^w(X,Y)[i][j]$ is the distance from the $i$-th input vertex to the $j$-th output vertex of $\bar{G}^w(X,Y)$.
The following fact captures the simple dynamic programming algorithm for bounded weighted edit distance.
Fact 2.7 ([15, Fact 3.3]).
Given strings $X,Y\in\Sigma^{\leq n}$, an integer $k\geq 1$, and $O(1)$-time oracle access to a normalized weight function $w$, the value $\mathsf{ed}^w_{\leq k}(X,Y)$ can be computed in $O(nk)$ time. Furthermore, if $\mathsf{ed}^w(X,Y)\leq k$, the algorithm returns a $w$-optimal alignment of $X$ onto $Y$.
The proof of Fact 2.7 relies on the fact that a path of cost at most $k$ can visit only a limited area of the alignment graph $\bar{G}^w(X,Y)$. This property can be formalized as follows:
Fact 2.8 ([15, Lemma 5.3]).
Let $X,Y\in\Sigma^*$ be two strings, $k\geq 0$ be an integer, and $w$ be a normalized weight function. Consider a path $P$ of cost at most $k$ connecting $(x,y)$ to $(x',y')$ in $\bar{G}^w(X,Y)$. All vertices $(x'',y'')\in P$ satisfy $|(x''-y'')-(x-y)|\leq k$ and $|(x''-y'')-(x'-y')|\leq k$.
The breakpoint representation of an alignment $\mathcal{A}=(x_t,y_t)_{t=0}^{m}$ of $X$ onto $Y$ is the subsequence of $\mathcal{A}$ consisting of the endpoints $(x_0,y_0)$ and $(x_m,y_m)$ as well as every pair $(x_t,y_t)$ such that $\mathcal{A}$ does not arrive at or does not depart from $(x_t,y_t)$ using a cost-$0$ diagonal match edge. Note that the size of the breakpoint representation is at most $2\,\mathsf{cost}^w(\mathcal{A})+2$ and that it can be used to retrieve the entire alignment and its cost: for any two consecutive elements $(x,y)$ and $(x',y')$ of the breakpoint representation, it suffices to add the vertices $(x+t,y+t)$ for $0<t<x'-x$. We will also talk of breakpoint representations of paths in $\bar{G}^w(X,Y)$ that are weakly monotone in both coordinates. If such a path starts at $(x,y)$ and ends at $(x',y')$ with $x\leq x'$ and $y\leq y'$, then its breakpoint representation is identical to the breakpoint representation of the corresponding alignment. Otherwise, if $x'<x$ or $y'<y$, we define the breakpoint representation of such a path as the sequence of all its vertices; recall that all edges on such a path have strictly positive weights.
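A small sketch of this compression and decompression (the exact bookkeeping conventions are ours):

```python
def to_breakpoints(alignment, x, y):
    """Keep the endpoints and every vertex incident to a non-match edge."""
    def match_step(u, v):
        (i, j), (i2, j2) = u, v
        return (i2, j2) == (i + 1, j + 1) and x[i] == y[j]

    last = len(alignment) - 1
    return [v for t, v in enumerate(alignment)
            if t == 0 or t == last
            or not match_step(alignment[t - 1], v)
            or not match_step(v, alignment[t + 1])]

def from_breakpoints(bps):
    """Reinsert the runs of diagonal matches between consecutive breakpoints."""
    out = [bps[0]]
    for (i, j), (i2, j2) in zip(bps, bps[1:]):
        out.extend((i + t, j + t) for t in range(1, i2 - i))  # diagonal run
        out.append((i2, j2))
    return out
```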
Planar Graphs, the Monge Property, and Min-Plus Matrix Multiplication
As alignment graphs are planar, we can use the following tool of Klein.
Fact 2.9 (Klein [19]).
Given a directed planar graph $G$ of size $n$ with non-negative edge weights, we can construct in $O(n\log n)$ time a data structure that, given two vertices $u$ and $v$, at least one of which lies on the outer face of $G$, computes the distance $\mathrm{dist}_G(u,v)$ in $O(\log n)$ time. Moreover, the data structure can report the shortest path in $O(\ell\cdot\Delta\cdot\log n)$ time, where $\ell$ is the number of edges on the path and $\Delta$ is the maximum degree of $G$.
The following lemma specifies a straightforward variation of Klein’s algorithm tailored for path reconstruction in alignment graphs.
Lemma 2.10.
Given strings $X,Y\in\Sigma^*$ and $O(1)$-time access to a normalized weight function $w$, we can construct in $\tilde{O}(|X|\cdot|Y|)$ time a data structure that, given two vertices $u$ and $v$, at least one of which lies on the outer face of $\bar{G}^w(X,Y)$, computes the distance $\mathrm{dist}_{\bar{G}^w(X,Y)}(u,v)$ in $O(\log(|X|\cdot|Y|))$ time. Moreover, the breakpoint representation of a shortest path from $u$ to $v$ can be constructed in $\tilde{O}(1+\mathrm{dist}_{\bar{G}^w(X,Y)}(u,v))$ time.
Proof.
The data structure we build is recursive. We first build the data structure of Klein's algorithm (Fact 2.9) for $\bar{G}^w(X,Y)$ and then, if $|X|\geq 2$, recurse onto $\bar{G}^w(X_1,Y)$ and $\bar{G}^w(X_2,Y)$, where $X=X_1\cdot X_2$ and $|X_1|=\lfloor|X|/2\rfloor$. Preprocessing costs $\tilde{O}(|X|\cdot|Y|)$ time in total. For distance queries, the top-level instance of Klein's algorithm is sufficient. It remains to answer path reconstruction queries. We reconstruct the shortest path recursively. First, in $O(\log(|X|\cdot|Y|))$ time, we query $d:=\mathrm{dist}_{\bar{G}^w(X,Y)}(u,v)$. If the distance is $0$, we return the breakpoint representation of the trivial shortest path. Otherwise, we make further recursive calls. If $u$ and $v$ both lie in $\bar{G}^w(X_1,Y)$ or both lie in $\bar{G}^w(X_2,Y)$, by the monotonicity property of Fact 2.4 we can recurse onto the corresponding small recursive data structure instance. Otherwise, every monotone path from $u$ to $v$ contains a vertex $(|X_1|,y^*)$ for some $y^*$. Furthermore, due to Fact 2.8, there are $O(1+d)$ possible values of $y^*$. In $\tilde{O}(1+d)$ time, we query the distances from $u$ and to $v$ using the smaller recursive data structure instances and find some $y^*$ satisfying $\mathrm{dist}(u,(|X_1|,y^*))+\mathrm{dist}((|X_1|,y^*),v)=d$. We then recursively find the breakpoint representations of the optimal path from $u$ to $(|X_1|,y^*)$ and the optimal path from $(|X_1|,y^*)$ to $v$ in the smaller recursive data structure instances. There are $O(\log|X|)$ recursive levels, and on each one of them, our total work is $\tilde{O}(1+d)$; thus, the total time complexity is $\tilde{O}(1+d)$.
As discussed in [13, Section 2.3] and [15, Fact 3.7 and Section 5.1], the planarity of alignment graphs implies that the matrices $\mathsf{BM}^w(X,Y)$ satisfy the following Monge property:
Definition 2.11 (Monge Matrix).
A matrix $A$ of size $p\times q$ is a Monge matrix if, for all $i\in[1..p)$ and $j\in[1..q)$, we have $A[i][j]+A[i+1][j+1]\leq A[i][j+1]+A[i+1][j]$.
In the context of distance matrices, concatenating paths corresponds to the min-plus product of matrices; this operation preserves the Monge property and can be evaluated fast.
Fact 2.12 (Min-Plus Matrix Product, SMAWK Algorithm [3], [29, Theorem 2]).
Let $A$ and $B$ be matrices of size $p\times q$ and $q\times r$, respectively. Their min-plus product $A\otimes B$ is a matrix of size $p\times r$ such that $(A\otimes B)[i][j]=\min_{t\in[1..q]}(A[i][t]+B[t][j])$ for all $i\in[1..p]$ and $j\in[1..r]$.
If $A$ and $B$ are Monge matrices, then $A\otimes B$ is also a Monge matrix. Moreover, $A\otimes B$ can be constructed in $O(p(q+r))$ time assuming $O(1)$-time random access to $A$ and $B$.
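In the later sections, we mostly multiply a row vector with a Monge matrix. SMAWK does this in linear time; the divide-and-conquer sketch below, which exploits that the column-wise minimizers are monotone, achieves the same up to a logarithmic factor and is easier to verify:

```python
def row_times_monge(v, B):
    """Compute the min-plus product v (x) B of a length-q row vector and a
    q x r Monge matrix in O((q + r) log r) time (0-indexed throughout)."""
    q, r = len(B), len(B[0])
    out = [None] * r

    def solve(cols, lo, hi):
        if not cols:
            return
        mid = len(cols) // 2
        j = cols[mid]
        best, arg = min((v[t] + B[t][j], t) for t in range(lo, hi))
        out[j] = best
        solve(cols[:mid], lo, arg + 1)   # minimizers to the left stay <= arg
        solve(cols[mid + 1:], arg, hi)   # minimizers to the right stay >= arg

    solve(list(range(r)), 0, q)
    return out
```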
3 $\tilde{O}(W^2)$-Time Updates after $\tilde{O}(nW)$-Time Preprocessing
In this section, we present a dynamic algorithm that maintains $\mathsf{ed}^w_{\leq W}(X,Y)$ for two dynamic strings $X,Y\in\Sigma^{\leq n}$ with $\tilde{O}(nW)$ preprocessing time and $\tilde{O}(W^2)$ update time. As all vertical and horizontal edges cost at least $1$, the relevant part of the graph $\bar{G}^w(X,Y)$ has size $O(nW)$ (i.e., it is a band of width $O(W)$ around the main diagonal), and thus we can afford to preprocess this part of $\bar{G}^w(X,Y)$. Nevertheless, the curse of dynamic algorithms for bounded weighted edit distance is that, after $\Theta(W)$ updates (e.g., character deletions from the end of $X$ and character insertions at the beginning of $Y$), the preprocessed part of the original graph can be completely disjoint from the relevant part of the updated graph, which renders the original preprocessing useless. To counteract this challenge, we use the idea from [20]: we dynamically maintain information about $\bar{G}^w(X,X)$ instead of $\bar{G}^w(X,Y)$. Every update to $X$ now affects both dimensions of $\bar{G}^w(X,X)$, so the relevant part of the graph does not shift anymore, and hence our preprocessing does not expire. Tasked with computing $\mathsf{ed}^w_{\leq W}(X,Y)$ upon a query, we rely on the fact that, unless $\mathsf{ed}^w_{\leq W}(X,Y)=\infty$, the graph $\bar{G}^w(X,Y)$ is similar to $\bar{G}^w(X,X)$.
Consistently with the previous works [8, 15], we cover the relevant part of with rectangular subgraphs of size each; see Figure 1. We preprocess each subgraph in time. A single update to alters a constant number of subgraphs , so we can repeat the preprocessing for such subgraphs in time. The string does not affect any subgraph ; we only store it in a dynamic strings data structure supporting efficient substring equality queries for the concatenation .
Let $B_i$, for $i\in[1..m)$, be the set of vertices in the intersection of $G_i$ and $G_{i+1}$; see Figure 1. Furthermore, let $B_0=\{(0,0)\}$ and $B_m=\{(|X|,|X|)\}$. Let $G$ be the union of all graphs $G_i$. Let $D_{i,j}$, for $i,j\in[0..m]$ with $i\leq j$, denote the matrix of pairwise distances from $B_i$ to $B_j$ in $G$. Note that $D_{i,k}=D_{i,j}\otimes D_{j,k}$ holds for all $i,j,k\in[0..m]$ with $i\leq j\leq k$.
Upon a query for $\mathsf{ed}^w_{\leq W}(X,Y)$, we note that only $O(W)$ subgraphs of $\bar{G}^w(X,Y)$ differ from the corresponding subgraphs of $\bar{G}^w(X,X)$. (If $\mathsf{ed}(X,Y)>W$, then $\mathsf{ed}^w_{\leq W}(X,Y)=\infty$, so we can assume that $\mathsf{ed}(X,Y)\leq W$.) We call such subgraphs affected. Let $G'_i$, $B'_i$, and $D'_{i,j}$ be the analogs of $G_i$, $B_i$, and $D_{i,j}$ in $\bar{G}^w(X,Y)$. Note that if $\mathsf{ed}^w(X,Y)\leq W$, then $\mathsf{ed}^w(X,Y)$ is the only entry of the matrix $D'_{0,m}$. Furthermore, we have $D'_{0,m}=D'_{0,1}\otimes D'_{1,2}\otimes\cdots\otimes D'_{m-1,m}$. We compute this product from left to right. For unaffected subgraphs $G'_i$, we have $D'_{i-1,i}=D_{i-1,i}$, and we can use precomputed information about $G_i$. To multiply by $D'_{i-1,i}$ for affected subgraphs $G'_i$, we use the following lemma, the proof of which can be found in the full version of the paper [6].
Lemma 3.1 (Generalization of [11, Claim 4.2]).
Let $X$ and $Y$ be nonempty strings and $w$ be a weight function. Given $O(1)$-time oracle access to $w$, the strings $X$ and $Y$ can be preprocessed in $\tilde{O}(|X|\cdot|Y|)$ time so that, given a string $Y'$ and a vector $v$ of size $|X|+|Y'|+1$, the row $v\otimes\mathsf{BM}^w(X,Y')$ can be computed in $\tilde{O}(|X|+|Y|+|Y'|)$ time. Furthermore, given an input vertex $u$ and an output vertex $v'$ of $\bar{G}^w(X,Y')$, the shortest path from $u$ to $v'$ in $\bar{G}^w(X,Y')$ can be computed in the same time complexity.
There are $O(W)$ edits that transform $X$ into $Y$, and each edit affects a constant number of subgraphs $G_i$. Therefore, applying Lemma 3.1 for all affected subgraphs takes $\tilde{O}(W^2)$ time in total. Between two consecutive affected subgraphs $G'_i$ and $G'_j$, we have $D'_{i,j-1}=D_{i,j-1}$, and thus it is sufficient to store some range composition data structure over $D_{0,1},D_{1,2},\ldots,D_{m-1,m}$ to quickly propagate the intermediate results between subsequent affected subgraphs.
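A toy version of such a range-composition structure appears below: a static segment tree in place of the self-balancing tree used in the full proof (which must also support insertions and deletions of subgraphs); `multiply` would be the min-plus product of Fact 2.12. `query(l, r)` returns $O(\log n)$ precomputed matrices whose product equals $D_{l,r}$, and `update(i, M)` refreshes the $O(\log n)$ ancestors of the $i$-th leaf.

```python
class RangeComposition:
    """Segment tree over matrices composed via an associative `multiply`."""

    def __init__(self, mats, multiply):
        self.mul = multiply
        self.size = 1
        while self.size < len(mats):
            self.size *= 2
        self.tree = [None] * (2 * self.size)
        for i, M in enumerate(mats):
            self.tree[self.size + i] = M
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self._combine(node)

    def _combine(self, node):
        l, r = self.tree[2 * node], self.tree[2 * node + 1]
        return l if r is None else (r if l is None else self.mul(l, r))

    def update(self, i, M):
        node = self.size + i
        self.tree[node] = M
        node //= 2
        while node:
            self.tree[node] = self._combine(node)
            node //= 2

    def query(self, l, r):
        """O(log n) node matrices covering [l, r), in left-to-right order."""
        res_l, res_r = [], []
        lo, hi = l + self.size, r + self.size
        while lo < hi:
            if lo & 1:
                res_l.append(self.tree[lo]); lo += 1
            if hi & 1:
                hi -= 1; res_r.append(self.tree[hi])
            lo //= 2; hi //= 2
        return res_l + res_r[::-1]
```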
As the information we maintain dynamically regards $\bar{G}^w(X,X)$ and does not involve the string $Y$, we may generalize the problem we are solving. All we need upon a query is an alignment of $X$ onto $Y$ of unweighted cost $O(W)$. Such an alignment can be obtained, for example, from a dynamic bounded unweighted edit distance algorithm [15] in $\tilde{O}(W)$ time per update. Alternatively, it is sufficient to have $\tilde{O}(1)$-time equality tests between fragments of $X$ and $Y$. In this case, we can run the classical Landau–Vishkin [21] algorithm from scratch after every update in $\tilde{O}(W^2)$ time to either get an optimal unweighted alignment or learn that $\mathsf{ed}(X,Y)>W$. Efficient fragment equality tests are possible in a large variety of settings [10], including the dynamic setting [23, 4, 14, 18], so we will not focus on any specific implementation. Instead, we assume that the string $Y$ is already given as a sequence of edits of an unweighted alignment transforming $X$ into $Y$, and our task is to find a $w$-optimal alignment. We are now ready to formulate the main theorem of this section.
Theorem 3.2.
There is a dynamic data structure that, given a string $Z\in\Sigma^{\leq n}$, a threshold $W\geq 1$, and $O(1)$-time oracle access to a normalized weight function $w$, can be initialized in $\tilde{O}(nW)$ time and allows for the following operations:
- Apply a character edit to $Z$ in $\tilde{O}(W^2)$ time.
- Given $O(W)$ edits transforming $Z$ into a string $Z'$, compute $\mathsf{ed}^w_{\leq W}(Z,Z')$ in $\tilde{O}(W^2)$ time. Furthermore, if $\mathsf{ed}^w(Z,Z')\leq W$, the algorithm returns the breakpoint representation of a $w$-optimal alignment of $Z$ onto $Z'$.
Proof Sketch.
At the initialization phase, we decompose $\bar{G}^w(Z,Z)$ into subgraphs $G_1,\ldots,G_m$ and, for each subgraph $G_i$, initialize the algorithm of Lemma 3.1 and compute $D_{i-1,i}$ using Lemma 2.10. We also build a dynamic range composition data structure over $D_{0,1},\ldots,D_{m-1,m}$, implemented as a self-balancing binary tree. Given a range $[i..j]$, in $O(\log n)$ time, this data structure returns (pointers to) $O(\log n)$ matrices whose min-plus product equals $D_{i,j}$.
When an update to $Z$ comes, it affects a constant number of subgraphs $G_i$, which we locally adjust, recomputing the initialization in $\tilde{O}(W^2)$ time.
Upon a query, we are tasked with computing $D'_{0,m}=D'_{0,1}\otimes\cdots\otimes D'_{m-1,m}$. We locate the affected subgraphs and process them one by one. We use Lemma 3.1 to multiply the currently computed row by $D'_{i-1,i}$ for each affected subgraph $G'_i$. The algorithm then uses the range composition data structure to obtain $O(\log n)$ matrices that correspond to the segment of matrices between the current affected subgraph and the next affected subgraph. We use the SMAWK algorithm (Fact 2.12) to multiply the current row by these matrices. After processing all the matrices, we obtain $D'_{0,m}$, the only entry of which is equal to $\mathsf{ed}^w(Z,Z')$ if $\mathsf{ed}^w(Z,Z')\leq W$. The applications of Lemma 3.1 work in $\tilde{O}(W^2)$ time in total. Each of the $O(W)$ blocks of unaffected subgraphs is treated by invoking the SMAWK algorithm (Fact 2.12) $O(\log n)$ times, for a total of $\tilde{O}(W^2)$ time.
See the full version of the paper [6] for a formal proof, including the alignment reconstruction procedure. We complete this section with a simple corollary improving the query time if $\mathsf{ed}^w(Z,Z')\ll W$.
Corollary 3.3.
There is a dynamic data structure that, given a string $Z\in\Sigma^{\leq n}$, a threshold $W\geq 1$, and $O(1)$-time oracle access to a normalized weight function $w$, can be initialized in $\tilde{O}(nW)$ time and allows for the following operations:
- Apply a character edit to $Z$ in $\tilde{O}(W^2)$ time.
- Given a sequence of edits transforming $Z$ into a string $Z'$, compute $\mathsf{ed}^w_{\leq W}(Z,Z')$ in $\tilde{O}(k^2)$ time, where $k=\min(2\,\mathsf{ed}^w(Z,Z'),2W)$. Furthermore, if $\mathsf{ed}^w(Z,Z')\leq W$, the algorithm returns the breakpoint representation of a $w$-optimal alignment of $Z$ onto $Z'$.
Proof.
We maintain the data structures of Theorem 3.2 for thresholds $2^0,2^1,\ldots,2^{\lceil\log W\rceil}$. Upon a query, we query the data structures for subsequent thresholds starting from $2^0$ until we get a positive result or reach $2^{\lceil\log W\rceil}$.
As discussed earlier, in a variety of settings, upon a query, in $\tilde{O}(k^2)$ time we can compute an optimal unweighted alignment of $Z$ onto $Z'$ or learn that $\mathsf{ed}(Z,Z')>W$. Therefore, we may assume that whenever Corollary 3.3 is queried, the number of input edits is $O(\min(\mathsf{ed}(Z,Z'),W))$. In this case, the query time complexity can be rewritten as $\tilde{O}(\min(\mathsf{ed}^w(Z,Z'),W)^2)$.
Due to the robustness of the setting of Corollary 3.3, there are many problems beyond dynamic weighted edit distance that can be solved using Corollary 3.3. For example, given strings $X,Y_1,\ldots,Y_q\in\Sigma^{\leq n}$, we can find $\mathsf{ed}^w_{\leq W}(X,Y_i)$ for all $i\in[1..q]$ in $\tilde{O}(nW+q(n+W^2))$ time, compared to $\tilde{O}(q(n+\sqrt{nW^3}))$ time arising from $q$ applications of the optimal static algorithm [8].
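In pseudocode, the batch application could look as follows; `SelfAlignmentDS` (a wrapper for the structure of Corollary 3.3) and `unweighted_edits` (a Landau–Vishkin-style routine returning at most $2W$ edits or `None`) are hypothetical names for illustration, not APIs defined in this paper:

```python
def batch_distances(x, ys, w, W):
    ds = SelfAlignmentDS(x, W, w)          # ~O(nW) preprocessing, paid once
    results = []
    for y in ys:
        edits = unweighted_edits(x, y, W)  # None whenever ed(x, y) > W
        results.append(None if edits is None else ds.query(edits))
    return results
```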
4 Trade-Off Algorithm: Technical Overview
In the full version of the paper [6], we generalize the result of Theorem 3.2 in two ways. First, we extend the set of updates the string $Z$ supports from character edits to block edits, which include substring deletions and copy-pastes. More importantly, rather than having $\tilde{O}(nW)$ preprocessing time and $\tilde{O}(W^2)$ query time, we allow for a complete trade-off matching the lower bound of Theorem 1.5, i.e., $\tilde{O}(nx)$ preprocessing time and $\tilde{O}(W^3/x)$ query time for any $x\in[1,W]$.
A major challenge for $x\ll W$ is that we cannot preprocess the whole width-$O(W)$ band around the main diagonal of $\bar{G}^w(Z,Z)$. In the proof of Theorem 3.2, we cover the band with $O(W)\times O(W)$-sized subgraphs and preprocess each of them in $\tilde{O}(W^2)$ time for a total of $\tilde{O}(nW)$. To achieve $\tilde{O}(nx)$-time preprocessing, we are forced to instead use $O(W^2/x)\times O(W^2/x)$-sized subgraphs and still spend just $\tilde{O}(W^2)$ preprocessing time per subgraph.
To remedy the shortage of preprocessing time, we use the ideas behind the recent static algorithms [8, 15]. At the expense of a logarithmic-factor slowdown, they allow us to answer queries only for "repetitive" fragments of $Z$ rather than the whole $Z$. More precisely, we are tasked to answer queries involving fragments $Z[i..j)$ of $Z$ satisfying $\mathsf{selfed}(Z[i..j))=O(W)$, where the self-edit distance $\mathsf{selfed}(Z[i..j))$ is the minimum unweighted cost of an alignment of $Z[i..j)$ onto itself that does not match any character of $Z[i..j)$ with itself. We can use this to our advantage: the subgraphs corresponding to more "repetitive" fragments allow for more thorough preprocessing in $\tilde{O}(W^2)$ time since we can reuse some computations. On the other hand, upon a query, we can afford to spend more time on subgraphs with less "repetitive" fragments as their number is limited due to the above bound on the self-edit distance.
This approach requires the repetitiveness measure to be super-additive: $\mu(Z_1)+\mu(Z_2)\leq\mu(Z_1Z_2)$. Unfortunately, self-edit distance is sub-additive: $\mathsf{selfed}(Z_1Z_2)\leq\mathsf{selfed}(Z_1)+\mathsf{selfed}(Z_2)$. To circumvent this issue, we introduce a better-suited notion of repetitiveness that we call the $s$-shifted self-edit distance $\mathsf{selfed}_s(Z)$: a relaxed version of $\mathsf{selfed}(Z)$ allowed to delete up to $s$ first characters of $Z$ and insert up to $s$ last characters of $Z$ for free. In contrast to self-edit distance, the $s$-shifted variant satisfies $\mathsf{selfed}_s(Z_1)+\mathsf{selfed}_s(Z_2)\leq\mathsf{selfed}_s(Z_1Z_2)$ if $|Z_1|,|Z_2|\geq s$.
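To make the new measure concrete, here is a brute-force $O(n^2)$ reference implementation of our reading of the definition (the actual algorithm never computes it this way):

```python
def shifted_self_ed(z, s):
    """Brute-force s-shifted self-edit distance: align z onto z, never
    matching a character with itself; deleting any of the first s characters
    and inserting any of the last s characters is free."""
    n = len(z)
    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(n + 1):
        for j in range(n + 1):
            d = dp[i][j]
            if d == INF:
                continue
            if i < n:                                   # deletion of z[i]
                dp[i + 1][j] = min(dp[i + 1][j], d + (0 if i < s else 1))
            if j < n:                                   # insertion of z[j]
                dp[i][j + 1] = min(dp[i][j + 1], d + (0 if j >= n - s else 1))
            if i < n and j < n:
                if z[i] == z[j] and i != j:             # shifted match
                    dp[i + 1][j + 1] = min(dp[i + 1][j + 1], d)
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], d + 1)  # substitution
    return dp[n][n]
```

In particular, `shifted_self_ed("abab", 2)` evaluates to `0`, reflecting that $\mathsf{selfed}_s(Z)=0$ precisely when $Z$ has a period of at most $s$.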
By using the value of $\mathsf{selfed}_s(Z[i..j))$ to determine the level of preprocessing performed on the corresponding subgraph, upon a query, similarly to Theorem 3.2, we have $O(W)$ "interesting" subgraphs (subgraphs that are either affected by the input edits or correspond to fragments with $\mathsf{selfed}_s(Z[i..j))>0$) that we process one by one in $\tilde{O}(W^2/x)$ time each using a generalization of Lemma 3.1. The super-additivity property of $\mathsf{selfed}_s$ guarantees that this sums up to a total of $\tilde{O}(W^3/x)$. The remaining subgraphs are not affected by the input edits and satisfy $\mathsf{selfed}_s(Z[i..j))=0$, which means that $Z[i..j)$ has a period of at most $s$. For such subgraphs, we can afford the complete preprocessing and, upon a query, we process them in chunks in $\tilde{O}(W^3/x)$ time in total, precisely as in the proof of Theorem 3.2. Furthermore, similarly to Theorem 3.2, all the information the data structure stores is either local or cumulative, and thus block edits can be supported straightforwardly. This rather simple high-level approach runs into several low-level technical complications that are discussed in detail in the full version of the paper [6].
5 Lower Bounds: Technical Overview
The reduction provided in [8] is insufficient because the instances it produces satisfy $\mathsf{ed}(X,Y)=\Theta(W)$ rather than $\mathsf{ed}(X,Y)=O(1)$. Still, we reuse the hardness of the following problem:
Problem 5.1 (Batched Weighted Edit Distance [8]).
Given a batch of strings $X_1,\ldots,X_b$, a string $P$, a threshold $W\geq 1$, and oracle access to a weight function $w$, decide whether $\min_{i\in[1..b]}\mathsf{ed}^w(X_i,P)\leq W$.
The hard instances of Problem 5.1, which cannot be solved significantly faster than in $b\cdot\sqrt{|P|\cdot W^3}$ time assuming the APSP Hypothesis, satisfy many technical properties, including $\mathsf{hd}(X_i,P)=O(W)$ for all $i\in[1..b]$, where $\mathsf{hd}$ is the Hamming distance, i.e., the number of positions in which $X_i$ differs from $P$.
The approach of [8, Section 7] is to construct strings $X$ and $Y$ that take the following form if we ignore some extra gadgets: $X$ consists of the batch strings $X_1,\ldots,X_b$ interleaved with copies of $P$, whereas $Y$ consists solely of copies of $P$, with the number of copies chosen so that $\mathsf{ed}(X,Y)=\Theta(W)$. The ignored gadgets let us focus on alignments $\mathcal{A}_i$ that satisfy the following properties for $i\in[1..b]$:
- the fragment $X_i$ is aligned with the copy of $P$ in $Y$ located between the copies aligned with $X_{i-1}$ and $X_{i+1}$;
- the copies of $P$ in $X$ are matched with the remaining copies of $P$ in $Y$;
- all the characters within the leftover fragments of $Y$ are inserted;
- for $j\in[1..b]\setminus\{i\}$, the fragment $X_j$ is aligned with the $j$-th copy of $P$ in $Y$ if $j<i$ and with the $(j+1)$-st copy if $j>i$.
This ensures that the cost of $\mathcal{A}_i$ is equal to a baseline cost (independent of $i$) plus $\mathsf{ed}^w(X_i,P)$.
Our reduction uses three types of "$P$ gadgets", with the latter two playing similar roles. We also introduce a new symbol $\$$ that is extremely expensive to delete, insert, or substitute with any other symbol. Modulo some extra gadgets, our strings $X$ and $Y$ again consist of "$P$ gadgets" interleaved with copies of $P$, with two occurrences of $\$$ inserted into each of them at shifted positions.
Notice that removing the two $\$$ symbols from $X$ or from $Y$ results in the same string, so we have $\mathsf{ed}(X,Y)\leq 4$. Due to the expensive cost of editing a $\$$, any optimal alignment must match the two $\$$ symbols in $X$ with the two $\$$ symbols in $Y$, deleting a prefix of $X$ and a suffix of $Y$ for some predictable cost. Thus, we can focus our analysis on the fragments $\tilde{X}$ of $X$ and $\tilde{Y}$ of $Y$ enclosed between the two $\$$ symbols.
Observe that both $\tilde{X}$ and $\tilde{Y}$ consist of "$P$ gadgets" interleaved with copies of $P$, and that $\tilde{Y}$ contains one more copy of $P$ and one more "$P$ gadget" than $\tilde{X}$. Analogously to [8], the ignored extra gadgets let us focus on the alignments $\mathcal{A}_i$ satisfying the following properties for $i\in[1..b]$:
- the fragment $X_i$ in $\tilde{X}$ is aligned with a designated copy of $P$ in $\tilde{Y}$;
- the copies of $P$ in $\tilde{X}$ are matched with the remaining copies of $P$ in $\tilde{Y}$;
- all characters within the leftover fragments of $\tilde{Y}$ are inserted;
- the remaining "$P$ gadgets" in $\tilde{X}$ are aligned with the remaining "$P$ gadgets" in $\tilde{Y}$.
To perfectly mimic [8], we should ensure that all the "$P$ gadgets" that we can possibly align are at weighted edit distance $0$ from one another. We only managed to construct the gadgets so that:
- every aligned pair involving a gadget of the second type is at one fixed weighted edit distance, and
- every aligned pair involving a gadget of the third type is at another fixed weighted edit distance.
Although we have two different distances between the relevant pairs of "$P$ gadgets", it is easy to see that the number of pairs of either type is constant across all the alignments $\mathcal{A}_i$. This is sufficient to guarantee that the cost of $\mathcal{A}_i$ is equal to some baseline cost plus $\mathsf{ed}^w(X_i,P)$. Moreover, as in [8], the costs of the alignments $\mathcal{A}_i$ are $O(W)$, and, if we denote $W=\Theta(n^{\alpha})$, then the lower bound derived from the lower bound for Problem 5.1 excludes running times of the form $O((\sqrt{nW^3})^{1-\delta})$.
Lower Bounds for Dynamic Weighted Edit Distance
From Theorem 1.2, we derive two dynamic lower bounds, each justifying a different limitation of our algorithm given in Theorem 1.6. Our first lower bound, formalized in the following lemma and proved in the full version of the paper [6], concerns our choice to fix a threshold $W$ at the preprocessing phase. If, instead, we aimed for time complexities depending on the current value of $\mathsf{ed}^w(X,Y)$, then, for sufficiently large values of $\mathsf{ed}^w(X,Y)$, we could not achieve update times improving upon the static algorithm.
Lemma 5.2.
Consider the following dynamic problem: Given an integer $n$ and $O(1)$-time oracle access to a normalized weight function $w$, maintain two initially empty strings $X,Y\in\Sigma^{\leq n}$ that throughout the lifetime of the algorithm satisfy $\mathsf{ed}^w(X,Y)=O(n^{\alpha})$, subject to updates (character edits), and output $\mathsf{ed}^w(X,Y)$ after every update. Suppose that there is an algorithm that solves this problem with $P(n)$ preprocessing and $U(n)$ update time, where $U$ is non-decreasing. If $P(n)=\mathrm{poly}(n)$, $U(n)=O(n^{(3\alpha-1)/2-\delta})$, and the preprocessing cannot access the weight function, for some real parameters $\alpha\in(1/2,1]$ and $\delta>0$, then the APSP Hypothesis fails.
Our second dynamic lower bound justifies the specific trade-off between the preprocessing and update times in Theorem 1.6. In simple words, we prove that, with preprocessing time $\tilde{O}(nx)$ for $x\in[1,W]$, the best possible update time is $\tilde{O}(W^3/x)$. For that, we note that Theorem 1.2 is based on a reduction from the Negative Triangle problem, which is self-reducible: solving $N$ instances of bounded weighted edit distance from Theorem 1.2 is hence, in general, $N$ times harder than solving a single instance. Given such instances $(X_1,Y_1),\ldots,(X_N,Y_N)$, we initialize a dynamic algorithm with a pair of equal strings $(X_1\$X_2\$\cdots\$X_N,\;X_1\$X_2\$\cdots\$X_N)$, where $\$$ is an auxiliary symbol that is very expensive to edit. For $i\in[1..N]$, in $O(1)$ updates, we can transform the fragment $X_i$ of the second string into $Y_i$ and retrieve $\mathsf{ed}^w(X_i,Y_i)$. By applying and reverting these updates for every $i\in[1..N]$, we can thus find $\mathsf{ed}^w(X_i,Y_i)$ for all $i\in[1..N]$. If we pick $N=n^{\varepsilon}\cdot nx^2/W^3$ for an arbitrarily small constant $\varepsilon>0$, then the static lower bound requires total time $(\sqrt{N\cdot nW^3})^{1-o(1)}=(n^{\varepsilon/2}\cdot nx)^{1-o(1)}$. Our preprocessing time is asymptotically smaller, so, among the $O(N)$ updates, some must take $(W^3/x)^{1-O(\varepsilon)}$ time. See the full version of the paper [6] for a formal proof.
References
- [1] Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In FOCS 2015, pages 59–78. IEEE Computer Society, 2015. doi:10.1109/FOCS.2015.14.
- [2] Amir Abboud, Thomas Dueholm Hansen, Virginia Vassilevska Williams, and Ryan Williams. Simulating branching programs with edit distance and friends: or: a polylog shaved is a lower bound made. In STOC 2016, pages 375–388. ACM, 2016. doi:10.1145/2897518.2897653.
- [3] Alok Aggarwal, Maria M. Klawe, Shlomo Moran, Peter W. Shor, and Robert E. Wilber. Geometric applications of a matrix-searching algorithm. Algorithmica, 2:195–208, 1987. doi:10.1007/BF01840359.
- [4] Stephen Alstrup, Gerth Stølting Brodal, and Theis Rauhe. Pattern matching in dynamic texts. In SODA 2000, pages 819–828. ACM/SIAM, 2000. URL: http://dl.acm.org/citation.cfm?id=338219.338645.
- [5] Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). SIAM J. Comput., 47(3):1087–1097, 2018. doi:10.1137/15M1053128.
- [6] Itai Boneh, Egor Gorbachev, and Tomasz Kociumaka. Bounded weighted edit distance: Dynamic algorithms and matching lower bounds, 2025. arXiv:2507.02548v1.
- [7] Karl Bringmann and Marvin Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In FOCS 2015, pages 79–97. IEEE Computer Society, 2015. doi:10.1109/FOCS.2015.15.
- [8] Alejandro Cassis, Tomasz Kociumaka, and Philip Wellnitz. Optimal algorithms for bounded weighted edit distance. In FOCS 2023, pages 2177–2187. IEEE, 2023. doi:10.1109/FOCS57990.2023.00135.
- [9] Panagiotis Charalampopoulos, Tomasz Kociumaka, and Shay Mozes. Dynamic string alignment. In CPM 2020, volume 161 of LIPIcs, pages 9:1–9:13. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.CPM.2020.9.
- [10] Panagiotis Charalampopoulos, Tomasz Kociumaka, and Philip Wellnitz. Faster approximate pattern matching: A unified approach. In FOCS 2020, pages 978–989. IEEE, 2020. doi:10.1109/FOCS46700.2020.00095.
- [11] Panagiotis Charalampopoulos, Tomasz Kociumaka, and Philip Wellnitz. Pattern matching under weighted edit distance. In FOCS 2025, 2025.
- [12] Debarati Das, Jacob Gilbert, MohammadTaghi Hajiaghayi, Tomasz Kociumaka, and Barna Saha. Weighted edit distance computation: Strings, trees, and Dyck. In STOC 2023, pages 377–390, New York, NY, USA, 2023. Association for Computing Machinery. doi:10.1145/3564246.3585178.
- [13] Jittat Fakcharoenphol and Satish Rao. Planar graphs, negative weight edges, shortest paths, and near linear time. J. Comput. Syst. Sci., 72(5):868–889, 2006. doi:10.1016/j.jcss.2005.05.007.
- [14] Paweł Gawrychowski, Adam Karczmarz, Tomasz Kociumaka, Jakub Łącki, and Piotr Sankowski. Optimal dynamic strings. In SODA 2018, pages 1509–1528. SIAM, 2018. doi:10.1137/1.9781611975031.99.
- [15] Egor Gorbachev and Tomasz Kociumaka. Bounded edit distance: Optimal static and dynamic algorithms for small integer weights. In 57th Annual ACM Symposium on Theory of Computing, STOC 2025, pages 2157–2166, New York, NY, USA, 2025. Association for Computing Machinery. doi:10.1145/3717823.3718168.
- [16] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. J. Comput. Syst. Sci., 62(2):367–375, 2001. doi:10.1006/jcss.2000.1727.
- [17] Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? J. Comput. Syst. Sci., 63(4):512–530, 2001. doi:10.1006/jcss.2001.1774.
- [18] Dominik Kempa and Tomasz Kociumaka. Dynamic suffix array with polylogarithmic queries and updates. In STOC 2022, pages 1657–1670. ACM, 2022. doi:10.1145/3519935.3520061.
- [19] Philip N. Klein. Multiple-source shortest paths in planar graphs. In SODA, pages 146–155. SIAM, 2005. URL: http://dl.acm.org/citation.cfm?id=1070432.1070454.
- [20] Tomasz Kociumaka, Anish Mukherjee, and Barna Saha. Approximating edit distance in the fully dynamic model. In FOCS 2023, pages 1628–1638. IEEE, 2023. doi:10.1109/FOCS57990.2023.00098.
- [21] Gad M. Landau and Uzi Vishkin. Fast string matching with differences. J. Comput. System Sci., 37(1):63–78, 1988. doi:10.1016/0022-0000(88)90045-1.
- [22] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Dokl. Akad. Nauk SSSR, 163:845–848, 1965. URL: http://mi.mathnet.ru/eng/dan31411.
- [23] Kurt Mehlhorn, Rajamani Sundar, and Christian Uhrig. Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica, 17:183–198, 1997. doi:10.1007/bf02522825.
- [24] Eugene W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1(2):251–266, 1986. doi:10.1007/BF01840446.
- [25] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48(3):443–453, 1970. doi:10.1016/b978-0-12-131200-8.50031-9.
- [26] Peter H. Sellers. On the theory and computation of evolutionary distances. SIAM J. Appl. Math., 26(4):787–793, 1974. doi:10.1137/0126070.
- [27] Peter H. Sellers. The theory and computation of evolutionary distances: Pattern recognition. J. Algorithms, 1(4):359–373, 1980. doi:10.1016/0196-6774(80)90016-4.
- [28] Alexander Tiskin. Semi-local string comparison: Algorithmic techniques and applications. Math. Comput. Sci., 1(4):571–603, 2008. doi:10.1007/s11786-007-0033-3.
- [29] Alexander Tiskin. Fast distance multiplication of unit-Monge matrices. Algorithmica, 71(4):859–888, 2015. doi:10.1007/S00453-013-9830-Z.
- [30] Esko Ukkonen. Finding approximate patterns in strings. J. Algorithms, 6(1):132–137, 1985. doi:10.1016/0196-6774(85)90023-9.
- [31] Virginia Vassilevska Williams and R. Ryan Williams. Subcubic equivalences between path, matrix, and triangle problems. Journal of the ACM, 65(5):27:1–27:38, 2018. doi:10.1145/3186893.
- [32] Taras K. Vintsyuk. Speech discrimination by dynamic programming. Cybernetics, 4(1):52–57, 1968. doi:10.1007/BF01074755.
- [33] Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168–173, 1974. doi:10.1145/321796.321811.
