On the I/O Complexity of the Cocke-Younger-Kasami Algorithm and of a Family of Related Dynamic Programming Algorithms
Abstract
Asymptotically tight lower bounds are derived for the Input/Output (I/O) complexity of a class of dynamic programming algorithms, including matrix chain multiplication, optimal polygon triangulation, and the construction of optimal binary search trees. Assuming no recomputation of intermediate values, we establish an $\Omega\left(\frac{n^3}{\sqrt{M}B}\right)$ I/O lower bound, where $n$ denotes the size of the input, $M$ denotes the size of the available fast memory (cache), and $B$ denotes the number of memory words that can be moved with a single I/O operation. When recomputation is allowed, we show that the same bound holds for $M \le cn$, where $c$ is a positive constant. In the case where $M = \Omega(n)$, we show an $\Omega(n/B)$ I/O lower bound. We also discuss algorithms for which the number of executed I/O operations matches asymptotically each of the presented lower bounds, which are thus asymptotically tight.
Additionally, we refine our general method to obtain a lower bound for the I/O complexity of the Cocke-Younger-Kasami algorithm, where the size of the grammar impacts the I/O complexity. An upper bound with asymptotically matching performance in many cases is also provided.
Keywords and phrases: I/O complexity, Dynamic Programming Algorithms, Lower Bounds, Recomputation, Cocke-Younger-Kasami
2012 ACM Subject Classification: Theory of computation → Design and analysis of algorithms
Editors: Pat Morin and Eunjin Oh
Series and Publisher: Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
The performance of computing systems is significantly influenced by data movement, affecting both time and energy consumption. This ongoing technological trend [37] is expected to continue as fundamental limits on minimum device size and maximum message speed lead to inherent costs associated with data transfer, be it across the levels of a hierarchical memory system or between processing elements in a parallel system [10]. While the communication requirements of algorithms have been extensively studied, deriving significant lower bounds based on these requirements and matching upper bounds remains a critical challenge.
Dynamic Programming (DP) is a classic algorithmic framework introduced by Bellman [4] that is particularly effective for optimization tasks in which a problem can be broken down into overlapping subproblems with optimal substructure, storing the solutions of these subproblems to prevent redundant work. DP is widely used in computer science and other important fields such as control theory, operations research, and computational biology.
In this work, we analyze the memory access cost (I/O complexity) of a family of DP algorithms when executed in a two-level storage hierarchy with $M$ words of fast memory (cache). The analyzed algorithms share a similar structure of dependence on the results of subproblems and include, among others, the classic DP algorithms for matrix chain multiplication, boolean parenthesization, optimal polygon triangulation, and construction of optimal binary search trees. We obtain asymptotically tight lower bounds for the I/O complexity of these algorithms both for schedules in which no intermediate values are computed more than once and for general schedules allowing recomputation. Our analysis reveals that, for a range of values of the cache size $M$, recomputing intermediate values can lead to an asymptotically lower number of I/O operations. We further refine our analysis to derive an I/O lower bound for the Cocke-Younger-Kasami (CYK) algorithm [20, 47, 32] and an (almost) asymptotically matching upper bound.
Previous and related work.
I/O complexity was introduced in the seminal work by Hong and Kung [31]; it denotes the number of data transfers between the two levels of a memory hierarchy with a cache of $M$ words and a slow memory of unbounded size. Hong and Kung presented techniques to develop lower bounds for the I/O complexity of computations modeled by Computational Directed Acyclic Graphs (CDAGs).
While the “edge expansion technique” of [3], the “path routing technique” of [40], and the “closed dichotomy width” technique of [11] all yield I/O lower bounds that apply only to computational schedules for which no intermediate result is ever computed more than once (nr-computations), it is also important to investigate what can be achieved with recomputation. In fact, for some CDAGs, recomputing intermediate values reduces the space and/or the I/O complexity of an algorithm [39]. In [8], it is shown that some algorithms admit a portable schedule (i.e., a schedule that achieves optimal performance across memory hierarchies with different access costs) only if recomputation is allowed. Recomputation can also enhance the performance of simulations among networks (see [34] and references therein) and plays a key role in the design of efficient area-universal VLSI architectures. Several lower-bound techniques that allow recomputation have been presented in the literature, including the “S-partition technique” [31], the “S-span technique” [39], the “S-covering technique” [9] (which merges and extends aspects from both [31] and [39]), the “G-flow technique” [5, 6], and the “visit partition technique” [7]. However, to the best of our knowledge, none of these have been previously applied to the family of DP algorithms considered in this work.
A systematic study of Dynamic Programming was introduced by Bellman [4], and it has since become a popular problem-solving methodology. Galil and Park [27] produced a survey and classification of popular dynamic programming algorithms and optimization methods. Several results have been presented in the literature aimed at optimizing the running time of Dynamic Programming algorithms [28, 33, 46, 27]. Notably, Valiant [45] showed that context-free recognition can be carried out as fast as Boolean matrix multiplication.
Cherng and Ladner [13] provide a cache-oblivious divide-and-conquer algorithm and a cache-aware blocked algorithm for simple dynamic programs, a class which includes matrix chain multiplication and the construction of optimal binary search trees. Their approach is based on Valiant’s algorithm [45] and incurs $O\left(\frac{n^3}{B\sqrt{M}}\right)$ cache misses. In [16, 17], Choudhury and Ramachandran introduce the Gaussian Elimination Paradigm (GEP) that provides cache-oblivious algorithms for problems including all-pairs shortest path. The authors also provide a way to transform simple DP problems into GEP problems. A proof of the optimality of the proposed algorithm for GEP problems is provided based on a similar result given in [31] for the matrix multiplication problem. As we show, this result does not extend to simple DP algorithms, as fewer cache misses can be achieved in some scenarios. Cache-oblivious DP algorithms for other problems have been explored [36, 19, 22]. There has also been work on parallelizing cache-efficient DP algorithms [18, 14, 30, 44, 43, 25, 23, 12], experimental evaluations of the resulting performance improvement [38], and on the automated discovery of cache-efficient DP algorithms [15, 29].
Dunlop et al. [24] provide an empirical analysis of various strategies to reduce the number of accesses to the grammar for the CYK algorithm. There have also been other works on modifying a grammar representation to improve efficiency. Notably, Song et al. [42] provide an empirical analysis of different grammar binarization approaches to improve efficiency, while Lange and Leiß [35] argue for the merits of a modified version of CYK that works with grammars in binary normal form instead of Chomsky normal form.
Summary of results.
In this work, we analyze the I/O complexity of a family of DP algorithms that follow the general strategy outlined in Prototype Algorithm 1 when executed in a two-level storage hierarchy with $M$ words of cache. The analyzed algorithms share a similar structure of dependence on the results of subproblems while allowing for diverse implementation choices and optimal problem substructure.
We analyze both schedules in which no intermediate values are computed more than once (no-recomputation schedules) and general ones that allow for recomputation. For both settings, we derive an $\Omega\left(\frac{n^3}{B\sqrt{M}}\right)$ lower bound to the number of I/O operations required by sequential executions of these algorithms on an input of size $n$ in a two-level storage hierarchy with a cache of size $M \le cn$, for some positive constant value $c$, and such that $B$ memory words can be moved between consecutive memory locations of the cache and of the slow memory with a single I/O operation. However, we show that while no-recomputation schedules require $\Omega\left(\frac{n^3}{B\sqrt{M}}\right)$ I/O operations even when using a cache of size at least $cn$ and at most $c'n^2$ (for appropriate constants $c$ and $c'$), schedules using recomputation achieve an asymptotic reduction of the number of required I/O operations by a factor $\Theta\left(\frac{n^2}{\sqrt{M}}\right)$. This is particularly significant as in many cases of interest (e.g., Matrix Multiplication [2, 5, 31], Fast Fourier Transform [31], Integer Multiplication [6]) recomputation has been shown to enable a reduction of the I/O cost by at most a constant multiplicative factor. These results are obtained by analyzing a representation of the considered algorithms as Computational Directed Acyclic Graphs (CDAGs) whose structure captures the structure of dependence between subproblems, which is common to the DP algorithms considered in our work. We provide a general construction of such CDAGs and analyze their internal connection properties.
We refine our general method to obtain an $\Omega\left(\frac{r\,n^3}{B\sqrt{M}}\right)$ lower bound for the I/O complexity of the Cocke-Younger-Kasami algorithm when deciding whether an input string of length $n$ is a member of the language generated by a given Context-Free Grammar with $r$ rules with distinct right-hand sides not including terminals. While the CYK algorithm exhibits a structure of subproblem dependencies similar to that of the previously analyzed problems, it presents several challenging differences, which we address by modifying our lower-bound methods and by further refining its CDAG representation. Finally, we present a cache-oblivious implementation of the CYK algorithm for which, depending on the composition of the rules of the considered CFG, the number of I/O operations to be executed almost asymptotically matches the lower bound.
2 Preliminaries
We consider a family of Dynamic Programming (DP) algorithms following the general strategy outlined in Prototype Algorithm 1: Given an input of size $n$, the solution, denoted as $DP(1,n)$, is computed bottom up. First, the values $DP(i,i)$ corresponding to the solutions of subproblems of input size one, for $1 \le i \le n$, are initialized to some set default value. The results of subproblems corresponding to parts of the input of growing size $\ell$, that is $DP(i,j)$ for $j - i + 1 = \ell$ and $1 \le i < j \le n$, are computed by first evaluating the compositions of pairs of subproblems $DP(i,k)$ and $DP(k+1,j)$, for $i \le k < j$, and then composing them according to the optimal subproblem structure used in the algorithm.
Depending on the specific problem, the default initialization values (INITIALIZATION_VALUE in the Prototype Algorithm) of the $DP(i,i)$, the way that the pairs $DP(i,k)$ and $DP(k+1,j)$ are combined (COMBINE), and the way these results are composed to evaluate $DP(i,j)$ (AGGREGATE) may differ. However, AGGREGATE is assumed to be commutative and associative. Besides these differences, all the considered algorithms share the subproblem dependence structure outlined in the Prototype Algorithm, for which a visual representation is provided in Figure 1(a). Examples of algorithms following such a structure include the classic DP algorithms for the matrix chain multiplication problem, optimal convex polygon triangulation, the construction of optimal binary search trees, and the Cocke-Younger-Kasami algorithm (Algorithm 5).
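To make the recursion concrete, the following Python sketch instantiates the structure of Prototype Algorithm 1; the helper names init, combine, and aggregate stand in for the problem-specific INITIALIZATION_VALUE, COMBINE, and AGGREGATE and are illustrative assumptions, not part of the original pseudocode.

def prototype_dp(inputs, init, combine, aggregate):
    # DP[(i, j)] holds the solution of the subproblem on input positions i..j.
    n = len(inputs)
    DP = {(i, i): init(inputs[i - 1]) for i in range(1, n + 1)}
    for size in range(2, n + 1):                # subproblems of growing size
        for i in range(1, n - size + 2):
            j = i + size - 1
            acc = None
            for k in range(i, j):               # all pairwise decompositions
                leaf = combine(DP[(i, k)], DP[(k + 1, j)])
                acc = leaf if acc is None else aggregate(acc, leaf)
            DP[(i, j)] = acc
    return DP[(1, n)]

For instance, instantiating init as a constant-zero function, combine as the sum of the two subproblem values plus the appropriate matrix-dimension product, and aggregate as min yields the classic matrix chain multiplication recurrence.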
We provide a general analysis of the I/O complexity of algorithms following the structure of the Prototype Algorithm, and we then refine our analysis for the mentioned algorithms of interest.
Computational Directed Acyclic Graphs.
Our analysis is based on modeling the execution of the DP algorithms of interest as a Computational Directed Acyclic Graph (CDAG) $G = (V, E)$. Each vertex $v \in V$ represents either an input value or the result of an operation (i.e., an intermediate result or one of the output values) which is stored using a single memory word. The directed edges in $E$ represent data dependencies. That is, vertices $u, v \in V$ are connected by an edge directed from $u$ to $v$ if and only if the value corresponding to $u$ is an operand of the unit-time operation computing the value corresponding to $v$. Vertex $u$ is said to be a predecessor of $v$, and $v$ is said to be a successor of $u$. For a given vertex $v$, the set of its predecessors (resp., successors) is defined as $pre(v) = \{u \in V : (u,v) \in E\}$ (resp., $suc(v) = \{u \in V : (v,u) \in E\}$). We refer to vertices with no predecessors (resp., successors) as the input (resp., output) vertices of a CDAG. We say that $G' = (V', E')$ is a sub-CDAG of $G$ if $V' \subseteq V$ and $E' \subseteq E \cap (V' \times V')$. We say that two sub-CDAGs $G'$ and $G''$ of $G$ are vertex-disjoint if $V' \cap V'' = \emptyset$.
I/O model.
We assume that sequential computations are executed on a system with a two-level memory hierarchy, consisting of a cache of size $M$ (measured in memory words) and a slow memory of unlimited size. An operation can be executed only if all its operands are in the cache. We assume that the processor is equipped with standard elementary logic and algebraic operations. Data can be moved from the slow memory to the cache by read operations, and, in the other direction, by write operations. These operations are called I/O operations. We assume that the input data is stored in the slow memory at the beginning of the computation. The evaluation of a CDAG in this model can be analyzed using the “red-blue pebble game” [31]. The number of I/O operations executed when evaluating a CDAG depends on the “computational schedule”, that is, on the order in which vertices are evaluated and on which values are kept in/discarded from the cache. The I/O complexity of a CDAG corresponding to the execution of algorithm $\mathcal{A}$ for an input of size $n$, denoted as $IO_{\mathcal{A}}(n, M)$, is defined as the minimum number of I/O operations over all possible computational schedules. We further consider a generalization of this model known as the “External Memory Model” by Aggarwal and Vitter [1], where $B$ memory words can be moved between consecutive memory locations of the cache and of the slow memory with a single I/O operation; the corresponding I/O complexity is denoted as $IO_{\mathcal{A}}(n, M, B)$.
3 CDAG construction
In this section, we give the construction of a family of CDAGs representing the execution of a class of DP algorithms exhibiting a subproblem optimality structure such as the one outlined for Prototype Algorithm 1.
Given input size $n$, the CDAG $C_n$ corresponding to the execution of $\mathcal{A}$ is constructed using the CDAG in Figure 1(a) as the base structure. The actual CDAG is obtained by refining the basic structure, replacing each vertex corresponding to a subproblem $DP(i,j)$ with a directed binary tree DAG with $j - i$ leaves and with all edges directed towards the root. The root vertex ($K$-vertex for short) of the tree DAG, henceforth referred to as $K_{i,j}$, corresponds to the computation of the solution of the associated subproblem $DP(i,j)$. The $k$-th “leaf vertex” ($\ell$-vertex for short) of the tree, for $i \le k < j$, results from the combination of subproblems $DP(i,k)$ and $DP(k+1,j)$ and has as predecessors the $K$-vertices of the tree DAGs corresponding respectively to $DP(i,k)$ and $DP(k+1,j)$ (i.e., $K_{i,k}$ and $K_{k+1,j}$); a visual representation is provided in Figure 1(b). Notice that binary trees corresponding to different $K$-vertices are vertex-disjoint. For each such tree DAG, we say that its $\ell$-vertices belong to the root $K_{i,j}$.
Our lower bound results hold for any possible structure of these tree DAGs. This is a feature of our model meant to accommodate a variety of possible implementations of the Prototype Algorithm and of any other DP algorithm sharing the same substructure.
As an example, for the classic DP algorithm for the matrix chain multiplication problem (presented in the appendix of the extended online version [21]), the root vertex $K_{i,j}$ of each tree sub-CDAG corresponds to the computation of the minimum number of scalar multiplications required to compute the chain product of the matrices from the $i$-th to the $j$-th, and the $k$-th leaf, for $i \le k < j$, corresponds to the computation of $DP(i,k) + DP(k+1,j) + p_{i-1}p_k p_j$, where the $i$-th matrix has dimensions $p_{i-1} \times p_i$.
We denote the family of CDAGs constructed in such a way, with $n$ input vertices and differing in the tree sub-DAGs used to accumulate the values associated with the $\ell$-vertices into the values associated with the corresponding $K$-vertex, as $\mathcal{C}_n$. This family of CDAGs captures the dependence structure between subproblems common to the family of DP algorithms that are the focus of our analysis.
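As an illustration of the dependence structure captured by $\mathcal{C}_n$, the following sketch (a hypothetical helper, not part of the construction itself) enumerates, for every subproblem, the interacting pairs of $K$-vertices feeding the $\ell$-vertices of its tree sub-CDAG; any binary tree over the listed leaves yields a member of the family.

def cdag_leaf_predecessors(n):
    # For each K-vertex K(i, j), list the predecessor pairs of its l-vertices:
    # the k-th leaf of the tree rooted at K(i, j) combines K(i, k) and K(k+1, j).
    pairs = {}
    for size in range(2, n + 1):
        for i in range(1, n - size + 2):
            j = i + size - 1
            pairs[(i, j)] = [((i, k), (k + 1, j)) for k in range(i, j)]
    return pairs

For $n = 4$, the root $K_{1,4}$ has three $\ell$-vertices, fed by the interacting pairs $(K_{1,1}, K_{2,4})$, $(K_{1,2}, K_{3,4})$, and $(K_{1,3}, K_{4,4})$.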
Analysis of CDAG properties.
Given a CDAG from the family $\mathcal{C}_n$, let $K$ (resp., $L$) denote the set of its $K$-vertices (resp., $\ell$-vertices). Then, $|K| = n(n+1)/2$ and $|L| = (n^3 - n)/6$.
The “$i$-th row” (resp., the “$j$-th column”) of $K$, written $R_i$ (resp., $C_j$), is defined as the subset of $K$ including all of the $K$-vertices corresponding to the computation of subproblems $DP(i,j)$ for $i \le j \le n$ (resp., $1 \le i \le j$). For any $K_{i,j}$, we have $K_{i,j} \in R_i$ and $K_{i,j} \in C_j$. By construction, the rows $R_i$ (resp., the columns $C_j$) partition $K$, and $|R_i| = n - i + 1$ (resp., $|C_j| = j$) for $1 \le i, j \le n$. A visual representation of rows and columns can be found in Figure 2(a).
By construction, each $\ell$-vertex has a distinct pair of $K$-vertices as its predecessors. Such a pair is said to interact in the computation. Given $K_{a,b}$, it interacts with all and only the vertices $K_{b+1,j}$ for $b < j \le n$, and the vertices $K_{i,a-1}$ for $1 \le i < a$. All such interactions correspond to the computation of an $\ell$-vertex belonging to a distinct sub-tree, corresponding to the subproblem $DP(a,j)$ or, respectively, $DP(i,b)$ (i.e., to the $K$-vertex $K_{a,j}$ or $K_{i,b}$). Pairs of vertices in the same row (or column) do not interact.
We define $Q_k = C_k \cup R_{k+1}$, for $1 \le k \le n-1$, as the subsets of $K$ such that each vertex in $C_k$ interacts with all the vertices of $R_{k+1}$ (a visual representation can be found in Figure 2(b)). By construction, $|Q_k| = n$. A vertex $K_{1,j}$ (resp., $K_{i,n}$) is only a member of $Q_j$ (resp., $Q_{i-1}$). All remaining vertices $K_{i,j}$ are members of both $Q_j$ and $Q_{i-1}$, which are distinct as, by construction, for any $K_{i,j}$ we have $i \le j$ and thus $i - 1 \ne j$. We refer to the collection of sets $\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_{n-1}\}$ as the “$\mathcal{Q}$-cover” of the $K$-vertices of CDAG $C_n$. By its definition, and by the properties of the interactions between vertices according to placement in rows and columns discussed in the previous paragraphs, we have the following property:
Lemma 1.
Given a CDAG $C_n \in \mathcal{C}_n$, let $\mathcal{Q}$ denote the $\mathcal{Q}$-cover of its $K$-vertices. For any set $Q_k \in \mathcal{Q}$, there are at most $|Q_k|^2/4 = n^2/4$ pairs of vertices $(u, v)$, where $u \in C_k$ and $v \in R_{k+1}$, that interact as predecessors of distinct $\ell$-vertices, each of which belongs to a distinct $K$-vertex.
4 I/O analysis for computations of CDAGs in $\mathcal{C}_n$
We analyze the I/O complexity of CDAGs in the class $\mathcal{C}_n$ according to the Red-Blue Pebble Game by Hong and Kung [31]. To simplify the presentation, we assume that $n$ and $M$ are powers of $2$, i.e., $n = 2^a$ and $M = 2^b$ for integers $a$ and $b$. If that is not the case, our analysis can be adapted with minor, albeit tedious, modifications resulting in different constant terms.
4.1 I/O lower bound for computations with no recomputation
In this section, we present a lower bound to the I/O complexity of algorithms corresponding to CDAGs in $\mathcal{C}_n$, assuming that no intermediate value is ever recomputed (i.e., under the “no-recomputation assumption”). We first introduce a useful technical lemma:
Lemma 2.
Given non-negative real numbers $y_1, y_2, \ldots, y_m$ such that $\sum_{i=1}^{m} \min\{q, y_i^2/4\} \ge L$, where $q > 0$ and $L \ge q$, we have $\sum_{i=1}^{m} y_i \ge 2L/\sqrt{q}$.
Proof.
The lower bound comes from the intuitive strategy of making each $y_i$ as large as it can usefully be, i.e., assuming for simplicity that $q$ divides $L$, assigning $y_i = 2\sqrt{q}$ for $L/q$ of the indices and $y_i = 0$ for the remaining ones. In this case, $\sum_{i=1}^{m} y_i = 2\sqrt{q}\,L/q = 2L/\sqrt{q}$. To show the optimality of this strategy, we show that any other configuration is suboptimal.
Consider any other assignment of $y_1, \ldots, y_m$ that is not just a permutation of the same values. By assumption, any such assignment must have at least two non-zero values $y_a \le y_b < 2\sqrt{q}$. Let $\delta = \min\{y_a, 2\sqrt{q} - y_b\}$ and consider the assignment obtained by replacing $y_a$ with $y_a - \delta$ and $y_b$ with $y_b + \delta$. Since $g(\delta) = (y_a - \delta)^2 + (y_b + \delta)^2$ defines an upwards-facing parabola with an axis of symmetry about $\delta = (y_a - y_b)/2 \le 0$, we have $g(\delta) \ge g(0)$ for every $\delta \ge 0$; hence the new assignment still satisfies $\sum_{i=1}^{m} \min\{q, y_i^2/4\} \ge L$ while leaving $\sum_{i=1}^{m} y_i$ unchanged, and it has either one fewer non-zero value or one more value equal to $2\sqrt{q}$. Repeating this exchange eventually yields an assignment with at most one value strictly between $0$ and $2\sqrt{q}$ and with a total sum no larger than that of the original assignment; a direct calculation shows that any such assignment has sum at least $2L/\sqrt{q}$, showing that the original assignment does not achieve a sum smaller than $2L/\sqrt{q}$.
The following property captures the relation between subsets of $\ell$-vertices and their number of distinct $K$-vertex predecessors:
Lemma 3.
Let $L' \subseteq L$ be such that the vertices in $L'$ belong to the tree sub-CDAGs of at most $q$ distinct $K$-vertices. The set $Y$ containing all $K$-vertices that are predecessors of at least one vertex in $L'$ has cardinality at least $|L'|/\sqrt{q}$.
Proof.
Consider the $\mathcal{Q}$-cover $\mathcal{Q} = \{Q_1, \ldots, Q_{n-1}\}$ of $K$ as defined in Section 3, and let $Y_k = Y \cap Q_k$. As any $K$-vertex can be a member of at most two distinct $Q_k$'s, we have:
$\sum_{k=1}^{n-1} |Y_k| \le 2|Y|$.  (1)
By definition, each vertex in $L'$ has a distinct pair of $K$-vertex predecessors which, by construction, are both included in exactly one set $Q_k$. Let $f_k$ be the number of vertices from $L'$ with both predecessors in $Y_k$. By the assumptions we have:
$\sum_{k=1}^{n-1} f_k = |L'|$.  (2)
By Lemma 1, for any $Q_k$, the interacting pairs of $K$-vertices in it are the predecessors of distinct $\ell$-vertices, each belonging to the tree sub-CDAG of a distinct $K$-vertex. From the assumption that the vertices in $L'$ belong to at most $q$ distinct roots, we have:
$f_k \le q$, for $1 \le k \le n-1$.  (3)
Furthermore, from Lemma 1, we know that the vertices in $Y_k$ can interact to produce at most $|Y_k|^2/4$ vertices in $L'$. Thus,
$f_k \le |Y_k|^2/4$, for $1 \le k \le n-1$.  (4)
By (1), (4), (2), (3), and Lemma 2 (in this order) we have:
$|Y| \ge \frac{1}{2}\sum_{k=1}^{n-1} |Y_k| \ge \frac{1}{2}\cdot\frac{2|L'|}{\sqrt{q}} = \frac{|L'|}{\sqrt{q}}.$
Lemma 3 allows us to claim that, in order to compute a set of $\ell$-vertices, it is necessary to access a minimum number of $K$-vertex values which, due to the no-recomputation assumption, must either be in the cache or be loaded into it using read I/O operations.
Theorem 4.
For any CDAG $C_n \in \mathcal{C}_n$, let $\mathcal{A}$ denote an algorithm corresponding to it. The I/O complexity of $\mathcal{A}$, when run without recomputing any intermediate value, using a cache memory of size $M$, and where with each I/O operation it is possible to move up to $B$ memory words stored in consecutive memory locations from the cache to slow memory or vice versa, is:
$IO_{\mathcal{A}}(n, M, B) \ge \max\left\{\frac{n}{B},\ \frac{M}{B}\left\lfloor\frac{n^3 - n}{48\,M^{3/2}}\right\rfloor\right\}.$  (5)
Proof.
We prove the result for $B = 1$. The result then trivially generalizes to a generic $B$, as each I/O operation can move at most $B$ memory words.
The fact that $IO_{\mathcal{A}}(n, M, 1) \ge n$ follows trivially from our assumption that the input is initially stored in the slow memory and must therefore be loaded into the cache at least once using read I/O operations.
Let $\mathcal{S}$ be any computation schedule for the sequential execution of $\mathcal{A}$ using a cache of size $M$ in which no value corresponding to any vertex of $C_n$ is computed more than once. We partition $\mathcal{S}$ into non-overlapping segments $\sigma_1, \sigma_2, \ldots$ such that during each $\sigma_i$ exactly $8M^{3/2}$ values corresponding to distinct $\ell$-vertices, denoted as $L_i$, are evaluated. Since $|L| = (n^3 - n)/6$, there are $\lfloor (n^3 - n)/(48\,M^{3/2})\rfloor$ such segments. Given $\sigma_i$, we refer to the set of $K$-vertices to whom at least one of the vertices in $L_i$ belongs as $W_i$. These correspond to vertices which are active during $\sigma_i$. For each interval, we denote as $w_i$ the number of $K$-vertices whose value is completely evaluated during $\sigma_i$. As no value is computed more than once, $\sum_i w_i \le |K|$. Let $g_i$ denote the number of I/O operations executed during $\sigma_i$. We consider two cases:
-
The vertices of $L_i$ belong to at least $2M$ distinct $K$-vertices: For each of the values corresponding to $K$-vertices that were (partially or completely) computed in $\sigma_i$, a partial accumulator of the values corresponding to their leaves computed during $\sigma_i$ (or the completed value itself) must either be in the cache at the end of $\sigma_i$ or must be written into slow memory by means of a write I/O operation. As at most $M$ such values can reside in the cache, $g_i \ge 2M - M = M$.
-
The vertices of $L_i$ belong to at most $2M$ distinct $K$-vertices: By Lemma 3, the vertices in $L_i$ have at least $|L_i|/\sqrt{2M} = 4\sqrt{2}M$ $K$-vertex predecessors, of whom at most $2M$ are computed during $\sigma_i$. As in the considered schedule values are not computed more than once, each of the remaining at least $(4\sqrt{2} - 2)M$ values corresponding to predecessor vertices of $L_i$ must either be in the cache at the beginning of $\sigma_i$, or be read into the cache by means of a read I/O operation. Hence, $g_i \ge (4\sqrt{2} - 2)M - M \ge M$.
The I/O requirement of the algorithm is obtained by combining the cost of each non-overlapping segment:
$IO_{\mathcal{A}}(n, M, 1) \ge M\left\lfloor\frac{n^3 - n}{48\,M^{3/2}}\right\rfloor.$
By Theorem 4, we have that for $M \le cn^2$, for a sufficiently small constant value $c$, computations of CDAGs in $\mathcal{C}_n$ in which no value is recomputed require $\Omega\left(\frac{n^3}{B\sqrt{M}}\right)$ I/O operations.
On the tightness of the bound.
In [13], Cherng and Ladner presented Valiant’s DP-Closure Algorithm, henceforth referred to as VDP, which allows computing algorithms following the general structure outlined in Prototype Algorithm 1 cache-obliviously and without recomputation. When $M = O(n^2)$, VDP executes $O\left(\frac{n^3}{B\sqrt{M}}\right)$ I/O operations, and for $M = \Omega(n^2)$ it executes $O\left(\frac{n^2}{B}\right)$ I/O operations. Thus, we can conclude that our I/O lower bound for schedules with no recomputation provided in Theorem 4 is asymptotically tight.
VDP utilizes a divide-and-conquer approach, making recursive calls to smaller subproblems. Here we give a simplified presentation of Valiant’s DP-Closure Algorithm. Please refer to Cherng and Ladner [13] for a detailed presentation.
Consider any algorithm $\mathcal{A}$ following the Prototype Algorithm 1 with input of size $n$, where we assume $n = 2^a$ for some integer $a$ (this can always be made the case by padding). All subproblems computed by the algorithm are associated with entries of an $n \times n$ matrix $D$ so that $DP(i,j)$ is represented at $D[i][j]$. Initially, the values corresponding to $DP(i,i)$ for $1 \le i \le n$ are computed from the input while the remaining entries in $D$ are set to the LEAST_OPTIMAL_VALUE (e.g., 0). To fully compute all the subproblems (i.e., to compute the DP-closure), VDP splits $D$ into equally-sized submatrices of dimension $n/2 \times n/2$ labeled as follows:
$D = \begin{pmatrix} D_{11} & D_{12} \\ \cdot & D_{22} \end{pmatrix}.$
We omit the representation of the lower-triangular submatrix as it will have all of its entries set to LEAST_OPTIMAL_VALUE. Valiant’s DP-Closure Algorithm (whose pseudocode is presented in Algorithm 2) is given the matrix $D$ as input, with the only non-zero values corresponding to subproblems $DP(i,i)$ for $1 \le i \le n$. The values of the $DP(i,j)$ (i.e., the $K$-vertices) corresponding to the top-left and bottom-right submatrices $D_{11}$ and $D_{22}$ are computed through recursive calls, exploiting the fact that these submatrices correspond to a smaller instantiation of the same DP problem. The remaining values in $D_{12}$ are then computed by invoking a subroutine called “Valiant’s Star Algorithm” (VSTAR).
VSTAR is given as input a matrix $D$ where the subproblems in the top-left and bottom-right quadrants are already computed. The algorithm then recursively computes the values corresponding to the remaining entries of $D$ (i.e., the values of the $K$-vertices). The pseudocode for this function can be found in Algorithm 3. We represent COMBINE with “$\otimes$” and AGGREGATE with “$\oplus$”. Operations involving these operators in Lines 6, 8, 10, and 11 are computed using a subroutine called “Matrix multiply and accumulate” (MMA) that, given three equally sized square matrices $X$, $Y$, and $Z$, computes $X = X \oplus (Y \otimes Z)$. In the context of VDP, $X$ contains partially computed values $DP(i,j)$ (i.e., corresponding to $K$-vertices), and $Y \otimes Z$ represents the computation of combinations of subproblems contributing to $DP(i,j)$ (i.e., $DP(i,k) \otimes DP(k+1,j)$ for $i \le k < j$), that is, the values corresponding to the $\ell$-vertices, which are then used to update $X$. A recursive in-place implementation of this function is presented in detail in [13, 26].
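The recursive structure of VDP can be sketched in Python as follows. The sketch mirrors our simplified presentation: two recursive closures on the diagonal quadrants followed by a step that fills the off-diagonal quadrant. For readability, the star step is replaced here by a direct accumulation with the same dependency structure; the cache-oblivious VSTAR of [13] instead performs this step through recursion and MMA calls.

def dp_closure(D, lo, hi, otimes, oplus):
    # Fills D[i][j] for all lo <= i <= j <= hi, assuming the diagonal entries
    # D[i][i] are already initialized (0-based indices).
    if hi == lo:
        return
    mid = (lo + hi) // 2
    dp_closure(D, lo, mid, otimes, oplus)        # close the top-left quadrant
    dp_closure(D, mid + 1, hi, otimes, oplus)    # close the bottom-right quadrant
    # Star step: compute D[i][j] with i <= mid < j, in an order that respects
    # the dependencies (rows bottom-up, columns left-to-right).
    for i in range(mid, lo - 1, -1):
        for j in range(mid + 1, hi + 1):
            acc = None
            for k in range(i, j):
                leaf = otimes(D[i][k], D[k + 1][j])
                acc = leaf if acc is None else oplus(acc, leaf)
            D[i][j] = acc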
4.2 I/O lower bound for computations with allowed recomputation
We extend our I/O lower bound analysis to general computations in which values may be recomputed by combining it with the dominator I/O lower bound technique by Hong and Kung [31].
Definition 5 ([31]).
Given a CDAG $G = (V, E)$, a set $D \subseteq V$ is a dominator set for $V' \subseteq V$ if every path directed from any input vertex of $G$ to a vertex in $V'$ contains a vertex of $D$.
Dominator sets of $K$-vertices of a CDAG $C_n \in \mathcal{C}_n$ must satisfy the following property:
Lemma 6.
Given $C_n \in \mathcal{C}_n$, let $Z$ be a subset of its $K$-vertices, where each $K_{i,j} \in Z$ is the $K$-vertex corresponding to subproblem $DP(i,j)$. Any dominator set $D$ of $Z$ has minimum cardinality $\min\left\{|Z|,\ \min_{K_{i,j} \in Z}(j - i)\right\}$.
Proof.
Suppose that, for every $K_{i,j} \in Z$, $D$ contains a vertex internal to the tree-CDAG rooted at $K_{i,j}$. As the tree-CDAGs of different $K$-vertices are disjoint, this would imply $|D| \ge |Z|$.
Otherwise, there must be some $K_{i,j} \in Z$ such that $D$ does not contain any of its internal vertices. Let $\ell_i, \ell_{i+1}, \ldots, \ell_{j-1}$ denote the set of $\ell$-vertices corresponding to $K_{i,j}$, where $\ell_k$ is the leaf combining $DP(i,k)$ and $DP(k+1,j)$. It is sufficient to show that there are $j - i$ vertex-disjoint paths from the inputs of $C_n$ to vertices in $\{\ell_i, \ldots, \ell_{j-1}\}$. By construction, each leaf $\ell_k$ has a predecessor $K_{i,k}$ belonging to a unique column $C_k$. In turn, each of these predecessors is connected through its internal tree-CDAG to the input vertex in the same column ($K_{k,k}$). In this manner, paths $p_k$ for $i \le k < j$ are obtained. Since each path contains $K$-vertices from a different column, and the tree-CDAGs of different $K$-vertices are disjoint, the paths outlined above are vertex-disjoint.
Theorem 7.
For any CDAG $C_n \in \mathcal{C}_n$, let $\mathcal{A}$ denote an algorithm corresponding to it. The I/O complexity of $\mathcal{A}$, when run using a cache memory of size $M \le cn$, for a sufficiently small positive constant $c$, and where with each I/O operation it is possible to move up to $B$ memory words stored in consecutive memory locations from cache to slow memory or vice versa, is:
$IO_{\mathcal{A}}(n, M, B) \ge \max\left\{\frac{n}{B},\ \frac{M}{B}\left\lfloor\frac{n^3}{256\,M^{3/2}}\right\rfloor\right\}.$  (6)
Proof.
To simplify our analysis, we only consider parsimonious execution schedules, i.e., schedules such that each time an intermediate result is computed, the result is then used to compute at least one of the values of which it is an operand before being removed from the memory (either the cache or the slow memory). Any non-parsimonious schedule $\mathcal{S}$ can be reduced to a parsimonious schedule $\mathcal{S}'$ by removing all the steps that violate the definition of parsimonious computation. $\mathcal{S}'$ therefore has no more computational or I/O operations than $\mathcal{S}$. Hence, restricting the analysis to parsimonious computations leads to no loss of generality.
We prove the result for the case $B = 1$. The result then trivially generalizes to a generic $B$. The fact that $IO_{\mathcal{A}}(n, M, 1) \ge n$ follows trivially from our assumption that the input is initially stored in slow memory and therefore must be loaded into the cache at least once using read I/O operations.
Let $L'$ be the set of $\ell$-vertices in $C_n$ whose predecessor $K$-vertices both correspond to subproblems $DP(i,j)$ such that $j - i + 1 \ge n/8$. By the construction, for each subproblem $DP(i,j)$ with $j - i + 1 \ge n/2$, the corresponding tree sub-CDAG has at least $n/4$ such $\ell$-vertices. Since the sub-CDAGs corresponding to the various $DP(i,j)$ do not share vertices, we have:
$|L'| \ge \frac{n}{4}\cdot\frac{n^2}{8} = \frac{n^3}{32}.$
Let $\mathcal{S}$ be any computation schedule for the sequential execution of $\mathcal{A}$ using a cache of size $M$. We partition $\mathcal{S}$ into segments $\sigma_1, \sigma_2, \ldots$ such that during each $\sigma_i$ exactly $8M^{3/2}$ values corresponding to distinct vertices in $L'$ are computed from their operands (i.e., not read from slow memory). Let $L'_i$ denote the set of vertices corresponding to these values. Below we argue that the number of I/O operations $g_i$ executed during each of the $\lfloor n^3/(256\,M^{3/2})\rfloor$ non-overlapping segments is at least $M$, from whence the statement follows.
Let $V_i$ denote the set of vertices corresponding to the values computed during $\sigma_i$. Clearly, $L'_i \subseteq V_i$. Let $D_i$ denote the set of vertices corresponding to either the at most $M$ values that are stored in the cache at the beginning of $\sigma_i$ or the values that are read into the cache during $\sigma_i$ by means of read I/O operations. Thus, $|D_i| \le M + g_i$. In order for it to be possible to compute the values corresponding to vertices in $V_i$ during $\sigma_i$, there must be no paths from input vertices of $C_n$ to vertices in $V_i$ that avoid $D_i$, that is, $D_i$ must be a dominator set of $V_i$.
We refer to the set of $K$-vertices to whom at least one of the vertices in $L'_i$ belongs as $W_i$. These correspond to vertices which are active during $\sigma_i$. We consider three mutually exclusive cases:
-
(a)
At least one of the values corresponding to a vertex $K_{i,j} \in W_i$ is entirely computed during $\sigma_i$. That is, no accumulator of the values corresponding to the leaves of $K_{i,j}$ is in the cache at the beginning of $\sigma_i$ or is loaded into it by means of a read I/O operation during $\sigma_i$. Thus, $D_i$ does not include any non-leaf vertex internal to the tree sub-CDAG rooted in $K_{i,j}$. Since $K_{i,j} \in W_i$, by definition of $L'$, it corresponds to a subproblem such that $j - i + 1 \ge n/4$. From Lemma 6, $|D_i| \ge n/4 - 1$, which implies $g_i \ge n/4 - 1 - M \ge M$ for $M \le cn$ with $c$ sufficiently small.
-
(b)
$|W_i| \ge 2M$: As none of the values corresponding to vertices in $W_i$ is entirely computed during $\sigma_i$ (case (a)), and as we are considering parsimonious computations, for each such value at least one partial accumulator of its value is either in the cache at the beginning (resp., end) of $\sigma_i$ or is loaded into the cache during (resp., saved to the slow memory during) $\sigma_i$. Hence $g_i \ge 2M - M = M$.
-
(c)
$|W_i| < 2M$: By Lemma 3, the vertices in $L'_i$ have at least $|L'_i|/\sqrt{2M} = 4\sqrt{2}M$ distinct $K$-vertex predecessors. Since $L'_i \subseteq L'$, by definition, all such vertices correspond to subproblems $DP(i,j)$ where $j - i + 1 \ge n/8$. As the values of these vertices are either in the cache at the beginning of $\sigma_i$, or read into the cache during $\sigma_i$, or computed during $\sigma_i$, $D_i$ must be a dominator set for the subset of these vertices computed during $\sigma_i$. From Lemma 6, we can thus conclude that $|D_i| \ge \min\{4\sqrt{2}M - M - g_i,\ n/8 - 1\}$, from whence $g_i \ge M$ for $M \le cn$ with $c$ sufficiently small.
In all three cases $g_i \ge M$; as there are $\lfloor n^3/(256\,M^{3/2})\rfloor$ non-overlapping segments, the stated bound follows.
By Theorem 7, we have that for $M \le cn$, where $c$ is a sufficiently small positive constant value, computations of CDAGs in $\mathcal{C}_n$ require $\Omega\left(\frac{n^3}{B\sqrt{M}}\right)$ I/O operations. This is in contrast with the result given in Theorem 4 for schedules with no recomputation, which exhibit $\Omega\left(\frac{n^3}{B\sqrt{M}}\right)$ I/O complexity for values of $M$ up to $c'n^2$ for an appropriately chosen constant $c'$.
On the tightness of the bound.
When allowing recomputation, if $M \le cn$, where $c$ is a sufficiently small positive constant, the number of I/O operations executed by Valiant’s DP-Closure Algorithm asymptotically matches our lower bound in Theorem 7. Thus, for $M \le cn$, allowing recomputation does not asymptotically change the required number of I/O operations. On the other hand, if $M = \Omega(n)$ with a suitably large constant, our lower bound simplifies to $\Omega(n/B)$, diverging from the $O\left(\frac{n^3}{B\sqrt{M}}\right)$ I/O complexity of Valiant’s DP-Closure Algorithm.
In Theorem 8, we show how, for any algorithm $\mathcal{A}$ corresponding to a CDAG in $\mathcal{C}_n$, given a cache memory of suitable size $M = \Omega(n)$, there exists an execution schedule in which the results of subproblems are computed according to a recursive top-down order and that executes only $O(n/B)$ I/O operations, thus asymptotically matching the general lower bound given in Theorem 7. Crucially, such a schedule involves the recomputation of intermediate results in order to reduce the number of I/O operations otherwise necessary, as established in Theorem 4.
Theorem 8.
Using a cache memory of size $M \ge cn$, for a suitably large positive constant $c$, there exists a computational schedule executing $\mathcal{A}$ incurring $O(n/B)$ I/O operations, where with each I/O operation it is possible to move up to $B$ memory words stored in consecutive memory locations from the cache to slow memory or vice versa.
Proof.
The proposed schedule follows the steps outlined in Algorithm 4: in the initialization phase, the $n$ input values are read into the cache using $\lceil n/B \rceil$ read I/O operations. These values are held in the cache until the very end of the schedule, occupying $n$ out of the $M$ available memory locations. Once the final result $DP(1,n)$ is computed, it is written to the secondary memory by means of a write I/O operation.
We argue that the invocation of the procedure COMPUTE$(1,n)$ uses at most $M - n$ cache memory locations, thus not requiring the execution of any additional I/O operation. Let $S(\ell)$ denote the memory usage of an invocation of COMPUTE on a subproblem of size $\ell$. In the base case, $S(2) = O(1)$, as COMPUTE simply combines and returns the values $DP(i,i)$ and $DP(i+1,i+1)$, which are already stored in the cache after the initialization phase. In the general case, due to the use of accumulators and to the sequence of recursive calls and the memory management used by the schedule, we have:
$S(\ell) \le S(\ell - 1) + O(1) = O(\ell).$
The lemma follows as $n + S(n) \le M$ for a suitable choice of the constant $c$.
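A minimal sketch of the schedule of Theorem 8 (with hypothetical helper names): after the initialization phase has loaded the inputs, COMPUTE evaluates any subproblem on demand, recomputing shared sub-results instead of storing them, so that the live state consists of one accumulator and one operand per recursion level.

def compute(i, j, inputs, combine, aggregate):
    # Recomputes DP(i, j) from the cached inputs. No intermediate value is
    # stored outside the recursion stack: shared subproblems are recomputed,
    # trading exponential work for O(n/B) I/O operations overall.
    if i == j:
        return inputs[i]          # held in the cache since initialization
    acc = None
    for k in range(i, j):
        left = compute(i, k, inputs, combine, aggregate)
        right = compute(k + 1, j, inputs, combine, aggregate)
        leaf = combine(left, right)
        acc = leaf if acc is None else aggregate(acc, leaf)
    return acc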
Interestingly, while the proposed computation schedule allows achieving an asymptotic polynomial reduction of the number of I/O operations by a multiplicative factor $\Theta\left(\frac{n^2}{\sqrt{M}}\right)$ using recomputation, this comes at the cost of an exponential increase of the number of computation operations executed ($2^{\Theta(n)}$), in comparison with Valiant’s DP-Closure Algorithm, which requires only $\Theta(n^3)$ computations.
4.3 I/O analysis for notable Dynamic Programming algorithms
The general result in Section 4 can be adapted to derive asymptotically tight bounds on the I/O complexity of classic Dynamic Programming algorithms for notable problems such as Matrix Chain Multiplication, Optimal Polygon Triangulation, and the construction of optimal binary search trees. This is achieved by mapping each of these algorithms onto Prototype Algorithm 1, defining COMBINE, AGGREGATE, and LEAST_OPTIMAL_VALUE appropriately. Due to space limitations, these results are presented in the online full version of this work [21, Appendix A].
5 I/O analysis for the Cocke-Younger-Kasami algorithm
In this section, we present I/O bounds for the Cocke-Younger-Kasami algorithm (CYK). First, we provide some background on the algorithm. We then discuss how our lower bound analysis method must be enhanced to better capture the complexity of this algorithm. Finally, we present an upper bound.
5.1 Background on the CYK algorithm
Definition 9 ([41]).
A Context-Free Grammar is a 4-tuple $G = (V, \Sigma, R, S)$ where
-
$V$ is a finite set called the variables;
-
$\Sigma$ is a finite set, disjoint from $V$, called the terminals;
-
$R$ is a finite relation in $V \times (V \cup \Sigma \cup \{\varepsilon\})^*$, where the $*$ represents the Kleene star operation and $\varepsilon$ denotes the empty string;
-
$S \in V$ is the start variable.
Elements of $R$ are referred to as rules and are generally given in the form $A \to w$, where $A \in V$ and $w \in (V \cup \Sigma \cup \{\varepsilon\})^*$. If $u, v \in (V \cup \Sigma)^*$ and if $A \to w$ is a rule of the grammar, we say that $uAv$ yields $uwv$, written as $uAv \Rightarrow uwv$. We say that $u$ derives $v$, written $u \Rightarrow^* v$, if $u = v$ or if there exists a sequence $u_1, u_2, \ldots, u_k$ for $k \ge 0$ such that $u \Rightarrow u_1 \Rightarrow \cdots \Rightarrow u_k \Rightarrow v$. The language of a CFG is the set of all strings that can be derived from the start variable $S$, that is $\{w \in \Sigma^* : S \Rightarrow^* w\}$. A CFG is in Chomsky Normal Form (CNF) if every rule is in the form $A \to BC$, or $A \to a$, or $S \to \varepsilon$, where $A$ is any variable in $V$, $B$ and $C$ can be any pair of variables in $V \setminus \{S\}$, and $a$ can be any terminal in $\Sigma$. Any CFG can be transformed into an equivalent one in CNF through a simple iterative algorithm [41]. We refer to the number of rules $|R|$ as the size of a CFG. We denote the set of binary rules (of the form $A \to BC$) in a CFG with $R_N$, the set of terminal rules (of the form $A \to a$) with $R_T$, and the number of distinct right-hand sides appearing in $R_N$ with $r$.
Given a CFG $G = (V, \Sigma, R, S)$ in CNF and a string $s = s_1 s_2 \cdots s_n$, the Cocke-Younger-Kasami (CYK) algorithm, henceforth referred to as $\mathcal{A}_{CYK}$, decides whether $s$ is part of the language of the CFG. We refer to the standard implementation provided in Algorithm 5. Let $V_{i,j}$ be the set of variables that derive the substring $s_i \cdots s_j$. CYK uses a DP approach to compute all sets $V_{i,j}$ for $1 \le i \le j \le n$ using an adaptation of Prototype Algorithm 1.
$\mathcal{A}_{CYK}$ determines whether $s$ is a member of the language generated by the given CFG by checking whether $S \in V_{1,n}$ (i.e., whether $S$ derives $s$).
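For reference, the standard implementation (Algorithm 5) can be rendered compactly in Python as follows; the representation of the grammar as two sets of tuples is an illustrative assumption.

def cyk(s, terminal_rules, binary_rules, start):
    # terminal_rules: set of (A, a) for rules A -> a
    # binary_rules:   set of (A, B, C) for rules A -> BC
    n = len(s)
    # V[i][j] is the set of variables deriving s[i..j] (0-based, inclusive).
    V = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(s):
        V[i][i] = {A for (A, a) in terminal_rules if a == ch}
    for size in range(2, n + 1):
        for i in range(0, n - size + 1):
            j = i + size - 1
            for k in range(i, j):
                for (A, B, C) in binary_rules:
                    if B in V[i][k] and C in V[k + 1][j]:
                        V[i][j].add(A)
    return start in V[0][n - 1]

For example, with rules S -> AB, A -> a, B -> b, the call cyk("ab", {("A", "a"), ("B", "b")}, {("S", "A", "B")}, "S") returns True.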
Memory representation model.
We assume that the variables in $V$ can be referenced by index and that such an index can be stored in a single memory word. We represent subsets of $V$ (the results of subproblems $V_{i,j}$) using a one-hot encoding with $\lceil |V|/w \rceil$ memory words, where $w$ denotes the number of bits in a memory word. Thus, given the representation of one such subset in memory, to check whether the variable with index $z$ is a member of the subset it is sufficient to access the $(z \bmod w)$-th bit of the $\lceil z/w \rceil$-th memory word.
We assume that each grammar rule is represented utilizing at most three memory words: one used to represent the index of the left-hand side and at most two to represent the right-hand side (either the indices of the two variables or the code for a terminal symbol).
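The representation can be sketched as follows, packing the one-hot encoding of a subset of $V$ into $\lceil |V|/w \rceil$ machine words; the word size of 64 bits is an assumed value for illustration.

W = 64  # assumed number of bits per memory word

def encode_subset(variable_indices, num_variables):
    # One-hot encoding of a subset of V into ceil(|V| / W) words.
    words = [0] * ((num_variables + W - 1) // W)
    for z in variable_indices:
        words[z // W] |= 1 << (z % W)
    return words

def contains(words, z):
    # Membership test touches a single word: the (z mod W)-th bit
    # of the (z div W)-th word.
    return (words[z // W] >> (z % W)) & 1 == 1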
5.2 I/O lower bound
While the CYK algorithm shares the underlying structure of relations between subproblems of the prototype algorithm, it presents several non-trivial complications due to the application of the grammar rules that, in turn, require a refinement of our analysis methods and, in particular, of the CDAG construction. In this section, we will assume that the set of grammar rules is such that no two rules have the same right-hand side. While this might not always hold in practice, our analysis can still be applied by considering a subset of rules of the considered grammar which satisfies this condition.
CYK CDAG construction.
The construction of the CDAG representing CYK’s execution for an input string with $n$ characters and a CFG $G$ in CNF, called $C^{CYK}_{n,G}$, follows a recursive structure which uses $\mathcal{C}_n$ as a basis. We describe the necessary modifications focusing on the sub-CDAG for a single subproblem. A visual depiction of the ensuing description is provided in Figure 3:
-
For the base of the construction, for each $1 \le i \le n$, the CDAG has $\lceil |V|/w \rceil$ vertices, each corresponding to a part of the encoding of the set of possible variables of the grammar deriving the $i$-th input character. These vertices serve as the input of the CDAG.
-
Each $K$-vertex of $\mathcal{C}_n$, corresponding to the computation of the solution of one of the subproblems $V_{i,j}$, is replaced by $\lceil |V|/w \rceil$ vertices, each representing a part of the encoding of the set of variables $V_{i,j}$. We refer to this set of vertices as “variable roots”.
-
Consider a variable root $\rho$ corresponding to subproblem $V_{i,j}$ and variable $A$. $\rho$ forms the root of a binary tree, which has as its leaves one vertex for each rule having $A$ as its left-hand side. We denote the set of these vertices as “grammar roots”. Note that, as only one variable can appear on the left-hand side of a rule, the trees composed in this way are vertex-disjoint. Further, for each subproblem there will be at least $r$ such vertices.
-
For each subproblem $V_{i,j}$, consider one of its grammar roots $u$, corresponding to a rule $A \to BC$. $u$ forms the root of a binary tree combining the results of $j - i$ “leaf vertices”. Each leaf vertex corresponds to a combination of the subproblems $V_{i,k}$ and $V_{k+1,j}$, for $i \le k < j$. Such a vertex has as predecessors the variable-root vertices of subproblems $V_{i,k}$ and $V_{k+1,j}$ encoding the variables $B$ and $C$ appearing in the right-hand side of rule $A \to BC$.
Given the CDAG $C^{CYK}_{n,G}$, we generalize the notion of rows (resp., columns) introduced in Section 3 by having row $R_i$ (resp., column $C_j$) include all vertices in $C^{CYK}_{n,G}$ corresponding to the encoding of the results of subproblems $V_{i,j}$ for $i \le j \le n$ (resp., $1 \le i \le j$). The definition of the $\mathcal{Q}$-cover generalizes to $C^{CYK}_{n,G}$ as well. We observe that a property analogous to that outlined in Lemma 1 holds for $C^{CYK}_{n,G}$:
Lemma 10.
Given a CDAG $C^{CYK}_{n,G}$, let $\mathcal{Q}$ denote its $\mathcal{Q}$-cover. Any set $Y$ of $K$-vertices in a set of $\mathcal{Q}$ contains at most $|Y|^2/4$ interacting pairs, each of which forms the predecessors of a unique $\ell$-vertex belonging to a unique grammar-root vertex.
Proof.
Consider any set $Y$ of $K$-vertices in $Q_k$ containing $a$ vertices in column $C_k$ and $b$ vertices in row $R_{k+1}$. Since vertices in the same row (or column) do not interact, there are at most $ab \le |Y|^2/4$ interacting pairs.
Consider any two such distinct interacting pairs, $(u_1, v_1)$ and $(u_2, v_2)$ (where $u_x$, for $x \in \{1, 2\}$, is a $K$-vertex corresponding to a variable $A_x$ in column $C_k$, and $v_x$ is a $K$-vertex corresponding to a variable $B_x$ in row $R_{k+1}$). Any leaf produced by $(u_x, v_x)$ must correspond to a rule with right-hand side “$A_x B_x$”. From the assumption that every rule in $R_N$ has a unique right-hand side, it follows that there is only one such rule, and therefore $(u_x, v_x)$ must produce a single leaf.
To show that the leaves produced by $(u_1, v_1)$ and $(u_2, v_2)$ belong to distinct grammar-root vertices, consider the following two cases:
-
1.
If the two pairs differ in the subproblems to which their vertices correspond, it follows from Lemma 1 that the leaves produced belong to different roots, and therefore must also belong to different grammar roots.
-
2.
Otherwise, it must be the case that $A_1 \ne A_2$ or $B_1 \ne B_2$ (since the two interacting pairs are assumed to be distinct). It follows that the two leaves produced correspond to different rules in $R_N$ and therefore belong to distinct grammar roots.
I/O lower bound proof.
We analyze the I/O complexity of any sequential computation of the CYK algorithm by analyzing the CDAG $C^{CYK}_{n,G}$ corresponding to the given CFG and to the input string to be considered. We first state two results that correspond, respectively, to modifications of Lemma 3 and of Lemma 6 for the family of CDAGs corresponding to executions of the CYK algorithm:
Lemma 11.
Given a CDAG $C^{CYK}_{n,G}$, let $L'$ be a set of $\ell$-vertices such that the vertices in $L'$ belong to the tree sub-DAGs of at most $q$ distinct grammar-root vertices. The set $Y$ that includes all $K$-vertices that are predecessors of at least one vertex in $L'$ has cardinality at least $|L'|/\sqrt{q}$.
Proof.
Consider the $\mathcal{Q}$-cover of $C^{CYK}_{n,G}$, with the adapted definition for the CYK CDAG discussed in Section 5.2, and let $Y_k = Y \cap Q_k$. As any $K$-vertex can be a member of at most two distinct $Q_k$'s, we have:
$\sum_{k=1}^{n-1} |Y_k| \le 2|Y|$.  (7)
By definition, each vertex in $L'$ has a distinct pair of $K$-vertex predecessors which, by construction, are both included in exactly one set $Q_k$. Let $f_k$ be the number of vertices from $L'$ with both predecessors in $Y_k$. By the assumptions we have:
$\sum_{k=1}^{n-1} f_k = |L'|$.  (8)
By Lemma 10, for any $Q_k$, the interacting pairs of $K$-vertices in it are the predecessors of distinct $\ell$-vertices, each belonging to the tree sub-CDAG of a distinct grammar-root vertex. From the assumption that the vertices in $L'$ belong to at most $q$ distinct grammar roots, we have:
$f_k \le q$, for $1 \le k \le n-1$.  (9)
Furthermore, from Lemma 10, we know that the vertices in $Y_k$ can interact to produce at most $|Y_k|^2/4$ vertices in $L'$. Thus,
$f_k \le |Y_k|^2/4$, for $1 \le k \le n-1$.  (10)
By (7), (10), (8), (9), and Lemma 2 (in this order) we have:
$|Y| \ge \frac{1}{2}\sum_{k=1}^{n-1} |Y_k| \ge \frac{|L'|}{\sqrt{q}}.$
The proof follows a reasoning analogous to that used to prove Lemma 3, adapted for the CDAG $C^{CYK}_{n,G}$ by using the property of the $K$-vertices of the CYK CDAG stated in Lemma 10 in place of Lemma 1.
Lemma 12.
Given $C^{CYK}_{n,G}$, let $Z$ be a subset of grammar-root vertices, where each $z \in Z$ is a grammar-root vertex corresponding to some subproblem $V_{i,j}$. Any dominator set of $Z$ has cardinality at least $\min\left\{|Z|,\ \min_{z \in Z}(j - i)\right\}$.
The proof corresponds to the one of Lemma 6, where the tree sub-CDAGs rooted in grammar-root vertices are considered instead of those rooted in the $K$-vertices of $\mathcal{C}_n$.
A lower bound on the I/O complexity of the CYK algorithm can thus be obtained by following steps analogous to those used in the proof of Theorem 7, with opportune adjustments matching the modifications in the enhanced CDAG described in this section.
Theorem 13.
Consider a string $s$ of length $n$ and a CFG $G$ in CNF. The number of I/O operations executed by $\mathcal{A}_{CYK}$ when deciding whether $s$ is a member of the language of the given grammar, using a machine equipped with a cache memory of size $M \le cn$, for a sufficiently small positive constant $c$, and such that $B$ contiguous memory words can be moved with each I/O operation, is bounded as:
$IO_{\mathcal{A}_{CYK}}(n, M, B) = \Omega\left(\max\left\{\frac{n}{B},\ \frac{r\,n^3}{B\sqrt{M}}\right\}\right),$
where $r$ denotes the number of distinct right-hand sides in $R_N$.
By Theorem 13, we have that for $M \le cn$, where $c$ is a sufficiently small positive constant value, computations of the CYK algorithm require $\Omega\left(\frac{r\,n^3}{B\sqrt{M}}\right)$ I/O operations. Note that if the considered CFG is such that $r = \Theta(|R|)$, our lower bound shows a direct dependence of the I/O complexity of the execution of the CYK algorithm on the size of the considered grammar. Indeed, there are many possible grammars for which this is the case. Finally, while the result of Theorem 13 holds for computations with recomputation, it is also possible to obtain an I/O lower bound for schedules in which recomputation is not allowed by modifying the analysis given in Theorem 4.
5.3 On the tightness of the bounds
We refine Valiant’s DP-Closure Algorithm [13] to obtain an I/O-efficient recursive divide-and-conquer implementation of the CYK algorithm, henceforth referred to as $\mathcal{A}_{CYK\text{-}VDP}$. While the CYK algorithm follows the general structure of Prototype Algorithm 1, and is hence amenable to the approach in [13], the analysis in [13] does not directly apply to the case in which the size of the given CNF grammar cannot be considered constant with respect to the size of the cache. Our algorithm manages I/O efficiently by opportunely iterating over the grammar rules and by removing redundant computations due to grammar rules with the same right-hand side.
Given an input string $s$ of length $n$ and a CFG $G$ in CNF, we use an $n \times n$ matrix $D$ to represent the values of the subproblems computed by the CYK algorithm (corresponding to $K$-vertices as defined in Section 5.2): $D[i][j]$ corresponds to the solution of subproblem $V_{i,j}$, i.e., a sequence of $|V|$ binary values, where the index corresponding to variable $A$ represents whether $A$ can derive the substring $s_i \cdots s_j$ of the input.
First, $\mathcal{A}_{CYK\text{-}VDP}$ computes the input values corresponding to subproblems $V_{i,i}$ for $1 \le i \le n$ (i.e., the values on the main diagonal of $D$): each character in the input string is compared with the right-hand side of each terminal rule in $R_T$, and the appropriate values are stored. The rest of the computation involves executing Valiant’s DP-Closure Algorithm (Algorithm 2) using a modified version of Valiant’s Star Algorithm (Algorithm 3), called “Valiant’s Star Algorithm for CYK” (VSTAR-CYK), which is presented in Algorithm 6.
VSTAR-CYK introduces two modifications over Valiant’s Star Algorithm: First, wherever Valiant’s Star Algorithm invokes Matrix Multiply and Accumulate, VSTAR-CYK iterates over the rules of the grammar with distinct right-hand sides, calling the subroutine with the variable roots corresponding to each rule in the grammar. Second, to avoid computing rules with identical right-hand sides more than once, let $R'$ denote a maximum-cardinality set of binary rules from $R_N$, each having a unique right-hand side. In this set, the corresponding left-hand sides are replaced with placeholder variables. For any subproblem, once the value corresponding to a rule in $R'$ is fully computed, it can be used to update the value corresponding to all variable roots producing the same right-hand side through a rule in $R_N$. This is done in the base case of the recursion, as reaching the base case implies that all leaves for the subproblem have been computed.
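The grouping of rules by right-hand side can be sketched as follows (a hypothetical helper under the data layout assumed above): it yields the $r$ placeholder rules of $R'$, one per distinct right-hand side, together with the map used to propagate each computed value to all left-hand sides sharing that right-hand side.

def dedup_by_rhs(binary_rules):
    # binary_rules: iterable of (A, B, C) for rules A -> BC.
    lhs_by_rhs = {}
    for (A, B, C) in binary_rules:
        lhs_by_rhs.setdefault((B, C), []).append(A)
    # One placeholder rule t per distinct right-hand side (B, C); note that
    # r = len(lhs_by_rhs).
    placeholder_rules = [(t, B, C) for t, (B, C) in enumerate(lhs_by_rhs)]
    return placeholder_rules, lhs_by_rhs

Once the value of placeholder rule t, standing for right-hand side (B, C), is fully computed for a subproblem, it is propagated to the variable roots of every left-hand side in lhs_by_rhs[(B, C)].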
Theorem 14.
Given an input string $s$ of length $n$ and a CFG $G$ in CNF, algorithm $\mathcal{A}_{CYK\text{-}VDP}$ decides whether $s$ is a member of the language of the given grammar. When run on a machine equipped with a cache of size $M$, the number of I/O operations executed by $\mathcal{A}_{CYK\text{-}VDP}$ can be bound as:
$IO_{\mathcal{A}_{CYK\text{-}VDP}}(n, M, B) = O\left(\frac{r\,n^3}{B\sqrt{M}} + \frac{|R_N|\,n^2}{B} + \frac{|R_T|\,n}{B}\right).$
Proof.
When computing $V_{i,i}$ for $1 \le i \le n$, $\mathcal{A}_{CYK\text{-}VDP}$ compares each character of the input string $s$ with the right-hand side of each terminal rule in $R_T$. Doing so incurs at most $O\left(\frac{|R_T|\,n}{B}\right)$ I/O operations.
In the following, we use $m$ to denote the dimension of the input (i.e., the size of the sub-matrix of $D$ being processed). The number of I/O operations executed by Valiant’s Star Algorithm for CYK is at most:
$Q_{VS}(m) \le 2\,Q_{VS}(m/2) + c\,r\,Q_{MMA}(m/2) + c\,\frac{|R_N|\,m^2}{B}.$  (11)
By [13, Lemma 2.1], the number of I/O operations executed by Matrix Multiply and Accumulate is:
$Q_{MMA}(m) = O\left(\frac{m^3}{B\sqrt{M}} + \frac{m^2}{B}\right).$  (12)
Solving the recursion in (11) using (12) yields:
$Q_{VS}(m) = O\left(\frac{r\,m^3}{B\sqrt{M}} + \frac{(r + |R_N|)\,m^2}{B}\right).$  (13)
Finally, by adapting the analysis in [13], we can bound the number of I/O operations executed by the proposed algorithm as:
$Q_{\mathcal{A}_{CYK\text{-}VDP}}(n) \le 2\,Q_{\mathcal{A}_{CYK\text{-}VDP}}(n/2) + Q_{VS}(n).$
Expanding the recurrence, by (13), and adding the cost of processing the input, we have:
$IO_{\mathcal{A}_{CYK\text{-}VDP}}(n, M, B) = O\left(\frac{r\,n^3}{B\sqrt{M}} + \frac{|R_N|\,n^2}{B} + \frac{|R_T|\,n}{B}\right).$
When $M \le cn$, where $c$ is a sufficiently small positive constant, the lower bound in Theorem 13 simplifies to $\Omega\left(\frac{r\,n^3}{B\sqrt{M}}\right)$, while the upper bound in Theorem 14 becomes $O\left(\frac{r\,n^3}{B\sqrt{M}} + \frac{|R_N|\,n^2}{B}\right)$. Hence, if $|R_N|\sqrt{M} = O(r\,n)$, the lower bound asymptotically matches the upper bound. When disallowing recomputation, a similar analysis shows that our upper and lower bounds match under the same condition for $M \le c'n^2$, where $c'$ is a sufficiently small positive constant.
6 Conclusion
This work contributed to the characterization of the I/O complexity of Dynamic Programming algorithms by establishing asymptotically tight lower bounds for a general class of DP algorithms sharing a common structure of sub-problem dependence. Our technique exploits common properties of the CDAGs corresponding to said algorithms, which makes it promising for the analysis of other families of recursive algorithms, in particular other families of DP algorithms of interest. The generality of our technique is further showcased by the ability to extend it to more complex algorithms, such as the Cocke-Younger-Kasami algorithm, for which we provide an (almost) asymptotically tight I/O lower bound and a matching algorithm.
Our analysis yields lower bounds both for computations in which no intermediate value is computed more than once and for general computations allowing recomputation. By doing so, we reveal how, when the size of the available cache is at least $cn$ and at most $c'n^2$, for suitable constants $c$ and $c'$, schedules using recomputation can achieve an asymptotic reduction of the number of required I/O operations by a factor $\Theta\left(\frac{n^2}{\sqrt{M}}\right)$. This is particularly significant as in many cases of interest (e.g., Matrix Multiplication, Fast Fourier Transform) recomputation has been shown to enable a reduction of the I/O cost by at most a constant multiplicative factor.
Although it is known that recomputation can decrease the space and I/O complexity of certain CDAGs, we are still far from characterizing those CDAGs for which recomputation proves effective. This overarching objective remains a challenge for any efforts aimed at developing a general theory of the communication requirements of computations.
References
- [1] Alok Aggarwal and Jeffrey S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, 1988. doi:10.1145/48529.48535.
- [2] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Communication-optimal parallel and sequential Cholesky decomposition. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, pages 245–252, 2009.
- [3] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Graph expansion and communication costs of fast matrix multiplication. Journal of the ACM (JACM), 59(6):1–23, 2013. doi:10.1145/2395116.2395121.
- [4] Richard Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–515, 1954.
- [5] Gianfranco Bilardi and Lorenzo De Stefani. The I/O complexity of Strassen’s matrix multiplication with recomputation. In Workshop on Algorithms and Data Structures, pages 181–192. Springer, 2017. doi:10.1007/978-3-319-62127-2_16.
- [6] Gianfranco Bilardi and Lorenzo De Stefani. The I/O complexity of Toom-Cook integer multiplication. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2034–2052. SIAM, 2019. doi:10.1137/1.9781611975482.123.
- [7] Gianfranco Bilardi and Lorenzo De Stefani. The DAG Visit Approach for Pebbling and I/O Lower Bounds. In In Proc. FSTTCS 2022, volume 250, pages 7:1–7:23, Dagstuhl, Germany, 2022. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.FSTTCS.2022.7.
- [8] Gianfranco Bilardi and Enoch Peserico. A characterization of temporal locality and its portability across memory hierarchies. In International Colloquium on Automata, Languages, and Programming, pages 128–139. Springer, 2001. doi:10.1007/3-540-48224-5_11.
- [9] Gianfranco Bilardi, Andrea Pietracaprina, and Paolo D’Alberto. On the space and access complexity of computation DAGs. In Graph-Theoretic Concepts in Computer Science: 26th International Workshop, WG 2000 Konstanz, Germany, June 15–17, 2000 Proceedings 26, pages 47–58. Springer, 2000. doi:10.1007/3-540-40064-8_6.
- [10] Gianfranco Bilardi and Franco P Preparata. Horizons of parallel computation. Journal of Parallel and Distributed Computing, 27(2):172–182, 1995. doi:10.1006/JPDC.1995.1080.
- [11] Gianfranco Bilardi and Franco P Preparata. Processor—time tradeoffs under bounded-speed message propagation: Part II, lower bounds. Theory of Computing Systems, 32(5):531–559, 1999. doi:10.1007/S002240000131.
- [12] Guy E. Blelloch and Yan Gu. Improved parallel cache-oblivious algorithms for dynamic programming and linear algebra. arXiv preprint, 2018. arXiv:1809.09330.
- [13] Cary Cherng and Richard E Ladner. Cache efficient simple dynamic programming. Discrete Mathematics & Theoretical Computer Science, 2005.
- [14] Rezaul Chowdhury, Pramod Ganapathi, Yuan Tang, and Jesmin Jahan Tithi. Provably efficient scheduling of cache-oblivious wavefront algorithms. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 339–350, 2017. doi:10.1145/3087556.3087586.
- [15] Rezaul Chowdhury, Pramod Ganapathi, Stephen Tschudi, Jesmin Jahan Tithi, Charles Bachmeier, Charles E Leiserson, Armando Solar-Lezama, Bradley C Kuszmaul, and Yuan Tang. Autogen: Automatic discovery of efficient recursive divide-and-conquer algorithms for solving dynamic programming problems. ACM Transactions on Parallel Computing (TOPC), 4(1):1–30, 2017. doi:10.1145/3125632.
- [16] Rezaul Alam Chowdhury and Vijaya Ramachandran. Cache-oblivious dynamic programming. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 591–600. Citeseer, 2006. URL: http://dl.acm.org/citation.cfm?id=1109557.1109622.
- [17] Rezaul Alam Chowdhury and Vijaya Ramachandran. The cache-oblivious gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation. In Proceedings of the nineteenth annual ACM Symposium on Parallel Algorithms and Architectures, pages 71–80, 2007. doi:10.1145/1248377.1248392.
- [18] Rezaul Alam Chowdhury and Vijaya Ramachandran. Cache-efficient dynamic programming algorithms for multicores. In Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures, pages 207–216, 2008. doi:10.1145/1378533.1378574.
- [19] Rezaul Alan Chowdhury, Hai-Son Le, and Vijaya Ramachandran. Cache-oblivious dynamic programming for bioinformatics. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3):495–510, 2008.
- [20] John Cocke. Programming languages and their compilers: Preliminary notes. New York University, 1969.
- [21] Lorenzo De Stefani and Vedant Gupta. On the I/O complexity of the CYK algorithm and of a family of related DP algorithms. arXiv preprint arXiv:2410.20337, 2024. doi:10.48550/arXiv.2410.20337.
- [22] Erik D Demaine, Andrea Lincoln, Quanquan C Liu, Jayson Lynch, and Virginia Vassilevska Williams. Fine-grained I/O complexity via reductions: New lower bounds, faster algorithms, and a time hierarchy. arXiv preprint arXiv:1711.07960, 2017. arXiv:1711.07960.
- [23] Xiangyun Ding, Yan Gu, and Yihan Sun. Parallel and (nearly) work-efficient dynamic programming. In Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, pages 219–232, 2024. doi:10.1145/3626183.3659958.
- [24] Aaron Dunlop, Nathan Bodenstab, and Brian Roark. Reducing the grammar constant: an analysis of CYK parsing efficiency. Technical report, Technical report CSLU-2010-02, OHSU, 2010.
- [25] Reza Farivar, Harshit Kharbanda, Shivaram Venkataraman, and Roy H Campbell. An algorithm for fast edit distance computation on GPUs. In 2012 Innovative Parallel Computing (InPar), pages 1–9. IEEE, 2012.
- [26] Matteo Frigo, Charles E Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-oblivious algorithms. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pages 285–297. IEEE, 1999. doi:10.1109/SFFCS.1999.814600.
- [27] Zvi Galil and Kunsoo Park. Dynamic programming with convexity, concavity and sparsity. Theoretical Computer Science, 92(1):49–76, 1992. doi:10.1016/0304-3975(92)90135-3.
- [28] Te Chiang Hu and Man Tak Shing. Computation of matrix chain products. Part I. SIAM Journal on Computing, 11(2):362–373, 1982. doi:10.1137/0211028.
- [29] Shachar Itzhaky, Rohit Singh, Armando Solar-Lezama, Kuat Yessenov, Yongquan Lu, Charles Leiserson, and Rezaul Chowdhury. Deriving divide-and-conquer dynamic programming algorithms using solver-aided transformations. ACM SIGPLAN Notices, 51(10):145–164, 2016.
- [30] Mohammad Mahdi Javanmard, Pramod Ganapathi, Rathish Das, Zafar Ahmad, Stephen Tschudi, and Rezaul Chowdhury. Toward efficient architecture-independent algorithms for dynamic programs. In High Performance Computing: 34th International Conference, ISC High Performance 2019, Frankfurt/Main, Germany, June 16–20, 2019, Proceedings 34, pages 143–164. Springer, 2019. doi:10.1007/978-3-030-20656-7_8.
- [31] Jia-Wei Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 326–333, 1981.
- [32] Tadao Kasami. An efficient recognition and syntax-analysis algorithm for context-free languages. Coordinated Science Laboratory Report no. R-257, 1966.
- [33] Donald E. Knuth. Optimum binary search trees. Acta informatica, 1:14–25, 1971. doi:10.1007/BF00264289.
- [34] Richard Koch, Tom Leighton, Bruce Maggs, and Satish Rao. Work-preserving emulations of fixed-connection networks. In Proceedings of the twenty-first annual ACM symposium on Theory of computing, pages 227–240, 1989.
- [35] Martin Lange and Hans Leiß. To CNF or not to CNF? An efficient yet presentable version of the CYK algorithm. Informatica Didactica, 8(2009):1–21, 2009. URL: http://www.informatica-didactica.de/cmsmadesimple/index.php?page=LangeLeiss2009.
- [36] J-S Park, Michael Penner, and Viktor K Prasanna. Optimizing graph algorithms for improved cache performance. IEEE Transactions on parallel and distributed systems, 15(9):769–782, 2004. doi:10.1109/TPDS.2004.44.
- [37] Cynthia A Patterson, Marc Snir, and Susan L Graham. Getting up to speed: The future of supercomputing. National Academies Press, 2005.
- [38] Vijaya Ramachandran. Cache-oblivious computation: Algorithms and experimental evaluation. In 2007 International Conference on Computing: Theory and Applications (ICCTA’07), pages 20–26. IEEE, 2007. doi:10.1109/ICCTA.2007.34.
- [39] John E Savage. Extending the Hong-Kung model to memory hierarchies. In International Computing and Combinatorics Conference, pages 270–281. Springer, 1995. doi:10.1007/BFB0030842.
- [40] Jacob Scott, Olga Holtz, and Oded Schwartz. Matrix multiplication I/O-complexity by path routing. In Proceedings of the 27th ACM symposium on Parallelism in Algorithms and Architectures, pages 35–45, 2015. doi:10.1145/2755573.2755594.
- [41] Michael Sipser. Introduction to the Theory of Computation. ACM Sigact News, 27(1):27–29, 1996. doi:10.1145/230514.571645.
- [42] Xinying Song, Shilin Ding, and Chin-Yew Lin. Better binarization for the CYK parsing. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 167–176, 2008. URL: https://aclanthology.org/D08-1018/.
- [43] Shanjiang Tang, Ce Yu, Jizhou Sun, Bu-Sung Lee, Tao Zhang, Zhen Xu, and Huabei Wu. Easypdp: An efficient parallel dynamic programming runtime system for computational biology. IEEE Transactions on Parallel and Distributed Systems, 23(5):862–872, 2011. doi:10.1109/TPDS.2011.218.
- [44] Yuan Tang, Ronghui You, Haibin Kan, Jesmin Jahan Tithi, Pramod Ganapathi, and Rezaul A Chowdhury. Cache-oblivious wavefront: Improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 205–214, 2015.
- [45] Leslie Valiant. General context-free recognition in less than cubic time. Journal of Computer and System Sciences, 10(2):308–315, 1975.
- [46] F Frances Yao. Efficient dynamic programming using quadrangle inequalities. In Proceedings of the twelfth annual ACM symposium on Theory of computing, pages 429–435, 1980. doi:10.1145/800141.804691.
- [47] Daniel H Younger. Recognition and parsing of context-free languages in time $n^3$. Information and Control, 10(2):189–208, 1967. doi:10.1016/S0019-9958(67)80007-X.
