
On the I/O Complexity of the Cocke-Younger-Kasami Algorithm and of a Family of Related Dynamic Programming Algorithms

Lorenzo De Stefani (corresponding author), Department of Computer Science, Brown University, Providence, RI, USA; Vedant Gupta, Department of Computer Science, Brown University, Providence, RI, USA
Abstract

Asymptotically tight lower bounds are derived for the Input/Output (I/O) complexity of a class of dynamic programming algorithms, including matrix chain multiplication, optimal polygon triangulation, and the construction of optimal binary search trees. Assuming no recomputation of intermediate values, we establish an $\Omega\left(\frac{n^3}{\sqrt{M}B}\right)$ I/O lower bound, where $n$ denotes the size of the input and $M$ denotes the size of the available fast memory (cache). When recomputation is allowed, we show that the same bound holds for $M < cn$, where $c$ is a positive constant. In the case where $M \ge 2n$, we show an $\Omega(n/B)$ I/O lower bound. We also discuss algorithms for which the number of executed I/O operations matches asymptotically each of the presented lower bounds, which are thus asymptotically tight.

Additionally, we refine our general method to obtain a lower bound for the I/O complexity of the Cocke-Younger-Kasami algorithm, where the size of the grammar impacts the I/O complexity. An upper bound with asymptotically matching performance in many cases is also provided.

Keywords and phrases:
I/O complexity, Dynamic Programming Algorithms, Lower Bounds, Recomputation, Cocke-Younger-Kasami
Copyright and License:
© Lorenzo De Stefani and Vedant Gupta; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Design and analysis of algorithms
Related Version:
Full Version: https://arxiv.org/abs/2410.20337 [21]
Editors:
Pat Morin and Eunjin Oh

1 Introduction

The performance of computing systems is significantly influenced by data movement, affecting both time and energy consumption. This ongoing technological trend [37] is expected to continue as fundamental limits on minimum device size and maximum message speed lead to inherent costs associated with data transfer, be it across the levels of a hierarchical memory system or between processing elements in a parallel system [10]. While the communication requirements of algorithms have been extensively studied, deriving significant lower bounds based on these requirements and matching upper bounds remains a critical challenge.

Dynamic Programming (DP) is a classic algorithmic framework introduced by Bellman [4] that is particularly effective for optimization tasks in which a problem can be broken down into overlapping subproblems with optimal substructure, storing the solutions to these subproblems to prevent redundant work. DP is widely used in computer science and other important fields such as control theory, operations research, and computational biology.

In this work, we analyze the memory access cost (I/O complexity) of a family of DP algorithms when executed in a two-level storage hierarchy with M words of fast memory (cache). The analyzed algorithms share a similar structure of dependence on the results of subproblems and include, among others, the classic DP algorithms for matrix chain multiplication, boolean parenthesization, optimal polygon triangulation, and construction of optimal binary search trees. We obtain asymptotically tight lower bounds for the I/O complexity of these algorithms both for schedules in which no intermediate values are computed more than once and for general schedules allowing recomputation. Our analysis reveals that, for a range of values of the size of the cache, recomputing intermediate values can lead to an asymptotically lower number of I/O operations. We further refine our analysis to derive an I/O lower bound for the Cocke-Younger-Kasami (CYK) algorithm [20, 47, 32] and an (almost) asymptotically matching upper bound.

Previous and related work.

I/O complexity was introduced in the seminal work by Hong and Kung [31]; it denotes the number of data transfers between the two levels of a memory hierarchy with a cache of M words and a slow memory of unbounded size. Hong and Kung presented techniques to develop lower bounds for the I/O complexity of computations modeled by Computational Directed Acyclic Graphs (CDAGs).

While the “edge expansion technique” of [3], the “path routing technique” of [40], and the “closed dichotomy width” technique of [11] all yield I/O lower bounds that apply only to computational schedules for which no intermediate result is ever computed more than once (nr-computations), it is also important to investigate what can be achieved with recomputation. In fact, for some CDAGs, recomputing intermediate values reduces the space and/or the I/O complexity of an algorithm [39]. In [8], it is shown that some algorithms admit a portable schedule (i.e., a schedule that achieves optimal performance across memory hierarchies with different access costs) only if recomputation is allowed. Recomputation can also enhance the performance of simulations among networks (see [34] and references therein) and plays a key role in the design of efficient area-universal VLSI architectures. Several lower-bound techniques that allow recomputation have been presented in the literature, including the “S-partition technique” [31], the “S-span technique” [39], the “S-covering technique” [9] (which merges and extends aspects of both [31] and [39]), the “G-flow technique” [5, 6], and the “visit partition technique” [7]. However, to the best of our knowledge, none of these has been previously applied to the family of DP algorithms considered in this work.

A systematic study of Dynamic Programming was introduced by Bellman [4], and it has since become a popular problem-solving methodology. Galil and Park [27] provide a survey and classification of popular dynamic programming algorithms and optimization methods. Several results have been presented in the literature aimed at optimizing the running time of Dynamic Programming algorithms [28, 33, 46, 27]. Notably, Valiant [45] showed that context-free recognition can be carried out as fast as Boolean matrix multiplication.

Cherng and Ladner [13] provide a cache-oblivious divide-and-conquer algorithm and a cache-aware blocked algorithm for simple dynamic programs, a class that includes matrix chain multiplication and the construction of optimal binary search trees. Their approach is based on Valiant’s algorithm [45] and incurs $\mathcal{O}\left(\frac{n^3}{B\sqrt{M}}\right)$ cache misses. In [16, 17], Chowdhury and Ramachandran introduce the Gaussian Elimination Paradigm (GEP) that provides cache-oblivious algorithms for problems including all-pairs shortest path. The authors also provide a way to transform simple DP problems into GEP problems. A proof of the optimality of the proposed algorithm for GEP problems is provided based on a similar result given in [31] for the matrix multiplication problem. As we show, this result does not extend to simple DP algorithms, as fewer cache misses can be achieved in some scenarios. Cache-oblivious DP algorithms for other problems have been explored [36, 19, 22]. There has also been work on parallelizing cache-efficient DP algorithms [18, 14, 30, 44, 43, 25, 23, 12], experimental evaluations of the resulting performance improvement [38], and on the automated discovery of cache-efficient DP algorithms [15, 29].

Dunlop et al. [24] provide an empirical analysis of various strategies to reduce the number of accesses to the grammar for the CYK algorithm. There have also been other works on modifying a grammar representation to improve efficiency. Notably, Song et al. [42] provide an empirical analysis of different grammar binarization approaches to improve efficiency, while Lange et al. [35] argue for the merits of a modified version of CYK that works with grammars in binary normal form instead of Chomsky normal form.

Summary of results.

In this work, we analyze the I/O complexity of a family of DP algorithms that follow the general strategy outlined in Prototype Algorithm 1 when executed in a two-level storage hierarchy with M words of cache. The analyzed algorithms share a similar structure of dependence on the results of subproblems while allowing for diverse implementation choices and optimal problem substructure.

We analyze both schedules in which no intermediate values are computed more than once (no-recomputation schedules) and general ones that allow for recomputation. For both settings, we derive an $\Omega\left(\frac{n^3}{\sqrt{M}B}\right)$ lower bound to the number of I/O operations required by sequential execution of these algorithms on an input of size $n$ in a two-level storage hierarchy with a cache of size $M \le cn$, for some positive constant value $c$, and such that $B \ge 1$ memory words can be moved between consecutive memory locations of the cache and of the slow memory with a single I/O operation. However, we show that while no-recomputation schedules require $\Omega\left(\frac{n^3}{\sqrt{M}B}\right)$ I/O operations even when using a cache of size at least $2n$ and $o(n^2)$, schedules using recomputation achieve an asymptotic reduction of the number of required I/O operations by a factor $\Theta(n^2/\sqrt{M})$. This is particularly significant as in many cases of interest (e.g., Matrix Multiplication [2, 5, 31], Fast Fourier Transform [31], Integer Multiplication [6]) recomputation has been shown to enable a reduction of the I/O cost by at most a constant multiplicative factor. These results are obtained by analyzing a representation of the considered algorithms as Computational Directed Acyclic Graphs (CDAGs) whose structure captures the dependence between subproblems common to the DP algorithms considered in our work. We provide a general construction of such CDAGs and analyze their internal connection properties.

We refine our general method to obtain an $\Omega\left(\frac{n^3\Gamma}{\sqrt{M}B}\right)$ lower bound for the I/O complexity of the Cocke-Younger-Kasami algorithm when deciding whether an input string of length $n$ is a member of the language generated by a given Context-Free Grammar with $\Gamma$ rules with distinct right-hand sides not including terminals. While the CYK algorithm exhibits a structure of subproblem dependencies similar to that of the previously analyzed problems, it presents several challenging differences, which we address by modifying our lower-bound methods and further refining the CDAG representation. Finally, we present a cache-oblivious implementation of the CYK algorithm for which, depending on the composition of the rules of the considered CFG, the number of executed I/O operations almost asymptotically matches the lower bound.

2 Preliminaries

We consider a family of Dynamic Programming (DP) algorithms following the general strategy outlined in Prototype Algorithm 1: Given an input of size $n$, the solution, denoted as $S(1,n)$, is computed bottom-up. First, the values corresponding to the solutions of subproblems of input size one, $S(i,i)$ for $i\in\{1,2,\ldots,n\}$, are initialized to some set default value. The results of subproblems corresponding to parts of the input of growing size $\ell$, that is, $S(i,i+\ell)$ for $i\in\{1,2,\ldots,n\}$ and $\ell\in\{1,2,\ldots,n-i\}$, are computed by first evaluating the combinations of pairs of subproblems $S(i,i+k)$ and $S(i+k+1,i+\ell)$, for $k\in\{0,1,\ldots,\ell-1\}$, and then composing them according to the optimal subproblem structure used in the algorithm.

Algorithm 1 Prototype DP algorithm 𝒜.

Depending on the specific problem, the default initialization values (INITIALIZATION_VALUE in the Prototype Algorithm) of $S(i,i)$, the way that the pairs $S(i,i+k)$ and $S(i+k+1,i+\ell)$ are combined (COMBINE), and the way these results are composed to evaluate $S(i,i+\ell)$ (AGGREGATE) may differ. However, AGGREGATE is assumed to be commutative and associative. Besides these differences, all the considered algorithms share the subproblem dependence structure outlined in the Prototype Algorithm, for which a visual representation is provided in Figure 1(a). Examples of algorithms following such a structure include the classic DP algorithms for the matrix chain multiplication problem, optimal convex polygon triangulation, the construction of optimal binary search trees, and the Cocke-Younger-Kasami algorithm (Algorithm 5).
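To make the structure concrete, the following minimal Python sketch mirrors the prototype just described (the actual pseudocode is given as Algorithm 1; here `init_value`, `combine`, and `aggregate` are hypothetical stand-ins for INITIALIZATION_VALUE, COMBINE, and AGGREGATE, and any index-dependent terms that a specific instantiation may add to COMBINE are elided):

```python
def prototype_dp(n, init_value, combine, aggregate):
    """Sketch of Prototype Algorithm A: computes S(1, n) bottom-up.

    combine(left, right) maps a pair of subproblem solutions to one
    candidate value (COMBINE); aggregate(candidates) folds an iterable
    of candidates into one value (AGGREGATE), assumed commutative and
    associative. S[(i, j)] holds the solution for input segment i..j.
    """
    S = {(i, i): init_value(i) for i in range(1, n + 1)}
    for l in range(1, n):                  # segments of growing size l
        for i in range(1, n - l + 1):
            j = i + l
            # one candidate per split point k in {0, ..., l-1}
            S[(i, j)] = aggregate(
                combine(S[(i, i + k)], S[(i + k + 1, j)])
                for k in range(l))
    return S[(1, n)]
```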

We provide a general analysis of the I/O complexity of algorithms following the structure of the Prototype Algorithm 𝒜, and we then refine our analysis for the mentioned algorithms of interest.

Computational Directed Acyclic Graphs.

Our analysis is based on modeling the execution of the DP algorithms of interest as a Computational Directed Acyclic Graph (CDAG) $G=(V,E)$. Each vertex $v\in V$ represents either an input value or the result of an operation (i.e., an intermediate result or one of the output values) which is stored using a single memory word. The directed edges in $E$ represent data dependencies. That is, vertices $u,v\in V$ are connected by an edge $(u,v)$ directed from $u$ to $v$ if and only if the value corresponding to $u$ is an operand of the unit-time operation computing the value corresponding to $v$. Vertex $u$ is said to be a predecessor of $v$, and $v$ is said to be a successor of $u$. For a given vertex $v$, the set of its predecessors (resp., successors) is defined as $pre(v)=\{u\in V\ \text{s.t.}\ (u,v)\in E\}$ (resp., $suc(v)=\{u\in V\ \text{s.t.}\ (v,u)\in E\}$). We refer to vertices with no predecessors (resp., successors) as the input (resp., output) vertices of a CDAG. We say that $G'=(V',E')$ is a sub-CDAG of $G=(V,E)$ if $V'\subseteq V$ and $E'\subseteq E\cap(V'\times V')$. We say that two sub-CDAGs $G'=(V',E')$ and $G''=(V'',E'')$ of $G$ are vertex-disjoint if $V'\cap V''=\emptyset$.

I/O model.

We assume that sequential computations are executed on a system with a two-level memory hierarchy, consisting of a cache of size $M$ (measured in memory words) and a slow memory of unlimited size. An operation can be executed only if all its operands are in the cache. We assume that the processor is equipped with standard elementary logic and algebraic operations. Data can be moved from the slow memory to the cache by read operations and, in the other direction, by write operations. These operations are called I/O operations. We assume that the input data is stored in the slow memory at the beginning of the computation. The evaluation of a CDAG in this model can be analyzed using the “red-blue pebble game” [31]. The number of I/O operations executed when evaluating a CDAG depends on the “computational schedule”, that is, on the order in which vertices are evaluated and on which values are kept in or discarded from the cache. The I/O complexity of a CDAG $G$ corresponding to the execution of algorithm $\mathcal{A}$ for an input of size $n$, denoted as $IO_{\mathcal{A}}(M)$, is defined as the minimum number of I/O operations over all possible computational schedules. We further consider a generalization of this model, known as the “External Memory Model” of Aggarwal and Vitter [1], where $B\ge 1$ memory words can be moved between consecutive memory locations of the cache and of the slow memory with a single I/O operation.

3 CDAG construction

(a) Representation of the subproblem dependency structure for the Prototype Algorithm and other DP algorithms considered in this work for an input of size n=4.
(b) Construction of $G$ from the simplified DAG. Root vertices are shown in black and leaves belonging to subproblem $S(1,4)$ are shown in green.
Figure 1: Construction of CDAG in 𝒢(n).

In this section, we give the construction of a family of CDAGs representing the execution of a class of DP algorithms exhibiting a subproblem optimality structure such as the one outlined for Prototype Algorithm 1.

Given input size $n$, the CDAG $G$ corresponding to the execution of $\mathcal{A}$ is constructed using the CDAG in Figure 1(a) as the base structure. The actual CDAG is obtained by refining the basic structure, replacing each vertex corresponding to a subproblem $S(i,j)$ with a directed binary tree DAG with $j-i$ leaves and all edges directed towards the root. The root vertex (R-vertex for short) of the tree DAG, henceforth referred to as $v_{i,j}$, corresponds to the computation of the solution of the associated subproblem $S(i,j)$. The $k$-th “leaf vertex” (L-vertex for short) of the tree, for $k\in\{0,\ldots,j-i-1\}$, results from the combination of subproblems $S(i,i+k)$ and $S(i+k+1,j)$, and has as predecessors the R-vertices of the tree DAGs corresponding, respectively, to $S(i,i+k)$ and $S(i+k+1,j)$ (i.e., $v_{i,i+k}$ and $v_{i+k+1,j}$); a visual representation is provided in Figure 1(b). Notice that binary trees corresponding to different R-vertices are vertex-disjoint. For each such tree DAG, we say that its $j-i$ L-vertices belong to the root $v_{i,j}$.

Our lower bound results hold for any possible structure of these tree DAGs. This is a feature of our model meant to accommodate a variety of possible implementations of $\mathcal{A}$ and of any other DP algorithm sharing the same substructure.

As an example, for the classic DP algorithm for the matrix chain multiplication problem (presented in the appendix of the extended online version [21]), the root vertex of each tree sub-CDAG $S(i,j)$ corresponds to the computation of the minimum number of operations required to compute the chain product of the matrices from the $i$-th to the $j$-th, and the $k$-th leaf, for $k\in\{0,\ldots,j-i-1\}$, corresponds to the computation of $S(i,i+k)+S(i+k+1,j)+d_{i-1}d_{i+k}d_j$.
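As an illustration (a hypothetical sketch, not the pseudocode from [21]), this instantiation fits the prototype with AGGREGATE = min and a COMBINE that adds the product of the relevant matrix dimensions:

```python
def mcm_cost(d):
    """Matrix chain multiplication as an instance of the prototype:
    matrix i has dimensions d[i-1] x d[i]; the k-th leaf of S(i, j)
    computes S(i, i+k) + S(i+k+1, j) + d[i-1]*d[i+k]*d[j]."""
    n = len(d) - 1
    S = {(i, i): 0 for i in range(1, n + 1)}
    for l in range(1, n):
        for i in range(1, n - l + 1):
            j = i + l
            S[(i, j)] = min(S[(i, i + k)] + S[(i + k + 1, j)]
                            + d[i - 1] * d[i + k] * d[j]
                            for k in range(l))
    return S[(1, n)]
```

For example, `mcm_cost([10, 20, 5, 30])` evaluates to 2500, the cost of parenthesizing the chain as $(A_1A_2)A_3$.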

We denote by $\mathcal{G}(n)$ the family of CDAGs constructed in this way, with $n$ input vertices and differing in the tree sub-DAGs used to accumulate the values associated with the L-vertices into the values associated with the corresponding R-vertices. This family of CDAGs captures the dependence structure between subproblems common to the family of DP algorithms that are the focus of our analysis.

Analysis of CDAG properties.

Given a CDAG $G$ from the family $\mathcal{G}(n)$, let $R$ (resp., $L$) denote the set of its R-vertices (resp., L-vertices). Then, $|R|=\frac{n(n+1)}{2}$ and $|L|=\frac{n^3-n}{6}$.

The “$i$-th row” (resp., the “$j$-th column”) of $G$, written $r_i$ (resp., $c_j$), is defined as the subset of $R$ including all of the R-vertices $v_{i,j}$, each corresponding to the computation of subproblem $S(i,j)$ for $j\in\{i,\ldots,n\}$ (resp., $i\in\{1,\ldots,j\}$). For any $v\in R$, if $v\in r_i$ and $v\in c_j$, then $i\le j$. By construction, the rows $(r_1,r_2,\ldots,r_n)$ (resp., the columns $(c_1,c_2,\ldots,c_n)$) partition $R$, and $|r_i|=n-i+1$ (resp., $|c_i|=i$) for $1\le i\le n$. A visual representation of rows and columns can be found in Figure 2(a).

(a) Rows and columns.
(b) W-cover.
Figure 2: Representation of rows, columns, and the W-cover for input size n=4.

By construction, each L-vertex has a distinct pair of R-vertices as its predecessors. Such a pair is said to interact in the computation. Given $v_{i,j}\in R$, it interacts with all and only the vertices $v_{k,i-1}$ for $1\le k\le i-1$, and the vertices $v_{j+1,l}$ for $j+1\le l\le n$. All such $n+i-j-1$ interactions correspond to the computation of an L-vertex belonging to a distinct sub-tree corresponding to the subproblem $S(k,j)$ or $S(i,l)$ (i.e., to the R-vertex $v_{k,j}$ or $v_{i,l}$). Pairs of vertices in the same row (or column) do not interact.

We define $W_i=r_{i+1}\cup c_i$, for $1\le i\le n-1$, as the subsets of $R$ such that each vertex in $c_i$ interacts with all the vertices of $r_{i+1}$ (a visual representation can be found in Figure 2(b)). By construction, $|W_i|=n$. A vertex $v_{1,j}\in r_1$ (resp., $v_{i,n}\in c_n$) is only a member of $W_j$ (resp., $W_{i-1}$). All remaining vertices $v_{i,j}\in R$ are members of both $W_j$ and $W_{i-1}$, which are distinct since, by construction, for any $v_{i,j}\in R$ we have $i\le j$. We refer to the collection of sets $(W_1,W_2,\ldots,W_{n-1})$ as the “W-cover” of the R-vertices of CDAG $G$. By its definition, and by the properties of the interactions between vertices according to their placement in rows and columns discussed in the previous paragraphs, we have the following property:

Lemma 1.

Given a CDAG $G\in\mathcal{G}(n)$, let $W_1,W_2,\ldots,W_{n-1}$ denote the W-cover of its R-vertices. For any set $X\subseteq W_i$, there are at most $\frac{|X|^2}{4}\le\frac{n^2}{4}$ pairs of vertices $(u,v)$, where $u\in r_{i+1}$ and $v\in c_i$, that interact as predecessors of distinct L-vertices, each of which belongs to a distinct R-vertex.

4 I/O analysis for computations of CDAGs in 𝓖(𝒏)

We analyze the I/O complexity of CDAGs in the class $\mathcal{G}(n)$ according to the red-blue pebble game of Hong and Kung [31]. To simplify the presentation, we assume that $n$ and $M$ are powers of 2, i.e., $n=2^i$ and $M=2^j$ for integers $i,j\ge 3$. If that is not the case, our analysis can be adapted with minor, albeit tedious, modifications resulting in different constant terms.

4.1 I/O lower bound for computations with no recomputation

In this section, we present a lower bound to the I/O complexity of algorithms corresponding to CDAGs in $\mathcal{G}(n)$, assuming that no intermediate value is ever recomputed (i.e., under the “no-recomputation assumption”). We first introduce a useful technical lemma:

Lemma 2.

Given $n\in\mathbb{N}^+$ non-negative real numbers $a_1,a_2,\ldots,a_n$ such that $\sum_{i=1}^{n}a_i^2\ge T$, where $T\in\mathbb{R}^+$, and $\max_{i=1}^{n}a_i^2\le L$, where $L\in\mathbb{R}^+$ and $L<T$, then $\sum_{i=1}^{n}a_i\ge T/\sqrt{L}$.

Proof.

The lower bound comes from the intuitive strategy of making each $a_i$ as large as possible: if $q=\lfloor T/L\rfloor$, then assign $a_1,\ldots,a_q=\sqrt{L}$ and $a_{q+1}=\sqrt{T-qL}$. In this case,

$$\sum_{i=1}^{q+1}a_i\ =\ q\sqrt{L}+\sqrt{T-qL}\ \ge\ q\sqrt{L}+\left(\frac{T}{L}-q\right)\sqrt{L}\quad\left(\text{as } \frac{T}{L}-q<1\right)\ =\ \frac{T}{\sqrt{L}}.$$

To show the optimality of this strategy, we show that any other configuration is suboptimal.

Consider any other assignment of $a_1,\ldots,a_n$ that is not just a permutation of the same values. By assumption, any such assignment must have at least two non-zero values $a_i,a_j<\sqrt{L}$. Let $k=a_i+a_j$. If $k\le\sqrt{L}$, since $k^2>a_i^2+a_j^2$, there exists some $c\in\mathbb{R}^+$ such that $c^2=a_i^2+a_j^2$ and $c<a_i+a_j$, showing that our assignment $a_1,\ldots,a_n$ does not minimize $\sum_{i=1}^{n}a_i$.

Alternatively, if $k>\sqrt{L}$, then $a_i^2+a_j^2=a_i^2+(k-a_i)^2<L+(k-\sqrt{L})^2$ (the last step follows from the fact that $f(x)=x^2+(k-x)^2$ defines an upwards-facing parabola with axis of symmetry at $x=\frac{k}{2}$). This means that there exists some $c<k-\sqrt{L}$ such that $L+c^2=a_i^2+a_j^2$ and $\sqrt{L}+c<a_i+a_j$, again showing that our assignment $a_1,\ldots,a_n$ does not minimize $\sum_{i=1}^{n}a_i$.
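As a worked instance of Lemma 2 (with illustrative numbers), take $T=8$ and $L=4$, so that each $a_i\le\sqrt{L}=2$: the strategy above sets $q=\lfloor T/L\rfloor=2$ and assigns $a_1=a_2=2$, giving $\sum_i a_i^2=8=T$ and $\sum_i a_i=4=T/\sqrt{L}$, whereas spreading the same square mass over smaller values, e.g., $a_1=\cdots=a_8=1$, yields the strictly larger sum $\sum_i a_i=8$.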

The following property captures the relation between subsets of L-vertices and their number of distinct R-vertex predecessors:

Lemma 3.

Let $L'\subseteq L$ be such that the vertices in $L'$ belong to the tree sub-CDAGs of at most $0<\rho\le|L'|$ distinct R-vertices. The set $R'\subseteq R$ containing all R-vertices that are predecessors of at least one vertex in $L'$ has cardinality at least $|L'|/\sqrt{\rho}$.

Proof.

Consider the W-cover of $G$ as defined in Section 3. As any R-vertex can be a member of at most two distinct $W_i$’s, we have:

$$|R'|\ \ge\ \frac{1}{2}\sum_{i=1}^{n-1}|R'\cap W_i|. \qquad (1)$$

By definition, each vertex in $L'$ has a distinct pair of R-vertex predecessors which, by construction, are both included in exactly one set $W_i$. Let $a_i$ be the number of vertices from $L'$ with both predecessors in $W_i$. By the assumptions, we have:

$$\sum_{i=1}^{n-1}a_i=|L'|. \qquad (2)$$

By Lemma 1, for any $W_i$, the interacting pairs of R-vertices in it are the predecessors of distinct L-vertices, each belonging to the tree sub-CDAG of a distinct R-vertex. From the assumption that the vertices in $L'$ belong to at most $\rho$ distinct roots, we have:

$$\max_{i=1,\ldots,n-1}a_i\ \le\ \rho. \qquad (3)$$

Furthermore, from Lemma 1, we know that the vertices in $R'\cap W_i$ can interact to produce at most $|R'\cap W_i|^2/4$ vertices in $L'$. Thus, $|R'\cap W_i|^2/4\ge a_i$ for each $i$, and hence:

$$\sum_{i=1}^{n-1}|R'\cap W_i|\ \ge\ \sum_{i=1}^{n-1}2\sqrt{a_i}. \qquad (4)$$

By (1), (4), (2), (3), and Lemma 2 (in this order), we have:

$$|R'|\ \ge\ \frac{1}{2}\sum_{i=1}^{n-1}|R'\cap W_i|\ \ge\ \sum_{i=1}^{n-1}\sqrt{a_i}\ \ge\ \frac{|L'|}{\sqrt{\rho}}.$$

Lemma 3 allows us to claim that, in order to compute a set of L-vertices, it is necessary to access a minimum number of R-vertices which, due to the no-recomputation assumption, must either be in the cache or be loaded into it using read I/O operations.

Theorem 4.

For any CDAG $G=(V,E)\in\mathcal{G}(n)$, let $\mathcal{A}$ denote an algorithm corresponding to it. The I/O complexity of algorithm $\mathcal{A}$, when run without recomputing any intermediate value, using a cache memory of size $M$, and where for each I/O operation it is possible to move up to $B$ memory words stored in consecutive memory locations from the cache to slow memory or vice versa, is:

$$IO_G(n,M,B)\ \ge\ \max\left\{\frac{n^3-n}{16\sqrt{M}}-\frac{n(n+1)}{2}-3M,\ n\right\}\frac{1}{B} \qquad (5)$$
Proof.

We prove the result for B=1. The result then trivially generalizes for a generic B.

The fact that $IO_G(n,M,1)\ge n$ follows trivially from our assumption that the input is initially stored in the slow memory and must therefore be loaded into the cache at least once, using $n$ read I/O operations.

Let $\mathcal{C}$ be any computational schedule for the sequential execution of $\mathcal{A}$ using a cache of size $M$ in which no value corresponding to any vertex of $G$ is computed more than once. We partition $\mathcal{C}$ into non-overlapping segments $\mathcal{C}_1,\mathcal{C}_2,\ldots$ such that during each $\mathcal{C}_i$ exactly $8M^{1.5}$ values corresponding to distinct L-vertices, denoted as $L_i$, are evaluated. Since $|L|=(n^3-n)/6$, there are $\left\lfloor\frac{n^3-n}{48M^{1.5}}\right\rfloor$ such segments. Given $L_i$, we refer to the set of R-vertices to which at least one of the vertices in $L_i$ belongs as $R_i$; these correspond to vertices which are active during $\mathcal{C}_i$. For each segment, we denote as $h_i$ the number of R-vertices whose value is completely evaluated during $\mathcal{C}_i$. As no value is computed more than once, $\sum_i h_i\le|R|=n(n+1)/2$. Let $g_i$ denote the number of I/O operations executed during $\mathcal{C}_i$. We consider two cases:

  • The vertices of $L_i$ belong to at least $4M+1$ distinct R-vertices: for each of the at least $4M+1-h_i$ values corresponding to R-vertices that are only partially computed during $\mathcal{C}_i$, a partial accumulator of the values corresponding to their leaves computed during $\mathcal{C}_i$ must either be in the cache at the end of $\mathcal{C}_i$ or be written to slow memory by means of a write I/O operation. Hence, $g_i\ge 4M+1-M-h_i\ge 3M-h_i$.

  • The vertices of $L_i$ belong to at most $4M$ distinct R-vertices: by Lemma 3, the vertices in $L_i$ have at least $4M$ R-vertex predecessors, of which at most $h_i$ are computed during $\mathcal{C}_i$. As, in the considered schedule, values are not computed more than once, each of the remaining $4M-h_i$ values corresponding to predecessor vertices of $L_i$ must either be in the cache at the beginning of $\mathcal{C}_i$ or be read into the cache by means of a read I/O operation. Hence, $g_i\ge 4M-M-h_i=3M-h_i$.

The I/O requirement of the algorithm is obtained by combining the cost of each non-overlapping segment:

$$IO_G(n,M,1)\ \ge\ \sum_{i=1}^{\left\lfloor\frac{n^3-n}{48M^{1.5}}\right\rfloor}g_i\ \ge\ \sum_{i=1}^{\left\lfloor\frac{n^3-n}{48M^{1.5}}\right\rfloor}(3M-h_i)\ \ge\ \left\lfloor\frac{n^3-n}{48M^{1.5}}\right\rfloor 3M-\sum_i h_i\ \ge\ \frac{n^3-n}{16\sqrt{M}}-3M-|R|.$$

By Theorem 4, we have that for $M\le cn^2$, for a sufficiently small constant $c>0$, computations of CDAGs in $\mathcal{G}(n)$ in which no value is recomputed require $\Omega\left(\frac{n^3}{\sqrt{M}B}\right)$ I/O operations.

On the tightness of the bound

In [13], Cherng and Ladner presented Valiant’s DP-Closure Algorithm, henceforth referred to as VDP, which computes algorithms following the general structure outlined in Prototype Algorithm 1 cache-obliviously and without recomputation. When $M<|R|=\frac{n(n+1)}{2}$, VDP executes $\mathcal{O}\left(\frac{n^3}{\sqrt{M}B}\right)$ I/O operations, and for $M\ge|R|$ it executes $\mathcal{O}(n/B)$ I/O operations. Thus, we can conclude that our I/O lower bound for schedules with no recomputation, provided in Theorem 4, is asymptotically tight.

VDP utilizes a divide-and-conquer approach, making recursive calls on smaller subproblems. Here we give a simplified presentation of Valiant’s DP-Closure Algorithm; please refer to Cherng and Ladner [13] for a detailed presentation.

Consider any algorithm following Prototype Algorithm 1 with an input of size $n\ge 1$, where we assume $n=2^a$ for some $a\in\mathbb{N}$ (this can always be made the case by padding $\mathbf{X}$). All subproblems $S(i,j)$ computed by the algorithm are associated with entries of an $n\times n$ matrix $\mathbf{X}$, so that $S(i,j)$ is represented at $\mathbf{X}[i,j+1]$. Initially, the values $\mathbf{X}[i,i+1]$ corresponding to the values $S(i,i)$ for $1\le i\le n-1$ are computed from the input, while the remaining entries of $\mathbf{X}$ are set to the LEAST_OPTIMAL_VALUE (e.g., 0). To fully compute all the subproblems (i.e., to compute the DP-closure), VDP splits $\mathbf{X}$ into 16 equally-sized submatrices of dimension $\frac{n}{4}\times\frac{n}{4}$, labeled as follows:

$$\mathbf{X}=\begin{bmatrix}\mathbf{X}_{11}&\mathbf{X}_{12}&\mathbf{X}_{13}&\mathbf{X}_{14}\\ &\mathbf{X}_{22}&\mathbf{X}_{23}&\mathbf{X}_{24}\\ & &\mathbf{X}_{33}&\mathbf{X}_{34}\\ & & &\mathbf{X}_{44}\end{bmatrix}$$

We omit the representation of the lower-triangular submatrices, as these have all of their entries set to 0. Valiant’s DP-Closure Algorithm (whose pseudocode is presented in Algorithm 2) is given the matrix $\mathbf{X}$ as input, with the only non-zero values corresponding to the subproblems $S(i,i)$ for $1\le i\le n-1$. The values of the $S(i,j)$ (i.e., of the R-vertices) corresponding to the top-left and bottom-right submatrices are computed through recursive calls, exploiting the fact that these submatrices correspond to smaller instantiations of the same DP problem. The remaining values in $\mathbf{X}$ are then computed by invoking a subroutine called “Valiant’s Star Algorithm” (VSTAR).

Algorithm 2 Valiant’s DP-Closure Algorithm (VDP).

VSTAR is given as input a matrix $\mathbf{X}$ in which the subproblems in the top-left and bottom-right quadrants have already been computed. The algorithm then recursively computes the values corresponding to the remaining entries of $\mathbf{X}$ (i.e., the values of the R-vertices). The pseudocode for this function can be found in Algorithm 3. We represent COMBINE with “$\otimes$” and AGGREGATE with “+”. Operations involving these operators in lines 6, 8, 10, and 11 are computed using a subroutine called “Matrix multiply and accumulate” that, given three equally-sized square matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$, computes $\mathbf{A}+\mathbf{B}\otimes\mathbf{C}$. In the context of VDP, $\mathbf{A}$ contains partially computed values $S(i,j)$ (i.e., corresponding to R-vertices), and $\mathbf{B}\otimes\mathbf{C}$ represents the computation of combinations of subproblems contributing to $S(i,j)$ (i.e., $S(i,k)\otimes S(k+1,j)$ for $i\le k<j$), that is, the values corresponding to the L-vertices, which are then used to update $\mathbf{A}$. A recursive in-place implementation of this function is presented in detail in [13, 26].

Algorithm 3 Valiant’s Star Algorithm (VSTAR).
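The pseudocode of Algorithms 2 and 3 is available in [13]. Since the actual work of VSTAR reduces to calls to Matrix multiply and accumulate, a minimal Python sketch of that kernel may help fix ideas; it is an assumed simplification of the in-place recursive implementation of [13, 26] (with `n` a power of 2, `combine`/`aggregate` standing for $\otimes$/+, and `aggregate` used here as a binary accumulation operator):

```python
def mma(A, B, C, combine, aggregate, n, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0):
    """Cache-oblivious 'Matrix multiply and accumulate' sketch:
    accumulates B (x) C into the n x n block of A at offset (ai, aj).
    The quadrant recursion shrinks the working set until the three
    blocks involved fit in the cache, which is the source of the
    O(n^3 / (B sqrt(M))) cache-miss count quoted above."""
    if n == 1:
        A[ai][aj] = aggregate(A[ai][aj], combine(B[bi][bj], C[ci][cj]))
        return
    h = n // 2
    for p in (0, h):          # row half of A and B
        for q in (0, h):      # column half of A and C
            for r in (0, h):  # half of the inner (split) dimension
                mma(A, B, C, combine, aggregate, h,
                    ai + p, aj + q, bi + p, bj + r, ci + r, cj + q)
```

With `combine = operator.mul` and `aggregate = operator.add` this is ordinary matrix multiply-and-accumulate; VDP instead instantiates the two operators with the COMBINE and AGGREGATE of the problem at hand.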

4.2 I/O lower bound for computations with allowed recomputation

We extend our I/O lower bound analysis to general computations in which values may be recomputed by combining it with the dominator I/O lower bound technique by Hong and Kung [31].

Definition 5 ([31]).

Given $G=(V,E)$, a set $D\subseteq V$ is a dominator set for $V'\subseteq V$ if every path directed from any input vertex of $G$ to a vertex in $V'$ contains a vertex of $D$.

Dominator sets of R-vertices of a CDAG $G\in\mathcal{G}(n)$ must satisfy the following property:

Lemma 6.

Given $G=(V,E)\in\mathcal{G}(n)$, let $A=\{v_{i_1,j_1},v_{i_2,j_2},\ldots\}$ be a subset of R-vertices, where $v_{i_k,j_k}$ is the R-vertex corresponding to subproblem $S(i_k,j_k)$. Any dominator set $D$ of $A$ has cardinality $|D|\ge\min\left(|A|,\min_k(j_k-i_k)\right)$.

Proof.

Suppose that, for every $v_{i,j}\in A$, $D$ contains a vertex internal to the tree-CDAG rooted at $v_{i,j}$. As the tree-CDAGs of different R-vertices are disjoint, this would imply $|D|\ge|A|$.

Otherwise, there must be some $v_{i,j}\in A$ such that $D$ does not contain any of its internal vertices. Let $L_{i,j}=\{l_0,l_1,\ldots,l_{j-i-1}\}$ denote the set of L-vertices corresponding to $v_{i,j}$. It is sufficient to show that there are $j-i$ vertex-disjoint paths from the inputs of $G$ to vertices in $L_{i,j}$. By construction, each leaf $l_k\in L_{i,j}$ has a predecessor belonging to a unique column $c_{i+k}$. In turn, each of these predecessors is connected through its internal tree-CDAG to the input vertex in the same column ($v_{i+k,i+k}$). In this manner, $j-i$ paths $(v_{i+k,i+k}\rightsquigarrow v_{i,i+k}\to l_k)$ for $0\le k\le j-i-1$ are obtained. Since each path contains R-vertices from a different column, and the tree-CDAGs of different R-vertices are disjoint, the paths outlined above are vertex-disjoint.

Theorem 7.

For any CDAG $G=(V,E)\in\mathcal{G}(n)$, let $\mathcal{A}$ denote an algorithm corresponding to it. The I/O complexity of $\mathcal{A}$, when run using a cache memory of size $M$, and where for each I/O operation it is possible to move up to $B$ memory words stored in consecutive memory locations from cache to slow memory or vice versa, is:

$$IO_G(n,M,B)\ \ge\ \max\left\{\frac{(n-6M-1)^3}{18\sqrt{M}}-2M,\ n\right\}\frac{1}{B} \qquad (6)$$
Proof.

To simplify our analysis, we only consider parsimonious execution schedules, i.e., schedules in which, each time an intermediate result is computed, it is used to compute at least one of the values of which it is an operand before being removed from memory (either the cache or slow memory). Any non-parsimonious schedule $\mathcal{C}$ can be reduced to a parsimonious schedule $\mathcal{C}'$ by removing all the steps that violate the definition of parsimonious computation; $\mathcal{C}'$ therefore has no more computational or I/O operations than $\mathcal{C}$. Hence, restricting the analysis to parsimonious computations leads to no loss of generality.

We prove the result for the case $B=1$. The result then trivially generalizes to a generic $B$. The fact that $IO_{\mathcal{A}}(n,M,1)\ge n$ follows trivially from our assumption that the input is initially stored in slow memory and must therefore be loaded into the cache at least once, using $n$ read I/O operations.

Let $L(x)$ be the set of L-vertices in $G$ whose predecessor R-vertices both correspond to subproblems $S(i,j)$ such that $x\le j-i$. By construction, for each subproblem $S(i,j)$ the corresponding tree sub-CDAG has $\max\{0,\,j-i-2x\}$ such L-vertices. Since the sub-CDAGs corresponding to each $S(i,j)$ do not share vertices, we have:

$$|L(x)| = \sum_{j-i=1}^{n-1}\big(n-(j-i)\big)\max\{0,\,j-i-2x\} = \sum_{j-i=2x+1}^{n-1}\big(n-(j-i)\big)\big(j-i-2x\big) = \sum_{y=1}^{n-1-2x}(n-2x-y)\,y = \frac{(n-2x-1)(n-2x)(n-2x+1)}{6} > \frac{(n-2x-1)^3}{6}.$$

Let $\mathcal{C}$ be any computational schedule for the sequential execution of $\mathcal{A}$ using a cache of size $M$. We partition $\mathcal{C}$ into segments $\mathcal{C}_1,\mathcal{C}_2,\ldots$ such that during each $\mathcal{C}_l$ exactly $6M^{1.5}$ values corresponding to distinct vertices in $L(3M)$ are computed from their operands (i.e., not read from slow memory). Let $L_l$ denote the set of vertices corresponding to these values. Below, we argue that the number $g_l$ of I/O operations executed during each of the $\left\lfloor|L(3M)|/6M^{1.5}\right\rfloor$ non-overlapping segments $\mathcal{C}_l$ is at least $2M$, whence the statement follows.

Let $A$ denote the set of vertices corresponding to the values computed during $\mathcal{C}_l$. Clearly $L_l\subseteq A$. Let $D$ denote the set of vertices corresponding either to the at most $M$ values stored in the cache at the beginning of $\mathcal{C}_l$, or to the values read into the cache during $\mathcal{C}_l$ by means of read I/O operations. Thus $g_l\ge|D|-M$. In order for it to be possible to compute the values corresponding to vertices in $A$ during $\mathcal{C}_l$, every path from an input vertex of $G$ to a vertex in $A$ must contain a vertex of $D$; that is, $D$ must be a dominator set of $A$.

We refer to the set of R-vertices to which at least one of the vertices in $L_l$ belongs as $R_l$. These correspond to vertices that are active during $\mathcal{C}_l$. We consider three mutually exclusive cases:

  • (a) At least one of the values corresponding to a vertex $v\in R_l$ is entirely computed during $\mathcal{C}_l$: that is, no accumulator of the values corresponding to the leaves of $v$ is in the cache, or loaded into it by means of a read I/O operation, during $\mathcal{C}_l$. Thus $v\in A$, and $D$ does not include any non-leaf vertex internal to the tree sub-CDAG rooted in $v$. Since $L_l\subseteq L(3M)$, by definition, $v$ corresponds to a subproblem $S(i,j)$ such that $j-i\ge 3M$. From Lemma 6, $|D|\ge 3M$, which implies $g_l\ge 2M$.

  • (b) $|R_l|>4M$: as none of the values corresponding to $R_l$ is entirely computed during $\mathcal{C}_l$ (otherwise case (a) applies), and as we are considering parsimonious computations, for each vertex in $R_l$ at least one partial accumulator of its value is either in the cache at the beginning (resp., end) of $\mathcal{C}_l$, or is loaded into the cache (resp., saved to the slow memory) during it. Hence, $g_l\ge 4M-2M=2M$.

  • (c) $|R_l|\le 4M$: by Lemma 3, the vertices in $L_l$ have at least $3M$ distinct R-vertex predecessors. Since $L_l\subseteq L(3M)$, by definition, all such R-vertices correspond to subproblems $S(i,j)$ with $j-i\ge 3M$. As the values of these vertices are either in the cache at the beginning of $\mathcal{C}_l$, read into the cache during $\mathcal{C}_l$, or computed during $\mathcal{C}_l$, $D$ must be a dominator set for the corresponding vertices. From Lemma 6, we can thus conclude that $|D|\ge 3M$, whence $g_l\ge 2M$.

By Theorem 7, we have that for $M\le c_1n$, where $c_1$ is a sufficiently small positive constant, computations of CDAGs in $\mathcal{G}(n)$ require $\Omega\left(\frac{n^3}{\sqrt{M}B}\right)$ I/O operations. This is in contrast with the result given in Theorem 4 for schedules with no recomputation, which exhibit I/O complexity $\Omega\left(\frac{n^3}{\sqrt{M}B}\right)$ for values of $M$ up to $c_2n^2$ for an appropriately chosen constant $c_2$.

On the tightness of the bound.

When allowing recomputation, if $M<c_1n$, where $c_1$ is a sufficiently small positive constant, the number of I/O operations executed by Valiant’s DP-Closure Algorithm asymptotically matches our lower bound in Theorem 7. Thus, for $M<c_1n$, allowing recomputation does not asymptotically change the required number of I/O operations. On the other hand, if $M\ge 2n$, our lower bound simplifies to $\Omega(n/B)$, diverging from the complexity of Valiant’s DP-Closure Algorithm.

In Theorem 8, we show that for any algorithm corresponding to a CDAG $G\in\mathcal{G}(n)$, given a cache memory of size $M\ge 2n$, there exists an execution schedule, in which the results of subproblems are computed according to a recursive top-down order, that executes only $\lceil n/B\rceil+1$ I/O operations, thus asymptotically matching the general lower bound given in Theorem 7. Crucially, such a schedule involves the recomputation of intermediate results in order to reduce the number of I/O operations otherwise necessary, as established in Theorem 4.

Algorithm 4 Top-down recursive computation schedule for an algorithm $\mathcal{A}$ corresponding to $G\in\mathcal{G}(n)$.
Theorem 8.

Using a cache memory of size $M\ge 2n$, there exists a computational schedule executing $\mathcal{A}$ that incurs $\lceil n/B\rceil+1$ I/O operations, where for each I/O operation it is possible to move up to $B$ memory words stored in consecutive memory locations from the cache to slow memory or vice versa.

Proof.

The proposed schedule follows the steps outlined in Algorithm 4: in the initialization phase, the $n$ input values are read into the cache using $\lceil n/B\rceil$ read I/O operations. These values are held in the cache until the very end of the schedule, occupying $n$ out of the $M\ge 2n$ available memory locations. Once the final result $S(1,n)$ is computed, it is written to the slow memory by means of a write I/O operation.

We argue that the invocation of the procedure COMPUTE(1,n) uses at most $n$ additional cache memory locations, thus not requiring the execution of any additional I/O operation. Let $\ell(j-i)$ denote the memory usage of an invocation of COMPUTE(i,j). In the base case, $\ell(1)=1$, as COMPUTE simply combines and returns the values $S(i,i)$ and $S(j,j)$, which are already stored in the cache after the initialization phase. In the general case, due to the use of accumulators, the sequence of recursive calls, and the memory management used by the schedule, we have:

$$\ell(j-i) = 1+\max\{\ell(j-i-1),\ 1+\ell(j-i-2)\} = 1+\ell(j-i-1) = j-i-1+\ell(1) = j-i.$$

The theorem follows, as $\ell(n-1)=n-1\le n$.

Interestingly, while the proposed computation schedule achieves an asymptotic polynomial reduction of the number of I/O operations, by a $\Theta\left(\frac{n^2}{\sqrt{M}}\right)$ multiplicative factor, using recomputation, this comes at the cost of an exponential increase in the number of computational operations executed ($2^{\mathcal{O}(n)}$), compared with Valiant’s DP-Closure Algorithm, which requires only $\mathcal{O}(n^3)$ operations.
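The following hypothetical Python rendering of the schedule of Algorithm 4 makes the trade-off concrete (`S_base` holds the $n$ cache-resident input values $S(i,i)$; `combine`/`aggregate` are the problem's COMBINE/AGGREGATE, with `aggregate` used as a binary operator):

```python
def compute(i, j, S_base, combine, aggregate):
    """Top-down schedule with recomputation (sketch of Algorithm 4).

    Only the inputs S(i, i) are kept resident; every other subproblem
    is recomputed on demand, so besides the inputs only an accumulator
    and one partial value per recursion level are live at any time
    (matching the l(j - i) <= j - i bound of Theorem 8), while the
    operation count satisfies T(m) >= 2 T(m - 1), i.e., 2^Theta(n)."""
    if i == j:
        return S_base[i]
    acc = None
    for k in range(j - i):                  # one L-vertex per split
        left = compute(i, i + k, S_base, combine, aggregate)
        right = compute(i + k + 1, j, S_base, combine, aggregate)
        leaf = combine(left, right)
        acc = leaf if acc is None else aggregate(acc, leaf)
    return acc
```

A single call `compute(1, n, ...)` thus realizes, as a sketch, the schedule's $\lceil n/B\rceil+1$ I/O operations once the inputs are cache-resident, trading I/O for the $2^{\mathcal{O}(n)}$ recomputation noted above.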

4.3 I/O analysis for notable Dynamic Programming algorithms

The general result in Section 4 can be adapted to derive asymptotically tight bounds on the I/O complexity of classic Dynamic Programming algorithms for notable problems such as Matrix Chain Multiplication, Optimal Polygon Triangulation, and the construction of optimal binary search trees. This is achieved by mapping each of these algorithms onto Prototype Algorithm 1, defining the COMBINE, AGGREGATE, and LEAST_OPTIMAL_VALUE appropriately. Due to space limitations, these results are presented in the online full version of this work [21, Appendix A].

5 I/O analysis for the Cocke-Younger-Kasami algorithm

In this section, we present I/O bounds for the Cocke-Younger-Kasami algorithm (CYK). First, we provide some background on the algorithm. We then discuss how our lower bound analysis method must be enhanced to better capture the complexity of this algorithm. Finally, we present an upper bound.

5.1 Background on the CYK algorithm

Definition 9 ([41]).

A Context-Free Grammar is a 4-tuple $(V,\mathcal{R},\Sigma,T)$ where

  • V is a finite set called the variables;

  • Σ is a finite set, disjoint from V, called the terminals;

  • $\mathcal{R}$ is a finite relation in $V\times(V\cup\Sigma\cup\{\epsilon\})^*$, where $^*$ represents the Kleene star operation and $\epsilon$ denotes the empty string;

  • $T\in V$ is the start variable.

Elements of $\mathcal{R}$ are referred to as rules and are generally given in the form $A\to w$, where $A\in V$ and $w\in(V\cup\Sigma\cup\{\epsilon\})^*$. If $u,v,w\in(V\cup\Sigma\cup\{\epsilon\})^*$ and $A\to w$ is a rule of the grammar, we say that $uAv$ yields $uwv$, written as $uAv\Rightarrow uwv$. We say that $u$ derives $v$, written $u\Rightarrow^* v$, if $u=v$ or if there exists a sequence $u_1,u_2,\ldots,u_k$, for $k\ge 1$, such that $u_i\in(V\cup\Sigma\cup\{\epsilon\})^*$ and $u\Rightarrow u_1\Rightarrow u_2\Rightarrow\cdots\Rightarrow u_k\Rightarrow v$. The language of a CFG is the set of all strings $w\in\Sigma^*$ that can be derived from the start variable $T$, that is, $\{w\in\Sigma^*\ \text{s.t.}\ T\Rightarrow^* w\}$. A CFG is in Chomsky Normal Form (CNF) if every rule is of the form $A\to BC$, $A\to a$, or $T\to\epsilon$, where $A$ is any variable in $V$, $B$ and $C$ can be any pair of variables in $V\setminus\{T\}$, and $a$ can be any terminal in $\Sigma$. Any CFG can be transformed into an equivalent one in CNF through a simple iterative algorithm [41]. We refer to the number of rules $|\mathcal{R}|$ as the size of a CFG. We denote the set of binary rules ($A\to BC$) in a CFG with $\mathcal{R}_B$, the set of terminal rules ($A\to a$) with $\mathcal{R}_T$, and the number of distinct right-hand sides in $\mathcal{R}_B$ with $\Gamma$.

Given a CFG $(V,\mathcal{R},\Sigma,T)$ in CNF and a string $w=w_1w_2\cdots w_n$, the Cocke-Younger-Kasami (CYK) algorithm, henceforth referred to as $\mathcal{A}_{CYK}$, decides whether $w$ is part of the language of the CFG. We refer to the standard implementation provided in Algorithm 5. Let $S(i,j)$ be the set of variables that derive the substring $w_iw_{i+1}\cdots w_j$. CYK uses a DP approach to compute all sets $S(i,j)$ for $1\le i\le j\le n$ using an adaptation of Prototype Algorithm $\mathcal{A}$. $\mathcal{A}_{CYK}$ determines whether $w$ is a member of the language generated by the given CFG by checking whether $T\in S(1,n)$ (i.e., whether $T$ derives $w$).

Algorithm 5 Pseudocode for the Cocke-Younger-Kasami algorithm.
Memory representation model.

We assume that the variables in $V$ can be referenced by index and that such an index can be stored in a single memory word. We represent subsets of $V$ (the results of subproblems $S(i,j)$) using a one-hot encoding with $\lceil|V|/s\rceil$ memory words, where $s$ denotes the number of bits in a memory word. Thus, given the representation of one such subset in memory, to check whether a variable $V_i$ is a member of the subset it is sufficient to access the $(i\bmod s)$-th bit of the $\lfloor i/s\rfloor$-th memory word.

We assume that each grammar rule is represented using at most three memory words: one to represent the index of the left-hand side, and at most two to represent the right-hand side (either the indices of the two variables or the code for a terminal symbol).
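A minimal Python sketch of Algorithm 5 under this representation may be useful (assumed, for illustration only: a Python integer stands in for the $\lceil|V|/s\rceil$-word one-hot encoding, so that membership of variable $A$ is bit $A$ of the integer, and the rule formats are hypothetical):

```python
def cyk(w, binary_rules, terminal_rules, start):
    """Decides membership of w in the language of a CNF grammar.

    binary_rules: list of (A, B, C) for rules A -> BC, with variables
    given as integer indices; terminal_rules: list of (A, a) for rules
    A -> a; start: index of the start variable T. Each S(i, j) is the
    bitset of variables deriving the substring w_i..w_j.
    """
    n = len(w)
    S = {}
    for i in range(1, n + 1):                  # subproblems S(i, i)
        S[(i, i)] = 0
        for A, a in terminal_rules:
            if a == w[i - 1]:
                S[(i, i)] |= 1 << A
    for l in range(1, n):                      # growing lengths
        for i in range(1, n - l + 1):
            j = i + l
            cur = 0
            for k in range(l):                 # split points
                left, right = S[(i, i + k)], S[(i + k + 1, j)]
                for A, B, C in binary_rules:
                    if (left >> B) & 1 and (right >> C) & 1:
                        cur |= 1 << A
            S[(i, j)] = cur
    return (S[(1, n)] >> start) & 1 == 1
```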

5.2 I/O lower bound

While the CYK algorithm shares the underlying structure of relations between subproblems of the prototype algorithm $\mathcal{A}$, it presents several non-trivial complications due to the application of the grammar rules that, in turn, require a refinement of our analysis methods and, in particular, of the CDAG construction. In this section, we assume that the set of grammar rules is such that no two rules have the same right-hand side. While this might not always hold in practice, our analysis can still be applied by considering a subset of the rules of the considered grammar which satisfies this condition.

Figure 3: Each subproblem $S(i,j)$ is represented by multiple variable roots (in black), each of which is the result of the composition of multiple grammar roots (in purple), which in turn are the composition of multiple leaves (in green).
CYK CDAG construction.

The construction of the CDAG representing CYK’s execution for an input string $w$ with $n$ characters and a CFG in CNF $(V,\mathcal{R},\Sigma,T)$, called $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$, follows a recursive structure which uses as a basis $G=(V,E)\in\mathcal{G}(n)$. We describe the necessary modifications focusing on the sub-CDAG for a single $S(i,j)$ subproblem. A visual depiction of the ensuing description is provided in Figure 3:

  • For the base of the construction, for each $S(i,i)$, the CDAG has $\lceil|V|/s\rceil$ vertices, each corresponding to part of the encoding of the set of possible variables of the grammar. These vertices serve as the inputs of the CDAG.

  • Each R-vertex of $G$, corresponding to the computation of the solution of one of the subproblems $S(i,j)$, is replaced by $\lceil|V|/s\rceil$ vertices, each representing a part of the encoding of the set of variables $V$. We refer to this set of vertices as “variable roots” ($\mathit{VR}$).

  • Consider a variable root $v$ corresponding to subproblem $S(i,j)$ and variable $A\in V$: $v$ forms the root of a binary tree, which has as its leaves vertices for each rule having $A$ as its left-hand side. We denote the set of these vertices as “grammar roots” ($\mathit{GR}$). Note that, as only one variable can appear on the left-hand side of a rule, the trees composed in this way are vertex-disjoint. Further, for each $S(i,j)$ there will be at least $\Gamma/s$ such vertices.

  • For each subproblem $S(i,j)$, consider one of its grammar roots $u$, corresponding to a rule $\gamma\in\mathcal{R}_B$: $u$ forms the root of a binary tree combining the results of $j-i$ “leaf vertices” (L-vertices). Each leaf vertex corresponds to a combination of the subproblems $S(i,k)$ and $S(k+1,j)$, for $i\le k<j$. Such a vertex has as predecessors the variable-root vertices of subproblems $S(i,k)$ and $S(k+1,j)$ encoding the variables appearing in the right-hand side of rule $\gamma$.

Given the CDAG $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$, we generalize the notion of rows (resp., columns) introduced in Section 3 by having row $r_i$ (resp., column $c_j$) include all vertices in $\mathit{VR}$ corresponding to the encoding of the results of subproblems $S(i,j)$ for $1\le j\le n$ (resp., for $1\le i\le n$). The definition of the W-cover generalizes to $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$ as well. We observe that a property analogous to that outlined in Lemma 1 holds for $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$:

Lemma 10.

Given a CDAG $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$, let $W_1,\ldots,W_{n-1}$ denote its W-cover. Any set of $x$ $\mathit{VR}$-vertices in $W_i$ contains at most $x^2/4$ interacting pairs, each of which forms the predecessors of a unique L-vertex belonging to a unique $\mathit{GR}$-vertex.

Proof.

Consider any set of $x$ $\mathit{VR}$-vertices in $W_i$ containing $x_c$ vertices in column $i$ and $x-x_c$ vertices in row $i+1$. Since vertices in the same row (or column) do not interact, there are at most $x_c(x-x_c)\le x^2/4$ interacting pairs.

Consider any two such distinct interacting pairs, $(v^A_{k,i},v^B_{i+1,l})$ and $(v^C_{m,i},v^D_{i+1,n})$ (where $v^A_{k,i}$ is the $\mathit{VR}$-vertex corresponding to variable $A$ in row $k$ and column $i$). Any leaf produced by $(v^A_{k,i},v^B_{i+1,l})$ must correspond to a rule with right-hand side “$AB$”. From the assumption that every rule in $\mathcal{R}$ has a unique right-hand side, it follows that there is only one such rule and therefore $(v^A_{k,i},v^B_{i+1,l})$ must produce a single leaf.

To show that the leaves produced by $(v^A_{k,i},v^B_{i+1,l})$ and $(v^C_{m,i},v^D_{i+1,n})$ belong to distinct $\mathit{GR}$-vertices, consider the following two cases:

  1. If $k\ne m$ or $l\ne n$, it follows from Lemma 1 that the leaves produced belong to different roots, and therefore must also belong to different grammar roots.

  2. Otherwise, it must be the case that $A\ne C$ or $B\ne D$ (since the two interacting pairs are assumed to be distinct). It follows that the two leaves produced correspond to different rules in $\mathcal{R}$ and therefore belong to distinct grammar roots.

I/O lower bound proof.

We analyze the I/O complexity of any sequential computation of the CYK algorithm by analyzing the CDAG corresponding to the given CFG and the input string to be considered. We first state two results, corresponding, respectively, to modifications of Lemma 3 and Lemma 6 for the family of CDAGs representing executions of the CYK algorithm:

Lemma 11.

Given a CDAG $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$, let $L'\subseteq L$ be such that the vertices in $L'$ belong to the tree sub-DAGs of at most $\rho$ distinct $\mathit{GR}$-vertices. The set $R'\subseteq\mathit{VR}$ that includes all $\mathit{VR}$-vertices that are predecessors of at least one vertex in $L'$ has cardinality at least $|L'|/\sqrt{\rho}$.

Proof.

Consider the W-cover of $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$, with the adapted definition for the CYK CDAG discussed in Section 5.2. As any $\mathit{VR}$-vertex can be a member of at most two distinct $W_i$’s, we have:

$$|R'|\ \ge\ \frac{1}{2}\sum_{i=1}^{n-1}|R'\cap W_i|. \qquad (7)$$

By definition, each vertex in $L'$ has a distinct pair of $\mathit{VR}$-vertex predecessors which, by construction, are both included in exactly one set $W_i$. Let $a_i$ be the number of vertices from $L'$ with both predecessors in $W_i$. By the assumptions, we have:

$$\sum_{i=1}^{n-1}a_i=|L'|. \qquad (8)$$

By Lemma 10, for any $W_i$, the interacting pairs of $\mathit{VR}$-vertices in it are the predecessors of distinct L-vertices, each belonging to the tree sub-CDAG of a distinct $\mathit{GR}$-vertex. From the assumption that the vertices in $L'$ belong to at most $\rho$ distinct grammar roots, we have:

$$\max_{i=1,\ldots,n-1}a_i\ \le\ \rho. \qquad (9)$$

Furthermore, from Lemma 10, we know that the vertices in $R'\cap W_i$ can interact to produce at most $|R'\cap W_i|^2/4$ vertices in $L'$. Thus, $|R'\cap W_i|^2/4\ge a_i$ for each $i$, and hence:

$$\sum_{i=1}^{n-1}|R'\cap W_i|\ \ge\ \sum_{i=1}^{n-1}2\sqrt{a_i}. \qquad (10)$$

By (7), (10), (8), (9), and Lemma 2 (in this order), we have:

$$|R'|\ \ge\ \frac{1}{2}\sum_{i=1}^{n-1}|R'\cap W_i|\ \ge\ \sum_{i=1}^{n-1}\sqrt{a_i}\ \ge\ \frac{|L'|}{\sqrt{\rho}}.$$

The proof follows a reasoning analogous to that used to prove Lemma 3, adapted for the CDAG $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$ by using the property of the L-vertices of the CYK CDAG stated in Lemma 10 in place of Lemma 1.

Lemma 12.

Given $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$, let $A=\{v_{i_1,j_1},v_{i_2,j_2},\ldots\}$ be a subset of $\mathit{GR}$-vertices, where $v_{i_k,j_k}$ is a $\mathit{GR}$-vertex corresponding to subproblem $S(i_k,j_k)$. Any dominator set $D$ of $A$ has cardinality $|D|\ge\min\left(|A|,\min_k(j_k-i_k)\right)$.

The proof corresponds to that of Lemma 6, where the tree sub-CDAGs rooted in $\mathit{GR}$-vertices are considered instead of those rooted in the R-vertices of CDAGs in $\mathcal{G}(n)$.

A lower bound to the I/O complexity of the CYK algorithm can be obtained by following steps analogous to those used in the proof of Theorem 7 with opportune adjustments matching the modifications in the enhanced CDAG described in this section.

Theorem 13.

Consider a string $w$ of length $n$ and a CFG $(V,\mathcal{R},\Sigma,T)$ in CNF. The number of I/O operations executed by $\mathcal{A}_{CYK}$ when deciding whether $w$ is a member of the language of the given grammar, using a machine equipped with a cache memory of size $M$ and such that $B$ contiguous memory words can be moved with each I/O operation, is bounded as:

$$IO_{G_{CYK}}(n,M,B)\ \ge\ \max\left\{\frac{(n-6M-1)^3}{18\sqrt{M}}\Gamma-2M\Gamma,\ n\right\}\frac{1}{B}$$

where $\Gamma$ denotes the number of distinct right-hand sides in $\mathcal{R}_B$.

By Theorem 13, we have that for $M\le c_1n$, where $c_1$ is a sufficiently small positive constant, computations of the CYK algorithm require $\Omega\left(\frac{n^3\Gamma}{\sqrt{M}B}\right)$ I/O operations. Note that if the considered CFG is such that $\Gamma=\Theta(|\mathcal{R}_B|)$, our lower bound shows a direct dependence of the I/O complexity of executions of the CYK algorithm on the size of the considered grammar; indeed, there are many grammars for which this is the case. Finally, while the result of Theorem 13 holds for computations with recomputation, it is also possible to obtain an I/O lower bound for schedules in which recomputation is not allowed by modifying the analysis given in Theorem 4.

5.3 On the tightness of the bounds

We refine Valiant’s DP-Closure Algorithm [13] to obtain an I/O-efficient recursive divide-and-conquer implementation of the CYK algorithm, henceforth referred to as $\mathcal{A}^*_{CYK}$. While the CYK algorithm follows the general structure of Prototype Algorithm 1, and is hence amenable to the approach in [13], the analysis in [13] does not directly apply to the case in which the size of the given grammar cannot be considered constant with respect to the size of the cache. Our algorithm $\mathcal{A}^*_{CYK}$ manages I/O efficiently by opportunely iterating over the grammar rules and by removing redundant computations due to grammar rules with the same right-hand side.

Given an input string $w$ of length $n\ge 1$ and a CFG $(V,\mathcal{R},\Sigma,T)$ in CNF, we use an $n\times n$ matrix $\mathbf{X}$ to represent the values of the subproblems computed by the CYK algorithm (corresponding to the $\mathit{VR}$-vertices defined in Section 5.2): $\mathbf{X}[i,j+1]$ corresponds to the solution of subproblem $S(i,j)$, i.e., a sequence of $|V|$ binary values, where the entry corresponding to variable $A\in V$ indicates whether $A$ can derive the substring $w_iw_{i+1}\cdots w_j$ of the input.

First, $\mathcal{A}^*_{CYK}$ computes the input values corresponding to subproblems $S(i,i)$ for $1\le i\le n$ (i.e., the values of $\mathbf{X}[i,i+1]$): each character in the input string $w$ is compared with the right-hand side of each terminal rule in $\mathcal{R}_T$, and the appropriate values are stored. The rest of the computation involves executing Valiant’s DP-Closure Algorithm (Algorithm 2) using a modified version of Valiant’s Star Algorithm (Algorithm 3), called “Valiant’s Star Algorithm for CYK” ($\mathrm{VSTAR}_{CYK}$), which is presented in Algorithm 6.

Algorithm 6 $\mathrm{VSTAR}_{CYK}$, Valiant’s Star Algorithm for CYK.

$\mathrm{VSTAR}_{CYK}$ introduces two modifications over Valiant’s Star Algorithm. First, the algorithm iterates over the rules of the grammar with distinct right-hand sides in lines 8, 11, and 14, calling the subroutine Matrix Multiply and Accumulate with the variable roots corresponding to each rule in the grammar. Second, to avoid computing rules with identical right-hand sides more than once, let $\mathcal{B}_\Gamma$ denote a maximum-cardinality set of binary rules from $\mathcal{R}_B$, each having a distinct right-hand side; in this set, the corresponding left-hand sides are replaced with placeholder variables. For any subproblem, once the value corresponding to a rule in $\mathcal{B}_\Gamma$ is fully computed, it can be used to update the values corresponding to all variable roots producing the same right-hand side through a rule in $\mathcal{R}_B$. This is done in lines 4 and 5, as reaching the base case implies that all leaves for the subproblem have been computed.
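A hypothetical sketch of the preprocessing implied by the second modification (the names `dedup_rhs`, `B_gamma`, and `lhs_for` are illustrative and not taken from the paper):

```python
from collections import defaultdict

def dedup_rhs(binary_rules):
    """Group the binary rules R_B by right-hand side, keeping one
    placeholder rule per distinct RHS (the set B_Gamma, of cardinality
    Gamma) together with the original left-hand sides it stands for."""
    lhs_for = defaultdict(list)
    for A, B, C in binary_rules:
        lhs_for[(B, C)].append(A)
    # placeholder variable p indexes the Gamma distinct right-hand sides
    B_gamma = [(p, B, C) for p, (B, C) in enumerate(sorted(lhs_for))]
    return B_gamma, lhs_for

# Once the value of placeholder p is fully computed for a subproblem,
# it is propagated to every variable in lhs_for[(B, C)] (cf. lines 4-5
# of Algorithm 6), so each distinct RHS is evaluated only once.
```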

Theorem 14.

Given an input string $w$ of length $n$ and a CFG $(V,\mathcal{R},\Sigma,T)$ in CNF, algorithm $\mathcal{A}^*_{CYK}$ decides whether $w$ is a member of the language of the given grammar. When run on a machine equipped with a cache of size $M$, the number of I/O operations executed by $\mathcal{A}^*_{CYK}$ can be bounded as:

$$IO_{\mathcal{A}^*_{CYK}}(n,M)\in\mathcal{O}\left(\left(\frac{n^3\Gamma}{\sqrt{M}}+n^2\Gamma\log M+n^2|\mathcal{R}_B|+n|\mathcal{R}_T|\right)\frac{1}{B}\right).$$
Proof.

When computing $S(i,i)$ for $1\le i\le n$, $\mathcal{A}^*_{CYK}$ compares each character of the input string $w$ with the right-hand side of each terminal rule in $\mathcal{R}_T$. Doing so incurs at most $\mathcal{O}(n|\mathcal{R}_T|/B)$ I/O operations.

In the following, we use $n$ to denote the dimension of the input (i.e., $n=\dim(\mathbf{X})$). The number of I/O operations executed by Valiant’s Star Algorithm for CYK is at most:

$$IO_{\mathrm{VSTAR}_{CYK}}(n)\le\begin{cases}\mathcal{O}\left(\frac{|\mathcal{R}_B|}{B}\right)&\text{if } n\le 2,\\ 4\,IO_{\mathrm{VSTAR}_{CYK}}\left(\frac{n}{2}\right)+4\Gamma\left(IO_{MMA}\left(\frac{n}{4}\right)+\mathcal{O}(1)\right)&\text{otherwise.}\end{cases} \qquad (11)$$

By [13, Lemma 2.1], the number of I/O operations executed by Matrix Multiply and Accumulate is:

$$IO_{MMA}(n)\le\begin{cases}\mathcal{O}\left(\frac{n^2}{B}\right)&\text{if } n\le 2\sqrt{M},\\ \mathcal{O}\left(\frac{n^3}{B\sqrt{M}}\right)&\text{otherwise.}\end{cases} \qquad (12)$$

Solving the recursion in (11) using (12) yields:

$$IO_{\mathrm{VSTAR}_{CYK}}(n)\ \le\ \mathcal{O}\left(\frac{n^3\Gamma}{B\sqrt{M}}\right)+4^{\log(n/\sqrt{M})}\,\mathcal{O}\left(\frac{M\Gamma}{B}\log M+\frac{M|\mathcal{R}_B|}{B}\right)\ \le\ \mathcal{O}\left(\frac{n^3\Gamma}{B\sqrt{M}}+\frac{n^2\Gamma}{B}\log M+\frac{n^2|\mathcal{R}_B|}{B}\right) \qquad (13)$$

Finally, by adapting the analysis in [13], we can bound the number of I/O operations executed by the proposed algorithm $\mathcal{A}^*_{CYK}$ as:

$$IO_{\mathcal{A}^*_{CYK}}(n)\le\begin{cases}\mathcal{O}(1)&\text{if } n\le 2,\\ 2\,IO_{\mathcal{A}^*_{CYK}}(n/2)+IO_{\mathrm{VSTAR}_{CYK}}(n)+\mathcal{O}(1)&\text{otherwise.}\end{cases}$$

Expanding the recurrence, using (13), and adding the cost of processing the input, we have:

$$IO_{\mathcal{A}^*_{CYK}}(n)\in\mathcal{O}\left(\frac{n^3\Gamma}{B\sqrt{M}}+\frac{n^2\Gamma}{B}\log M+\frac{n^2|\mathcal{R}_B|}{B}+\frac{n|\mathcal{R}_T|}{B}\right).$$

When $M<c_1n$, where $c_1$ is a sufficiently small positive constant, the lower bound in Theorem 13 simplifies to $\Omega\left(\frac{n^3\Gamma}{B\sqrt{M}}\right)$, while the upper bound in Theorem 14 becomes $\mathcal{O}\left(\frac{n^3\Gamma}{B\sqrt{M}}+\frac{n^2|\mathcal{R}_B|}{B}+\frac{n|\mathcal{R}_T|}{B}\right)$. Hence, if $(n|\mathcal{R}_B|+|\mathcal{R}_T|)/\Gamma\in\mathcal{O}(n^2/\sqrt{M})$, the lower bound asymptotically matches the upper bound. When disallowing recomputation, a similar analysis shows that our upper and lower bounds match under the same condition for $M<c_2n$, where $c_2$ is a sufficiently small positive constant.

6 Conclusion

This work contributed to the characterization of the I/O complexity of Dynamic Programming algorithms by establishing asymptotically tight lower bounds for a general class of DP algorithms sharing a common structure of subproblem dependence. Our technique exploits common properties of the CDAGs corresponding to said algorithms, which makes it promising for the analysis of other families of recursive algorithms, in particular other families of DP algorithms of interest. The generality of our technique is further showcased by the ability to extend it to more complex algorithms, such as the Cocke-Younger-Kasami algorithm, for which we provide an (almost) asymptotically tight I/O lower bound and a matching algorithm.

Our analysis yields lower bounds both for computations in which no intermediate value is computed more than once and for general computations allowing recomputation. By doing so, we reveal that when the size of the available cache is at least $2n$ and $o(n^2)$, schedules using recomputation can achieve an asymptotic reduction of the number of required I/O operations by a factor $\Theta(n^2/\sqrt{M})$. This is particularly significant as in many cases of interest (e.g., Matrix Multiplication, Fast Fourier Transform) recomputation has been shown to enable a reduction of the I/O cost by at most a constant multiplicative factor.

Although it is known that recomputation can decrease the space and I/O complexity of certain CDAGs, we are still far from characterizing those CDAGs for which recomputation proves effective. This overarching objective remains a challenge for any efforts aimed at developing a general theory of the communication requirements of computations.

References

  • [1] Alok Aggarwal and Jeffrey S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, 1988. doi:10.1145/48529.48535.
  • [2] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Communication-optimal parallel and sequential Cholesky decomposition. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, pages 245–252, 2009.
  • [3] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Graph expansion and communication costs of fast matrix multiplication. Journal of the ACM (JACM), 59(6):1–23, 2013. doi:10.1145/2395116.2395121.
  • [4] Richard Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–515, 1954.
  • [5] Gianfranco Bilardi and Lorenzo De Stefani. The I/O complexity of Strassen’s matrix multiplication with recomputation. In Workshop on Algorithms and Data Structures, pages 181–192. Springer, 2017. doi:10.1007/978-3-319-62127-2_16.
  • [6] Gianfranco Bilardi and Lorenzo De Stefani. The I/O complexity of Toom-Cook integer multiplication. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2034–2052. SIAM, 2019. doi:10.1137/1.9781611975482.123.
  • [7] Gianfranco Bilardi and Lorenzo De Stefani. The DAG Visit Approach for Pebbling and I/O Lower Bounds. In Proc. FSTTCS 2022, volume 250, pages 7:1–7:23, Dagstuhl, Germany, 2022. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.FSTTCS.2022.7.
  • [8] Gianfranco Bilardi and Enoch Peserico. A characterization of temporal locality and its portability across memory hierarchies. In International Colloquium on Automata, Languages, and Programming, pages 128–139. Springer, 2001. doi:10.1007/3-540-48224-5_11.
  • [9] Gianfranco Bilardi, Andrea Pietracaprina, and Paolo D’Alberto. On the space and access complexity of computation DAGs. In Graph-Theoretic Concepts in Computer Science: 26th International Workshop, WG 2000 Konstanz, Germany, June 15–17, 2000 Proceedings 26, pages 47–58. Springer, 2000. doi:10.1007/3-540-40064-8_6.
  • [10] Gianfranco Bilardi and Franco P Preparata. Horizons of parallel computation. Journal of Parallel and Distributed Computing, 27(2):172–182, 1995. doi:10.1006/JPDC.1995.1080.
  • [11] Gianfranco Bilardi and Franco P Preparata. Processor—time tradeoffs under bounded-speed message propagation: Part II, lower bounds. Theory of Computing Systems, 32(5):531–559, 1999. doi:10.1007/S002240000131.
  • [12] Guy E. Blelloch and Yan Gu. Improved parallel cache-oblivious algorithms for dynamic programming and linear algebra. arXiv preprint, 2018. arXiv:1809.09330.
  • [13] Cary Cherng and Richard E Ladner. Cache efficient simple dynamic programming. Discrete Mathematics & Theoretical Computer Science, 2005.
  • [14] Rezaul Chowdhury, Pramod Ganapathi, Yuan Tang, and Jesmin Jahan Tithi. Provably efficient scheduling of cache-oblivious wavefront algorithms. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 339–350, 2017. doi:10.1145/3087556.3087586.
  • [15] Rezaul Chowdhury, Pramod Ganapathi, Stephen Tschudi, Jesmin Jahan Tithi, Charles Bachmeier, Charles E Leiserson, Armando Solar-Lezama, Bradley C Kuszmaul, and Yuan Tang. Autogen: Automatic discovery of efficient recursive divide-and-conquer algorithms for solving dynamic programming problems. ACM Transactions on Parallel Computing (TOPC), 4(1):1–30, 2017. doi:10.1145/3125632.
  • [16] Rezaul Alam Chowdhury and Vijaya Ramachandran. Cache-oblivious dynamic programming. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 591–600. Citeseer, 2006. URL: http://dl.acm.org/citation.cfm?id=1109557.1109622.
  • [17] Rezaul Alam Chowdhury and Vijaya Ramachandran. The cache-oblivious gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation. In Proceedings of the nineteenth annual ACM Symposium on Parallel Algorithms and Architectures, pages 71–80, 2007. doi:10.1145/1248377.1248392.
  • [18] Rezaul Alam Chowdhury and Vijaya Ramachandran. Cache-efficient dynamic programming algorithms for multicores. In Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures, pages 207–216, 2008. doi:10.1145/1378533.1378574.
  • [19] Rezaul Alam Chowdhury, Hai-Son Le, and Vijaya Ramachandran. Cache-oblivious dynamic programming for bioinformatics. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3):495–510, 2008.
  • [20] John Cocke. Programming languages and their compilers: Preliminary notes. New York University, 1969.
  • [21] Lorenzo De Stefani and Vedant Gupta. On the I/O complexity of the CYK algorithm and of a family of related DP algorithms. arXiv preprint arXiv:2410.20337, 2024. doi:10.48550/arXiv.2410.20337.
  • [22] Erik D Demaine, Andrea Lincoln, Quanquan C Liu, Jayson Lynch, and Virginia Vassilevska Williams. Fine-grained I/O complexity via reductions: New lower bounds, faster algorithms, and a time hierarchy. arXiv preprint arXiv:1711.07960, 2017. arXiv:1711.07960.
  • [23] Xiangyun Ding, Yan Gu, and Yihan Sun. Parallel and (nearly) work-efficient dynamic programming. In Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, pages 219–232, 2024. doi:10.1145/3626183.3659958.
  • [24] Aaron Dunlop, Nathan Bodenstab, and Brian Roark. Reducing the grammar constant: an analysis of CYK parsing efficiency. Technical report, Technical report CSLU-2010-02, OHSU, 2010.
  • [25] Reza Farivar, Harshit Kharbanda, Shivaram Venkataraman, and Roy H Campbell. An algorithm for fast edit distance computation on GPUs. In 2012 Innovative Parallel Computing (InPar), pages 1–9. IEEE, 2012.
  • [26] Matteo Frigo, Charles E Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-oblivious algorithms. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pages 285–297. IEEE, 1999. doi:10.1109/SFFCS.1999.814600.
  • [27] Zvi Galil and Kunsoo Park. Dynamic programming with convexity, concavity and sparsity. Theoretical Computer Science, 92(1):49–76, 1992. doi:10.1016/0304-3975(92)90135-3.
  • [28] Te Chiang Hu and Man Tak Shing. Computation of matrix chain products. Part I. SIAM Journal on Computing, 11(2):362–373, 1982. doi:10.1137/0211028.
  • [29] Shachar Itzhaky, Rohit Singh, Armando Solar-Lezama, Kuat Yessenov, Yongquan Lu, Charles Leiserson, and Rezaul Chowdhury. Deriving divide-and-conquer dynamic programming algorithms using solver-aided transformations. ACM SIGPLAN Notices, 51(10):145–164, 2016.
  • [30] Mohammad Mahdi Javanmard, Pramod Ganapathi, Rathish Das, Zafar Ahmad, Stephen Tschudi, and Rezaul Chowdhury. Toward efficient architecture-independent algorithms for dynamic programs. In High Performance Computing: 34th International Conference, ISC High Performance 2019, Frankfurt/Main, Germany, June 16–20, 2019, Proceedings 34, pages 143–164. Springer, 2019. doi:10.1007/978-3-030-20656-7_8.
  • [31] Jia-Wei Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 326–333, 1981.
  • [32] Tadao Kasami. An efficient recognition and syntax-analysis algorithm for context-free languages. Coordinated Science Laboratory Report no. R-257, 1966.
  • [33] Donald E. Knuth. Optimum binary search trees. Acta informatica, 1:14–25, 1971. doi:10.1007/BF00264289.
  • [34] Richard Koch, Tom Leighton, Bruce Maggs, and Satish Rao. Work-preserving emulations of fixed-connection networks. In Proceedings of the twenty-first annual ACM symposium on Theory of computing, pages 227–240, 1989.
  • [35] Martin Lange and Hans Leiß. To CNF or not to CNF? An efficient yet presentable version of the CYK algorithm. Informatica Didactica, 8(2009):1–21, 2009. URL: http://www.informatica-didactica.de/cmsmadesimple/index.php?page=LangeLeiss2009.
  • [36] J-S Park, Michael Penner, and Viktor K Prasanna. Optimizing graph algorithms for improved cache performance. IEEE Transactions on parallel and distributed systems, 15(9):769–782, 2004. doi:10.1109/TPDS.2004.44.
  • [37] Cynthia A Patterson, Marc Snir, and Susan L Graham. Getting up to speed: The future of supercomputing. National Academies Press, 2005.
  • [38] Vijaya Ramachandran. Cache-oblivious computation: Algorithms and experimental evaluation. In 2007 International Conference on Computing: Theory and Applications (ICCTA’07), pages 20–26. IEEE, 2007. doi:10.1109/ICCTA.2007.34.
  • [39] John E Savage. Extending the Hong-Kung model to memory hierarchies. In International Computing and Combinatorics Conference, pages 270–281. Springer, 1995. doi:10.1007/BFB0030842.
  • [40] Jacob Scott, Olga Holtz, and Oded Schwartz. Matrix multiplication I/O-complexity by path routing. In Proceedings of the 27th ACM symposium on Parallelism in Algorithms and Architectures, pages 35–45, 2015. doi:10.1145/2755573.2755594.
  • [41] Michael Sipser. Introduction to the Theory of Computation. ACM SIGACT News, 27(1):27–29, 1996. doi:10.1145/230514.571645.
  • [42] Xinying Song, Shilin Ding, and Chin-Yew Lin. Better binarization for the CYK parsing. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 167–176, 2008. URL: https://aclanthology.org/D08-1018/.
  • [43] Shanjiang Tang, Ce Yu, Jizhou Sun, Bu-Sung Lee, Tao Zhang, Zhen Xu, and Huabei Wu. Easypdp: An efficient parallel dynamic programming runtime system for computational biology. IEEE Transactions on Parallel and Distributed Systems, 23(5):862–872, 2011. doi:10.1109/TPDS.2011.218.
  • [44] Yuan Tang, Ronghui You, Haibin Kan, Jesmin Jahan Tithi, Pramod Ganapathi, and Rezaul A Chowdhury. Cache-oblivious wavefront: Improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 205–214, 2015.
  • [45] Leslie G. Valiant. General context-free recognition in less than cubic time. Journal of Computer and System Sciences, 10(2):308–315, 1975.
  • [46] F Frances Yao. Efficient dynamic programming using quadrangle inequalities. In Proceedings of the twelfth annual ACM symposium on Theory of computing, pages 429–435, 1980. doi:10.1145/800141.804691.
  • [47] Daniel H Younger. Recognition and parsing of context-free languages in time n3. Information and control, 10(2):189–208, 1967. doi:10.1016/S0019-9958(67)80007-X.