
On the I/O Complexity of the Cocke-Younger-Kasami Algorithm and of a Family of Related Dynamic Programming Algorithms

Lorenzo De Stefani (corresponding author), Department of Computer Science, Brown University, Providence, RI, USA; Vedant Gupta, Department of Computer Science, Brown University, Providence, RI, USA
Abstract

Asymptotically tight lower bounds are derived for the Input/Output (I/O) complexity of a class of dynamic programming algorithms, including matrix chain multiplication, optimal polygon triangulation, and the construction of optimal binary search trees. Assuming no recomputation of intermediate values, we establish an $\Omega\left(\frac{n^3}{\sqrt{M}B}\right)$ I/O lower bound, where $n$ denotes the size of the input and $M$ denotes the size of the available fast memory (cache). When recomputation is allowed, we show that the same bound holds for $M < cn$, where $c$ is a positive constant. In the case where $M \ge 2n$, we show an $\Omega(n/B)$ I/O lower bound. We also discuss algorithms for which the number of executed I/O operations matches asymptotically each of the presented lower bounds, which are thus asymptotically tight.

Additionally, we refine our general method to obtain a lower bound for the I/O complexity of the Cocke-Younger-Kasami algorithm, where the size of the grammar impacts the I/O complexity. An upper bound with asymptotically matching performance in many cases is also provided.

Keywords and phrases:
I/O complexity, Dynamic Programming Algorithms, Lower Bounds, Recomputation, Cocke-Younger-Kasami
Copyright and License:
© Lorenzo De Stefani and Vedant Gupta; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Design and analysis of algorithms
Related Version:
Full Version: https://arxiv.org/abs/2410.20337 [21]
Editors:
Pat Morin and Eunjin Oh

1 Introduction

The performance of computing systems is significantly influenced by data movement, affecting both time and energy consumption. This ongoing technological trend [37] is expected to continue as fundamental limits on minimum device size and maximum message speed lead to inherent costs associated with data transfer, be it across the levels of a hierarchical memory system or between processing elements in a parallel system [10]. While the communication requirements of algorithms have been extensively studied, deriving significant lower bounds based on these requirements and matching upper bounds remains a critical challenge.

Dynamic Programming (DP) is a classic algorithmic framework introduced by Bellman [4] that is particularly effective for optimization tasks in which a problem can be broken down into overlapping subproblems with optimal substructure, storing the solutions to these subproblems to prevent redundant work. DP is widely used in computer science and other important fields such as control theory, operations research, and computational biology.

In this work, we analyze the memory access cost (I/O complexity) of a family of DP algorithms when executed in a two-level storage hierarchy with M words of fast memory (cache). The analyzed algorithms share a similar structure of dependence on the results of subproblems and include, among others, the classic DP algorithms for matrix chain multiplication, boolean parenthesization, optimal polygon triangulation, and construction of optimal binary search trees. We obtain asymptotically tight lower bounds for the I/O complexity of these algorithms both for schedules in which no intermediate values are computed more than once and for general schedules allowing recomputation. Our analysis reveals that, for a range of values of the size of the cache, recomputing intermediate values can lead to an asymptotically lower number of I/O operations. We further refine our analysis to derive an I/O lower bound for the Cocke-Younger-Kasami (CYK) algorithm [20, 47, 32] and an (almost) asymptotically matching upper bound.

Previous and related work.

I/O complexity was introduced in the seminal work by Hong and Kung [31]; it denotes the number of data transfers between the two levels of a memory hierarchy with a cache of M words and a slow memory of unbounded size. Hong and Kung presented techniques to develop lower bounds for the I/O complexity of computations modeled by Computational Directed Acyclic Graphs (CDAGs).

While the “edge expansion technique” of [3], the “path routing technique” of [40], and the “closed dichotomy width” technique of [11] all yield I/O lower bounds that apply only to computational schedules for which no intermediate result is ever computed more than once (nr-computations), it is also important to investigate what can be achieved with recomputation. In fact, for some CDAGs, recomputing intermediate values reduces the space and/or the I/O complexity of an algorithm [39]. In [8], it is shown that some algorithms admit a portable schedule (i.e., a schedule that achieves optimal performance across memory hierarchies with different access costs) only if recomputation is allowed. Recomputation can also enhance the performance of simulations among networks (see [34] and references therein) and plays a key role in the design of efficient area-universal VLSI architectures. Several lower-bound techniques that allow recomputation have been presented in the literature, including the “S-partition technique” [31], the “S-span technique” [39], the “S-covering technique” [9] (which merges and extends aspects of both [31] and [39]), the “G-flow technique” [5, 6], and the “visit partition technique” [7]. However, to the best of our knowledge, none of these has been previously applied to the family of DP algorithms considered in this work.

A systematic study of Dynamic Programming was introduced by Bellman [4], and it has since become a popular problem-solving methodology. Galil and Park [27] provide a survey and classification of popular dynamic programming algorithms and optimization methods. Several results have been presented in the literature aimed at optimizing the running time of Dynamic Programming algorithms [28, 33, 46, 27]. Notably, Valiant [45] showed that context-free recognition can be carried out as fast as Boolean matrix multiplication.

Cherng and Ladner [13] provide a cache-oblivious divide-and-conquer algorithm and a cache-aware blocked algorithm for simple dynamic programs, a class that includes matrix chain multiplication and the construction of optimal binary search trees. Their approach is based on Valiant’s algorithm [45] and incurs $\mathcal{O}\left(\frac{n^3}{B\sqrt{M}}\right)$ cache misses. In [16, 17], Chowdhury and Ramachandran introduce the Gaussian Elimination Paradigm (GEP) that provides cache-oblivious algorithms for problems including all-pairs shortest path. The authors also provide a way to transform simple DP problems into GEP problems. A proof of the optimality of the proposed algorithm for GEP problems is provided based on a similar result given in [31] for the matrix multiplication problem. As we show, this result does not extend to simple DP algorithms, as fewer cache misses can be achieved in some scenarios. Cache-oblivious DP algorithms for other problems have been explored [36, 19, 22]. There has also been work on parallelizing cache-efficient DP algorithms [18, 14, 30, 44, 43, 25, 23, 12], experimental evaluations of the resulting performance improvement [38], and on the automated discovery of cache-efficient DP algorithms [15, 29].

Dunlop et al. [24] provide an empirical analysis of various strategies to reduce the number of accesses to the grammar for the CYK algorithm. There have also been other works on modifying a grammar representation to improve efficiency. Notably, Song et al. [42] provide an empirical analysis of different grammar binarization approaches to improve efficiency, while Lange et al. [35] argue for the merits of a modified version of CYK that works with grammars in binary normal form instead of Chomsky normal form.

Summary of results.

In this work, we analyze the I/O complexity of a family of DP algorithms that follow the general strategy outlined in Prototype Algorithm 1 when executed in a two-level storage hierarchy with M words of cache. The analyzed algorithms share a similar structure of dependence on the results of subproblems while allowing for diverse implementation choices and optimal problem substructure.

We analyze both schedules in which no intermediate values are computed more than once (no-recomputation schedules) and general ones that allow for recomputation. For both settings, we derive an $\Omega\left(\frac{n^3}{\sqrt{M}B}\right)$ lower bound to the number of I/O operations required by sequential execution of these algorithms on an input of size $n$ in a two-level storage hierarchy with a cache of size $M \le cn$, for some positive constant value $c$, and such that $B \ge 1$ memory words can be moved between consecutive memory locations of the cache and of the slow memory with a single I/O operation. However, we show that while no-recomputation schedules require $\Omega\left(\frac{n^3}{\sqrt{M}B}\right)$ I/O operations even when using a cache of size at least $2n$ and $o(n^2)$, schedules using recomputation achieve an asymptotic reduction of the number of required I/O operations by a factor $\Theta(n^2/\sqrt{M})$. This is particularly significant as in many cases of interest (e.g., Matrix Multiplication [2, 5, 31], Fast Fourier Transform [31], Integer Multiplication [6]) recomputation has been shown to enable a reduction of the I/O cost by at most a constant multiplicative factor. These results are obtained by analyzing a representation of the considered algorithms as Computational Directed Acyclic Graphs (CDAGs) whose structure captures the dependence between subproblems common to the DP algorithms considered in our work. We provide a general construction of such CDAGs and analyze their internal connection properties.

We refine our general method to obtain an $\Omega\left(\frac{n^3\Gamma}{\sqrt{M}B}\right)$ lower bound for the I/O complexity of the Cocke-Younger-Kasami algorithm when deciding whether an input string of length $n$ is a member of the language generated by a given Context-Free Grammar with $\Gamma$ rules with distinct right-hand sides not including terminals. While the CYK algorithm exhibits a structure of subproblem dependencies similar to that of the previously analyzed problems, it presents several challenging differences, which we address by modifying our lower-bound methods and further refining the CDAG representation. Finally, we present a cache-oblivious implementation of the CYK algorithm for which, depending on the composition of the rules of the considered CFG, the number of executed I/O operations almost asymptotically matches the lower bound.

2 Preliminaries

We consider a family of Dynamic Programming (DP) algorithms following the general strategy outlined in Prototype Algorithm 1: Given an input of size $n$, the solution, denoted as $S(1,n)$, is computed bottom-up. First, the values corresponding to the solutions of subproblems of input size one, $S(i,i)$ for $i\in\{1,2,\ldots,n\}$, are initialized to some set default value. The results of subproblems corresponding to parts of the input of growing size $\ell$, that is, $S(i,i+\ell)$ for $i\in\{1,2,\ldots,n\}$ and $\ell\in\{1,2,\ldots,n-i\}$, are computed by first evaluating the combinations of pairs of subproblems $S(i,i+k)$ and $S(i+k+1,i+\ell)$, for $k\in\{0,1,\ldots,\ell-1\}$, and then composing them according to the optimal subproblem structure used in the algorithm.

Algorithm 1 Prototype DP algorithm 𝒜.

Depending on the specific problem, the default initialization values (INITIALIZATION_VALUE in the Prototype Algorithm) of $S(i,i)$, the way that the pairs $S(i,i+k)$ and $S(i+k+1,i+\ell)$ are combined (COMBINE), and the way these results are composed to evaluate $S(i,i+\ell)$ (AGGREGATE) may differ. However, AGGREGATE is assumed to be commutative and associative. Besides these differences, all the considered algorithms share the subproblem dependence structure outlined in the Prototype Algorithm, for which a visual representation is provided in Figure 1(a). Examples of algorithms following such a structure include the classic DP algorithms for the matrix chain multiplication problem, optimal convex polygon triangulation, the construction of optimal binary search trees, and the Cocke-Younger-Kasami algorithm (Algorithm 5).
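To make the structure concrete, the following minimal Python sketch mirrors the prototype just described (the actual pseudocode is given as Algorithm 1; here `init_value`, `combine`, and `aggregate` are hypothetical stand-ins for INITIALIZATION_VALUE, COMBINE, and AGGREGATE, and any index-dependent terms that a specific instantiation may add to COMBINE are elided):

```python
def prototype_dp(n, init_value, combine, aggregate):
    """Sketch of Prototype Algorithm A: computes S(1, n) bottom-up.

    combine(left, right) maps a pair of subproblem solutions to one
    candidate value (COMBINE); aggregate(candidates) folds an iterable
    of candidates into one value (AGGREGATE), assumed commutative and
    associative. S[(i, j)] holds the solution for input segment i..j.
    """
    S = {(i, i): init_value(i) for i in range(1, n + 1)}
    for l in range(1, n):                  # segments of growing size l
        for i in range(1, n - l + 1):
            j = i + l
            # one candidate per split point k in {0, ..., l-1}
            S[(i, j)] = aggregate(
                combine(S[(i, i + k)], S[(i + k + 1, j)])
                for k in range(l))
    return S[(1, n)]
```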

We provide a general analysis of the I/O complexity of algorithms following the structure of the Prototype Algorithm 𝒜, and we then refine our analysis for the mentioned algorithms of interest.

Computational Directed Acyclic Graphs.

Our analysis is based on modeling the execution of the DP algorithms of interest as a Computational Directed Acyclic Graph (CDAG) $G=(V,E)$. Each vertex $v\in V$ represents either an input value or the result of an operation (i.e., an intermediate result or one of the output values) which is stored using a single memory word. The directed edges in $E$ represent data dependencies. That is, vertices $u,v\in V$ are connected by an edge $(u,v)$ directed from $u$ to $v$ if and only if the value corresponding to $u$ is an operand of the unit-time operation computing the value corresponding to $v$. Vertex $u$ is said to be a predecessor of $v$, and $v$ is said to be a successor of $u$. For a given vertex $v$, the set of its predecessors (resp., successors) is defined as $pre(v)=\{u\in V\ \text{s.t.}\ (u,v)\in E\}$ (resp., $suc(v)=\{u\in V\ \text{s.t.}\ (v,u)\in E\}$). We refer to vertices with no predecessors (resp., successors) as the input (resp., output) vertices of a CDAG. We say that $G'=(V',E')$ is a sub-CDAG of $G=(V,E)$ if $V'\subseteq V$ and $E'\subseteq E\cap(V'\times V')$. We say that two sub-CDAGs $G'=(V',E')$ and $G''=(V'',E'')$ of $G$ are vertex-disjoint if $V'\cap V''=\emptyset$.

I/O model.

We assume that sequential computations are executed on a system with a two-level memory hierarchy, consisting of a cache of size $M$ (measured in memory words) and a slow memory of unlimited size. An operation can be executed only if all its operands are in the cache. We assume that the processor is equipped with standard elementary logic and algebraic operations. Data can be moved from the slow memory to the cache by read operations and, in the other direction, by write operations. These operations are called I/O operations. We assume that the input data is stored in the slow memory at the beginning of the computation. The evaluation of a CDAG in this model can be analyzed using the “red-blue pebble game” [31]. The number of I/O operations executed when evaluating a CDAG depends on the “computational schedule”, that is, on the order in which vertices are evaluated and on which values are kept in or discarded from the cache. The I/O complexity of a CDAG $G$ corresponding to the execution of algorithm $\mathcal{A}$ for an input of size $n$, denoted as $IO_{\mathcal{A}}(M)$, is defined as the minimum number of I/O operations over all possible computational schedules. We further consider a generalization of this model, known as the “External Memory Model” of Aggarwal and Vitter [1], where $B\ge 1$ memory words can be moved between consecutive memory locations of the cache and of the slow memory with a single I/O operation.

3 CDAG construction

(a) Representation of the subproblem dependency structure for the Prototype Algorithm and other DP algorithms considered in this work for an input of size n=4.
(b) Construction of $G$ from the simplified DAG. Root vertices are shown in black and leaves belonging to subproblem $S(1,4)$ are shown in green.
Figure 1: Construction of CDAG in 𝒢(n).

In this section, we give the construction of a family of CDAGs representing the execution of a class of DP algorithms exhibiting a subproblem optimality structure such as the one outlined for Prototype Algorithm 1.

Given input size $n$, the CDAG $G$ corresponding to the execution of $\mathcal{A}$ is constructed using the CDAG in Figure 1(a) as the base structure. The actual CDAG is obtained by refining the basic structure, replacing each vertex corresponding to a subproblem $S(i,j)$ with a directed binary tree DAG with $j-i$ leaves and all edges directed towards the root. The root vertex (R-vertex for short) of the tree DAG, henceforth referred to as $v_{i,j}$, corresponds to the computation of the solution of the associated subproblem $S(i,j)$. The $k$-th “leaf vertex” (L-vertex for short) of the tree, for $k\in\{0,\ldots,j-i-1\}$, results from the combination of subproblems $S(i,i+k)$ and $S(i+k+1,j)$, and has as predecessors the R-vertices of the tree DAGs corresponding, respectively, to $S(i,i+k)$ and $S(i+k+1,j)$ (i.e., $v_{i,i+k}$ and $v_{i+k+1,j}$); a visual representation is provided in Figure 1(b). Notice that binary trees corresponding to different R-vertices are vertex-disjoint. For each such tree DAG, we say that its $j-i$ L-vertices belong to the root $v_{i,j}$.

Our lower bound results hold for any possible structure of these tree DAGs. This is a feature of our model meant to accommodate a variety of possible implementations of $\mathcal{A}$ and of any other DP algorithm sharing the same substructure.

As an example, for the classic DP algorithm for the matrix chain multiplication problem (presented in the appendix of the extended online version [21]), the root vertex of each tree sub-CDAG $S(i,j)$ corresponds to the computation of the minimum number of operations required to compute the chain product of the matrices from the $i$-th to the $j$-th, and the $k$-th leaf, for $k\in\{0,\ldots,j-i-1\}$, corresponds to the computation of $S(i,i+k)+S(i+k+1,j)+d_{i-1}d_{i+k}d_j$.
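As an illustration (a hypothetical sketch, not the pseudocode from [21]), this instantiation fits the prototype with AGGREGATE = min and a COMBINE that adds the product of the relevant matrix dimensions:

```python
def mcm_cost(d):
    """Matrix chain multiplication as an instance of the prototype:
    matrix i has dimensions d[i-1] x d[i]; the k-th leaf of S(i, j)
    computes S(i, i+k) + S(i+k+1, j) + d[i-1]*d[i+k]*d[j]."""
    n = len(d) - 1
    S = {(i, i): 0 for i in range(1, n + 1)}
    for l in range(1, n):
        for i in range(1, n - l + 1):
            j = i + l
            S[(i, j)] = min(S[(i, i + k)] + S[(i + k + 1, j)]
                            + d[i - 1] * d[i + k] * d[j]
                            for k in range(l))
    return S[(1, n)]
```

For example, `mcm_cost([10, 20, 5, 30])` evaluates to 2500, the cost of parenthesizing the chain as $(A_1A_2)A_3$.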

We denote by $\mathcal{G}(n)$ the family of CDAGs constructed in this way, with $n$ input vertices and differing in the tree sub-DAGs used to accumulate the values associated with the L-vertices into the values associated with the corresponding R-vertices. This family of CDAGs captures the dependence structure between subproblems common to the family of DP algorithms that are the focus of our analysis.

Analysis of CDAG properties.

Given a CDAG $G$ from the family $\mathcal{G}(n)$, let $R$ (resp., $L$) denote the set of its R-vertices (resp., L-vertices). Then, $|R|=\frac{n(n+1)}{2}$ and $|L|=\frac{n^3-n}{6}$.

The “$i$-th row” (resp., the “$j$-th column”) of $G$, written $r_i$ (resp., $c_j$), is defined as the subset of $R$ including all of the R-vertices $v_{i,j}$, each corresponding to the computation of subproblem $S(i,j)$ for $j\in\{i,\ldots,n\}$ (resp., $i\in\{1,\ldots,j\}$). For any $v\in R$, if $v\in r_i$ and $v\in c_j$, then $i\le j$. By construction, the rows $(r_1,r_2,\ldots,r_n)$ (resp., the columns $(c_1,c_2,\ldots,c_n)$) partition $R$, and $|r_i|=n-i+1$ (resp., $|c_i|=i$) for $1\le i\le n$. A visual representation of rows and columns can be found in Figure 2(a).

(a) Rows and columns.
(b) W-cover.
Figure 2: Representation of rows, columns, and the W-cover for input size n=4.

By construction, each L-vertex has a distinct pair of R-vertices as its predecessors. Such a pair is said to interact in the computation. Given $v_{i,j}\in R$, it interacts with all and only the vertices $v_{k,i-1}$ for $1\le k\le i-1$, and the vertices $v_{j+1,l}$ for $j+1\le l\le n$. All such $n+i-j-1$ interactions correspond to the computation of an L-vertex belonging to a distinct sub-tree corresponding to the subproblem $S(k,j)$ or $S(i,l)$ (i.e., to the R-vertex $v_{k,j}$ or $v_{i,l}$). Pairs of vertices in the same row (or column) do not interact.

We define $W_i=r_{i+1}\cup c_i$, for $1\le i\le n-1$, as the subsets of $R$ such that each vertex in $c_i$ interacts with all the vertices of $r_{i+1}$ (a visual representation can be found in Figure 2(b)). By construction, $|W_i|=n$. A vertex $v_{1,j}\in r_1$ (resp., $v_{i,n}\in c_n$) is only a member of $W_j$ (resp., $W_{i-1}$). All remaining vertices $v_{i,j}\in R$ are members of both $W_j$ and $W_{i-1}$, which are distinct since, by construction, for any $v_{i,j}\in R$ we have $i\le j$. We refer to the collection of sets $(W_1,W_2,\ldots,W_{n-1})$ as the “W-cover” of the R-vertices of CDAG $G$. By its definition, and by the properties of the interactions between vertices according to their placement in rows and columns discussed in the previous paragraphs, we have the following property:

Lemma 1.

Given a CDAG $G\in\mathcal{G}(n)$, let $W_1,W_2,\ldots,W_{n-1}$ denote the W-cover of its R-vertices. For any set $X\subseteq W_i$, there are at most $\frac{|X|^2}{4}\le\frac{n^2}{4}$ pairs of vertices $(u,v)$, where $u\in r_{i+1}$ and $v\in c_i$, that interact as predecessors of distinct L-vertices, each of which belongs to a distinct R-vertex.

4 I/O analysis for computations of CDAGs in 𝓖(𝒏)

We analyze the I/O complexity of CDAGs in the class $\mathcal{G}(n)$ according to the red-blue pebble game of Hong and Kung [31]. To simplify the presentation, we assume that $n$ and $M$ are powers of 2, i.e., $n=2^i$ and $M=2^j$ for integers $i,j\ge 3$. If that is not the case, our analysis can be adapted with minor, albeit tedious, modifications resulting in different constant terms.

4.1 I/O lower bound for computations with no recomputation

In this section, we present a lower bound to the I/O complexity of algorithms corresponding to CDAGs in $\mathcal{G}(n)$, assuming that no intermediate value is ever recomputed (i.e., under the “no-recomputation assumption”). We first introduce a useful technical lemma:

Lemma 2.

Given $n\in\mathbb{N}^+$ non-negative real numbers $a_1,a_2,\ldots,a_n$ such that $\sum_{i=1}^{n}a_i^2\ge T$, where $T\in\mathbb{R}^+$, and $\max_{i=1}^{n}a_i^2\le L$, where $L\in\mathbb{R}^+$ and $L<T$, then $\sum_{i=1}^{n}a_i\ge T/\sqrt{L}$.

Proof.

The lower bound comes from the intuitive strategy of making each $a_i$ as large as possible: if $q=\lfloor T/L\rfloor$, then assign $a_1,\ldots,a_q=\sqrt{L}$ and $a_{q+1}=\sqrt{T-qL}$. In this case,

$$\sum_{i=1}^{q+1}a_i\ =\ q\sqrt{L}+\sqrt{T-qL}\ \ge\ q\sqrt{L}+\left(\frac{T}{L}-q\right)\sqrt{L}\quad\left(\text{as } \frac{T}{L}-q<1\right)\ =\ \frac{T}{\sqrt{L}}.$$

To show the optimality of this strategy, we show that any other configuration is suboptimal.

Consider any other assignment of $a_1,\ldots,a_n$ that is not just a permutation of the same values. By assumption, any such assignment must have at least two non-zero values $a_i,a_j<\sqrt{L}$. Let $k=a_i+a_j$. If $k\le\sqrt{L}$, since $k^2>a_i^2+a_j^2$, there exists some $c\in\mathbb{R}^+$ such that $c^2=a_i^2+a_j^2$ and $c<a_i+a_j$, showing that our assignment $a_1,\ldots,a_n$ does not minimize $\sum_{i=1}^{n}a_i$.

Alternatively, if $k>\sqrt{L}$, then $a_i^2+a_j^2=a_i^2+(k-a_i)^2<L+(k-\sqrt{L})^2$ (the last step follows from the fact that $f(x)=x^2+(k-x)^2$ defines an upwards-facing parabola with axis of symmetry at $x=\frac{k}{2}$). This means that there exists some $c<k-\sqrt{L}$ such that $L+c^2=a_i^2+a_j^2$ and $\sqrt{L}+c<a_i+a_j$, again showing that our assignment $a_1,\ldots,a_n$ does not minimize $\sum_{i=1}^{n}a_i$.
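As a worked instance of Lemma 2 (with illustrative numbers), take $T=8$ and $L=4$, so that each $a_i\le\sqrt{L}=2$: the strategy above sets $q=\lfloor T/L\rfloor=2$ and assigns $a_1=a_2=2$, giving $\sum_i a_i^2=8=T$ and $\sum_i a_i=4=T/\sqrt{L}$, whereas spreading the same square mass over smaller values, e.g., $a_1=\cdots=a_8=1$, yields the strictly larger sum $\sum_i a_i=8$.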

The following property captures the relation between subsets of L-vertices and their number of distinct R-vertex predecessors:

Lemma 3.

Let $L'\subseteq L$ be such that the vertices in $L'$ belong to the tree sub-CDAGs of at most $0<\rho\le|L'|$ distinct R-vertices. The set $R'\subseteq R$ containing all R-vertices that are predecessors of at least one vertex in $L'$ has cardinality at least $|L'|/\sqrt{\rho}$.

Proof.

Consider the W-cover of $G$ as defined in Section 3. As any R-vertex can be a member of at most two distinct $W_i$’s, we have:

$$|R'|\ \ge\ \frac{1}{2}\sum_{i=1}^{n-1}|R'\cap W_i|. \qquad (1)$$

By definition, each vertex in $L'$ has a distinct pair of R-vertex predecessors which, by construction, are both included in exactly one set $W_i$. Let $a_i$ be the number of vertices from $L'$ with both predecessors in $W_i$. By the assumptions, we have:

$$\sum_{i=1}^{n-1}a_i=|L'|. \qquad (2)$$

By Lemma 1, for any $W_i$, the interacting pairs of R-vertices in it are the predecessors of distinct L-vertices, each belonging to the tree sub-CDAG of a distinct R-vertex. From the assumption that the vertices in $L'$ belong to at most $\rho$ distinct roots, we have:

$$\max_{i=1,\ldots,n-1}a_i\ \le\ \rho. \qquad (3)$$

Furthermore, from Lemma 1, we know that the vertices in $R'\cap W_i$ can interact to produce at most $|R'\cap W_i|^2/4$ vertices in $L'$. Thus, $|R'\cap W_i|^2/4\ge a_i$ for each $i$, and hence:

$$\sum_{i=1}^{n-1}|R'\cap W_i|\ \ge\ \sum_{i=1}^{n-1}2\sqrt{a_i}. \qquad (4)$$

By (1), (4), (2), (3), and Lemma 2 (in this order), we have:

$$|R'|\ \ge\ \frac{1}{2}\sum_{i=1}^{n-1}|R'\cap W_i|\ \ge\ \sum_{i=1}^{n-1}\sqrt{a_i}\ \ge\ \frac{|L'|}{\sqrt{\rho}}.$$

Lemma 3 allows us to claim that, in order to compute a set of L-vertices, it is necessary to access a minimum number of R-vertices which, due to the no-recomputation assumption, must either be in the cache or be loaded into it using read I/O operations.

Theorem 4.

For any CDAG $G=(V,E)\in\mathcal{G}(n)$, let $\mathcal{A}$ denote an algorithm corresponding to it. The I/O complexity of algorithm $\mathcal{A}$, when run without recomputing any intermediate value, using a cache memory of size $M$, and where for each I/O operation it is possible to move up to $B$ memory words stored in consecutive memory locations from the cache to slow memory or vice versa, is:

$$IO_G(n,M,B)\ \ge\ \max\left\{\frac{n^3-n}{16\sqrt{M}}-\frac{n(n+1)}{2}-3M,\ n\right\}\frac{1}{B} \qquad (5)$$
Proof.

We prove the result for B=1. The result then trivially generalizes for a generic B.

The fact that $IO_G(n,M,1)\ge n$ follows trivially from our assumption that the input is initially stored in the slow memory and must therefore be loaded into the cache at least once, using $n$ read I/O operations.

Let $\mathcal{C}$ be any computational schedule for the sequential execution of $\mathcal{A}$ using a cache of size $M$ in which no value corresponding to any vertex of $G$ is computed more than once. We partition $\mathcal{C}$ into non-overlapping segments $\mathcal{C}_1,\mathcal{C}_2,\ldots$ such that during each $\mathcal{C}_i$ exactly $8M^{1.5}$ values corresponding to distinct L-vertices, denoted as $L_i$, are evaluated. Since $|L|=(n^3-n)/6$, there are $\left\lfloor\frac{n^3-n}{48M^{1.5}}\right\rfloor$ such segments. Given $L_i$, we refer to the set of R-vertices to which at least one of the vertices in $L_i$ belongs as $R_i$; these correspond to vertices which are active during $\mathcal{C}_i$. For each segment, we denote as $h_i$ the number of R-vertices whose value is completely evaluated during $\mathcal{C}_i$. As no value is computed more than once, $\sum_i h_i\le|R|=n(n+1)/2$. Let $g_i$ denote the number of I/O operations executed during $\mathcal{C}_i$. We consider two cases:

  • The vertices of $L_i$ belong to at least $4M+1$ distinct R-vertices: for each of the at least $4M+1-h_i$ values corresponding to R-vertices that are only partially computed during $\mathcal{C}_i$, a partial accumulator of the values corresponding to their leaves computed during $\mathcal{C}_i$ must either be in the cache at the end of $\mathcal{C}_i$ or be written to slow memory by means of a write I/O operation. Hence, $g_i\ge 4M+1-M-h_i\ge 3M-h_i$.

  • The vertices of $L_i$ belong to at most $4M$ distinct R-vertices: by Lemma 3, the vertices in $L_i$ have at least $4M$ R-vertex predecessors, of which at most $h_i$ are computed during $\mathcal{C}_i$. As, in the considered schedule, values are not computed more than once, each of the remaining $4M-h_i$ values corresponding to predecessor vertices of $L_i$ must either be in the cache at the beginning of $\mathcal{C}_i$ or be read into the cache by means of a read I/O operation. Hence, $g_i\ge 4M-M-h_i=3M-h_i$.

The I/O requirement of the algorithm is obtained by combining the cost of each non-overlapping segment:

$$IO_G(n,M,1)\ \ge\ \sum_{i=1}^{\left\lfloor\frac{n^3-n}{48M^{1.5}}\right\rfloor}g_i\ \ge\ \sum_{i=1}^{\left\lfloor\frac{n^3-n}{48M^{1.5}}\right\rfloor}(3M-h_i)\ \ge\ \left\lfloor\frac{n^3-n}{48M^{1.5}}\right\rfloor 3M-\sum_i h_i\ \ge\ \frac{n^3-n}{16\sqrt{M}}-3M-|R|.$$

By Theorem 4, we have that for $M\le cn^2$, for a sufficiently small constant $c>0$, computations of CDAGs in $\mathcal{G}(n)$ in which no value is recomputed require $\Omega\left(\frac{n^3}{\sqrt{M}B}\right)$ I/O operations.

On the tightness of the bound

In [13], Cherng and Ladner presented Valiant’s DP-Closure Algorithm, henceforth referred to as VDP, which computes algorithms following the general structure outlined in Prototype Algorithm 1 cache-obliviously and without recomputation. When $M<|R|=\frac{n(n+1)}{2}$, VDP executes $\mathcal{O}\left(\frac{n^3}{\sqrt{M}B}\right)$ I/O operations, and for $M\ge|R|$ it executes $\mathcal{O}(n/B)$ I/O operations. Thus, we can conclude that our I/O lower bound for schedules with no recomputation, provided in Theorem 4, is asymptotically tight.

VDP utilizes a divide-and-conquer approach, making recursive calls on smaller subproblems. Here we give a simplified presentation of Valiant’s DP-Closure Algorithm; please refer to Cherng and Ladner [13] for a detailed presentation.

Consider any algorithm following Prototype Algorithm 1 with an input of size $n\ge 1$, where we assume $n=2^a$ for some $a\in\mathbb{N}$ (this can always be made the case by padding $\mathbf{X}$). All subproblems $S(i,j)$ computed by the algorithm are associated with entries of an $n\times n$ matrix $\mathbf{X}$, so that $S(i,j)$ is represented at $\mathbf{X}[i,j+1]$. Initially, the values $\mathbf{X}[i,i+1]$ corresponding to the values $S(i,i)$ for $1\le i\le n-1$ are computed from the input, while the remaining entries of $\mathbf{X}$ are set to the LEAST_OPTIMAL_VALUE (e.g., 0). To fully compute all the subproblems (i.e., to compute the DP-closure), VDP splits $\mathbf{X}$ into 16 equally-sized submatrices of dimension $\frac{n}{4}\times\frac{n}{4}$, labeled as follows:

$$\mathbf{X}=\begin{bmatrix}\mathbf{X}_{11}&\mathbf{X}_{12}&\mathbf{X}_{13}&\mathbf{X}_{14}\\ &\mathbf{X}_{22}&\mathbf{X}_{23}&\mathbf{X}_{24}\\ & &\mathbf{X}_{33}&\mathbf{X}_{34}\\ & & &\mathbf{X}_{44}\end{bmatrix}$$

We omit the representation of the lower-triangular submatrices, as these have all of their entries set to 0. Valiant’s DP-Closure Algorithm (whose pseudocode is presented in Algorithm 2) is given the matrix $\mathbf{X}$ as input, with the only non-zero values corresponding to the subproblems $S(i,i)$ for $1\le i\le n-1$. The values of the $S(i,j)$ (i.e., of the R-vertices) corresponding to the top-left and bottom-right submatrices are computed through recursive calls, exploiting the fact that these submatrices correspond to smaller instantiations of the same DP problem. The remaining values in $\mathbf{X}$ are then computed by invoking a subroutine called “Valiant’s Star Algorithm” (VSTAR).

Algorithm 2 Valiant’s DP-Closure Algorithm (VDP).

VSTAR is given as input a matrix $\mathbf{X}$ in which the subproblems in the top-left and bottom-right quadrants have already been computed. The algorithm then recursively computes the values corresponding to the remaining entries of $\mathbf{X}$ (i.e., the values of the R-vertices). The pseudocode for this function can be found in Algorithm 3. We represent COMBINE with “$\otimes$” and AGGREGATE with “+”. Operations involving these operators in lines 6, 8, 10, and 11 are computed using a subroutine called “Matrix multiply and accumulate” that, given three equally-sized square matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$, computes $\mathbf{A}+\mathbf{B}\otimes\mathbf{C}$. In the context of VDP, $\mathbf{A}$ contains partially computed values $S(i,j)$ (i.e., corresponding to R-vertices), and $\mathbf{B}\otimes\mathbf{C}$ represents the computation of combinations of subproblems contributing to $S(i,j)$ (i.e., $S(i,k)\otimes S(k+1,j)$ for $i\le k<j$), that is, the values corresponding to the L-vertices, which are then used to update $\mathbf{A}$. A recursive in-place implementation of this function is presented in detail in [13, 26].

Algorithm 3 Valiant’s Star Algorithm (VSTAR).
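The pseudocode of Algorithms 2 and 3 is available in [13]. Since the actual work of VSTAR reduces to calls to Matrix multiply and accumulate, a minimal Python sketch of that kernel may help fix ideas; it is an assumed simplification of the in-place recursive implementation of [13, 26] (with `n` a power of 2, `combine`/`aggregate` standing for $\otimes$/+, and `aggregate` used here as a binary accumulation operator):

```python
def mma(A, B, C, combine, aggregate, n, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0):
    """Cache-oblivious 'Matrix multiply and accumulate' sketch:
    accumulates B (x) C into the n x n block of A at offset (ai, aj).
    The quadrant recursion shrinks the working set until the three
    blocks involved fit in the cache, which is the source of the
    O(n^3 / (B sqrt(M))) cache-miss count quoted above."""
    if n == 1:
        A[ai][aj] = aggregate(A[ai][aj], combine(B[bi][bj], C[ci][cj]))
        return
    h = n // 2
    for p in (0, h):          # row half of A and B
        for q in (0, h):      # column half of A and C
            for r in (0, h):  # half of the inner (split) dimension
                mma(A, B, C, combine, aggregate, h,
                    ai + p, aj + q, bi + p, bj + r, ci + r, cj + q)
```

With `combine = operator.mul` and `aggregate = operator.add` this is ordinary matrix multiply-and-accumulate; VDP instead instantiates the two operators with the COMBINE and AGGREGATE of the problem at hand.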

4.2 I/O lower bound for computations with allowed recomputation

We extend our I/O lower bound analysis to general computations in which values may be recomputed by combining it with the dominator I/O lower bound technique by Hong and Kung [31].

Definition 5 ([31]).

Given $G=(V,E)$, a set $D\subseteq V$ is a dominator set for $V'\subseteq V$ if every path directed from any input vertex of $G$ to a vertex in $V'$ contains a vertex of $D$.

Dominator sets of R-vertices of a CDAG $G\in\mathcal{G}(n)$ must satisfy the following property:

Lemma 6.

Given $G=(V,E)\in\mathcal{G}(n)$, let $A=\{v_{i_1,j_1},v_{i_2,j_2},\ldots\}$ be a subset of R-vertices, where $v_{i_k,j_k}$ is the R-vertex corresponding to subproblem $S(i_k,j_k)$. Any dominator set $D$ of $A$ has cardinality $|D|\ge\min\left(|A|,\min_k(j_k-i_k)\right)$.

Proof.

Suppose that, for every $v_{i,j}\in A$, $D$ contains a vertex internal to the tree-CDAG rooted at $v_{i,j}$. As the tree-CDAGs of different R-vertices are disjoint, this would imply $|D|\ge|A|$.

Otherwise, there must be some $v_{i,j}\in A$ such that $D$ does not contain any of its internal vertices. Let $L_{i,j}=\{l_0,l_1,\ldots,l_{j-i-1}\}$ denote the set of L-vertices corresponding to $v_{i,j}$. It is sufficient to show that there are $j-i$ vertex-disjoint paths from the inputs of $G$ to vertices in $L_{i,j}$. By construction, each leaf $l_k\in L_{i,j}$ has a predecessor belonging to a unique column $c_{i+k}$. In turn, each of these predecessors is connected through its internal tree-CDAG to the input vertex in the same column ($v_{i+k,i+k}$). In this manner, $j-i$ paths $(v_{i+k,i+k}\rightsquigarrow v_{i,i+k}\to l_k)$ for $0\le k\le j-i-1$ are obtained. Since each path contains R-vertices from a different column, and the tree-CDAGs of different R-vertices are disjoint, the paths outlined above are vertex-disjoint.

Theorem 7.

For any CDAG $G=(V,E)\in\mathcal{G}(n)$, let $\mathcal{A}$ denote an algorithm corresponding to it. The I/O complexity of $\mathcal{A}$, when run using a cache memory of size $M$, and where for each I/O operation it is possible to move up to $B$ memory words stored in consecutive memory locations from cache to slow memory or vice versa, is:

$$IO_G(n,M,B)\ \ge\ \max\left\{\frac{(n-6M-1)^3}{18\sqrt{M}}-2M,\ n\right\}\frac{1}{B} \qquad (6)$$
Proof.

To simplify our analysis, we only consider parsimonious execution schedules, i.e., schedules in which, each time an intermediate result is computed, it is used to compute at least one of the values of which it is an operand before being removed from memory (either the cache or slow memory). Any non-parsimonious schedule $\mathcal{C}$ can be reduced to a parsimonious schedule $\mathcal{C}'$ by removing all the steps that violate the definition of parsimonious computation; $\mathcal{C}'$ therefore has no more computational or I/O operations than $\mathcal{C}$. Hence, restricting the analysis to parsimonious computations leads to no loss of generality.

We prove the result for the case $B=1$. The result then trivially generalizes to a generic $B$. The fact that $IO_{\mathcal{A}}(n,M,1)\ge n$ follows trivially from our assumption that the input is initially stored in slow memory and must therefore be loaded into the cache at least once, using $n$ read I/O operations.

Let $L(x)$ be the set of L-vertices in $G$ whose predecessor R-vertices both correspond to subproblems $S(i,j)$ such that $x\le j-i$. By construction, for each subproblem $S(i,j)$ the corresponding tree sub-CDAG has $\max\{0,\,j-i-2x\}$ such L-vertices. Since the sub-CDAGs corresponding to each $S(i,j)$ do not share vertices, we have:

$$|L(x)| = \sum_{j-i=1}^{n-1}\big(n-(j-i)\big)\max\{0,\,j-i-2x\} = \sum_{j-i=2x+1}^{n-1}\big(n-(j-i)\big)\big(j-i-2x\big) = \sum_{y=1}^{n-1-2x}(n-2x-y)\,y = \frac{(n-2x-1)(n-2x)(n-2x+1)}{6} > \frac{(n-2x-1)^3}{6}.$$

Let $\mathcal{C}$ be any computational schedule for the sequential execution of $\mathcal{A}$ using a cache of size $M$. We partition $\mathcal{C}$ into segments $\mathcal{C}_1,\mathcal{C}_2,\ldots$ such that during each $\mathcal{C}_l$ exactly $6M^{1.5}$ values corresponding to distinct vertices in $L(3M)$ are computed from their operands (i.e., not read from slow memory). Let $L_l$ denote the set of vertices corresponding to these values. Below, we argue that the number $g_l$ of I/O operations executed during each of the $\left\lfloor|L(3M)|/6M^{1.5}\right\rfloor$ non-overlapping segments $\mathcal{C}_l$ is at least $2M$, whence the statement follows.

Let $A$ denote the set of vertices corresponding to the values computed during $\mathcal{C}_l$. Clearly $L_l\subseteq A$. Let $D$ denote the set of vertices corresponding either to the at most $M$ values stored in the cache at the beginning of $\mathcal{C}_l$, or to the values read into the cache during $\mathcal{C}_l$ by means of read I/O operations. Thus $g_l\ge|D|-M$. In order for it to be possible to compute the values corresponding to vertices in $A$ during $\mathcal{C}_l$, every path from an input vertex of $G$ to a vertex in $A$ must contain a vertex of $D$; that is, $D$ must be a dominator set of $A$.

We refer to the set of R-vertices to which at least one of the vertices in $L_l$ belongs as $R_l$. These correspond to vertices that are active during $\mathcal{C}_l$. We consider three mutually exclusive cases:

  • (a) At least one of the values corresponding to a vertex $v\in R_l$ is entirely computed during $\mathcal{C}_l$: that is, no accumulator of the values corresponding to the leaves of $v$ is in the cache, or loaded into it by means of a read I/O operation, during $\mathcal{C}_l$. Thus $v\in A$, and $D$ does not include any non-leaf vertex internal to the tree sub-CDAG rooted in $v$. Since $L_l\subseteq L(3M)$, by definition, $v$ corresponds to a subproblem $S(i,j)$ such that $j-i\ge 3M$. From Lemma 6, $|D|\ge 3M$, which implies $g_l\ge 2M$.

  • (b) $|R_l|>4M$: as none of the values corresponding to $R_l$ is entirely computed during $\mathcal{C}_l$ (otherwise case (a) applies), and as we are considering parsimonious computations, for each vertex in $R_l$ at least one partial accumulator of its value is either in the cache at the beginning (resp., end) of $\mathcal{C}_l$, or is loaded into the cache (resp., saved to the slow memory) during it. Hence, $g_l\ge 4M-2M=2M$.

  • (c) $|R_l|\le 4M$: by Lemma 3, the vertices in $L_l$ have at least $3M$ distinct R-vertex predecessors. Since $L_l\subseteq L(3M)$, by definition, all such R-vertices correspond to subproblems $S(i,j)$ with $j-i\ge 3M$. As the values of these vertices are either in the cache at the beginning of $\mathcal{C}_l$, read into the cache during $\mathcal{C}_l$, or computed during $\mathcal{C}_l$, $D$ must be a dominator set for the corresponding vertices. From Lemma 6, we can thus conclude that $|D|\ge 3M$, whence $g_l\ge 2M$.

By Theorem 7, we have that for $M\le c_1n$, where $c_1$ is a sufficiently small positive constant, computations of CDAGs in $\mathcal{G}(n)$ require $\Omega\left(\frac{n^3}{\sqrt{M}B}\right)$ I/O operations. This is in contrast with the result given in Theorem 4 for schedules with no recomputation, which exhibit I/O complexity $\Omega\left(\frac{n^3}{\sqrt{M}B}\right)$ for values of $M$ up to $c_2n^2$ for an appropriately chosen constant $c_2$.

On the tightness of the bound.

When allowing recomputation, if $M<c_1n$, where $c_1$ is a sufficiently small positive constant, the number of I/O operations executed by Valiant’s DP-Closure Algorithm asymptotically matches our lower bound in Theorem 7. Thus, for $M<c_1n$, allowing recomputation does not asymptotically change the required number of I/O operations. On the other hand, if $M\ge 2n$, our lower bound simplifies to $\Omega(n/B)$, diverging from the complexity of Valiant’s DP-Closure Algorithm.

In Theorem 8, we show that for any algorithm corresponding to a CDAG $G\in\mathcal{G}(n)$, given a cache memory of size $M\ge 2n$, there exists an execution schedule, in which the results of subproblems are computed according to a recursive top-down order, that executes only $\lceil n/B\rceil+1$ I/O operations, thus asymptotically matching the general lower bound given in Theorem 7. Crucially, such a schedule involves the recomputation of intermediate results in order to reduce the number of I/O operations otherwise necessary, as established in Theorem 4.

Algorithm 4 Top-down recursive computation schedule for an algorithm $\mathcal{A}$ corresponding to $G\in\mathcal{G}(n)$.
Theorem 8.

Using a cache memory of size $M\ge 2n$, there exists a computational schedule executing $\mathcal{A}$ that incurs $\lceil n/B\rceil+1$ I/O operations, where for each I/O operation it is possible to move up to $B$ memory words stored in consecutive memory locations from the cache to slow memory or vice versa.

Proof.

The proposed schedule follows the steps outlined in Algorithm 4: in the initialization phase, the $n$ input values are read into the cache using $\lceil n/B\rceil$ read I/O operations. These values are held in the cache until the very end of the schedule, occupying $n$ out of the $M\ge 2n$ available memory locations. Once the final result $S(1,n)$ is computed, it is written to the slow memory by means of a write I/O operation.

We argue that the invocation of the procedure COMPUTE(1,n) uses at most $n$ additional cache memory locations, thus not requiring the execution of any additional I/O operation. Let $\ell(j-i)$ denote the memory usage of an invocation of COMPUTE(i,j). In the base case, $\ell(1)=1$, as COMPUTE simply combines and returns the values $S(i,i)$ and $S(j,j)$, which are already stored in the cache after the initialization phase. In the general case, due to the use of accumulators, the sequence of recursive calls, and the memory management used by the schedule, we have:

$$\ell(j-i) = 1+\max\{\ell(j-i-1),\ 1+\ell(j-i-2)\} = 1+\ell(j-i-1) = j-i-1+\ell(1) = j-i.$$

The theorem follows, as $\ell(n-1)=n-1\le n$.

Interestingly, while the proposed computation schedule achieves an asymptotic polynomial reduction of the number of I/O operations, by a $\Theta\left(\frac{n^2}{\sqrt{M}}\right)$ multiplicative factor, using recomputation, this comes at the cost of an exponential increase in the number of computational operations executed ($2^{\mathcal{O}(n)}$), compared with Valiant’s DP-Closure Algorithm, which requires only $\mathcal{O}(n^3)$ operations.
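The following hypothetical Python rendering of the schedule of Algorithm 4 makes the trade-off concrete (`S_base` holds the $n$ cache-resident input values $S(i,i)$; `combine`/`aggregate` are the problem's COMBINE/AGGREGATE, with `aggregate` used as a binary operator):

```python
def compute(i, j, S_base, combine, aggregate):
    """Top-down schedule with recomputation (sketch of Algorithm 4).

    Only the inputs S(i, i) are kept resident; every other subproblem
    is recomputed on demand, so besides the inputs only an accumulator
    and one partial value per recursion level are live at any time
    (matching the l(j - i) <= j - i bound of Theorem 8), while the
    operation count satisfies T(m) >= 2 T(m - 1), i.e., 2^Theta(n)."""
    if i == j:
        return S_base[i]
    acc = None
    for k in range(j - i):                  # one L-vertex per split
        left = compute(i, i + k, S_base, combine, aggregate)
        right = compute(i + k + 1, j, S_base, combine, aggregate)
        leaf = combine(left, right)
        acc = leaf if acc is None else aggregate(acc, leaf)
    return acc
```

A single call `compute(1, n, ...)` thus realizes, as a sketch, the schedule's $\lceil n/B\rceil+1$ I/O operations once the inputs are cache-resident, trading I/O for the $2^{\mathcal{O}(n)}$ recomputation noted above.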

4.3 I/O analysis for notable Dynamic Programming algorithms

The general result in Section 4 can be adapted to derive asymptotically tight bounds on the I/O complexity of classic Dynamic Programming algorithms for notable problems such as Matrix Chain Multiplication, Optimal Polygon Triangulation, and the construction of optimal binary search trees. This is achieved by mapping each of these algorithms onto Prototype Algorithm 1, defining the COMBINE, AGGREGATE, and LEAST_OPTIMAL_VALUE appropriately. Due to space limitations, these results are presented in the online full version of this work [21, Appendix A].

5 I/O analysis for the Cocke-Younger-Kasami algorithm

In this section, we present I/O bounds for the Cocke-Younger-Kasami algorithm (CYK). First, we provide some background on the algorithm. We then discuss how our lower bound analysis method must be enhanced to better capture the complexity of this algorithm. Finally, we present an upper bound.

5.1 Background on the CYK algorithm

Definition 9 ([41]).

A Context-Free Grammar is a 4-tuple $(V,\mathcal{R},\Sigma,T)$ where

  • V is a finite set called the variables;

  • Σ is a finite set, disjoint from V, called the terminals;

  • $\mathcal{R}$ is a finite relation in $V\times(V\cup\Sigma\cup\{\epsilon\})^*$, where $^*$ represents the Kleene star operation and $\epsilon$ denotes the empty string;

  • $T\in V$ is the start variable.

Elements of $\mathcal{R}$ are referred to as rules and are generally given in the form $A\to w$, where $A\in V$ and $w\in(V\cup\Sigma\cup\{\epsilon\})^*$. If $u,v,w\in(V\cup\Sigma\cup\{\epsilon\})^*$ and $A\to w$ is a rule of the grammar, we say that $uAv$ yields $uwv$, written as $uAv\Rightarrow uwv$. We say that $u$ derives $v$, written $u\Rightarrow^* v$, if $u=v$ or if there exists a sequence $u_1,u_2,\ldots,u_k$, for $k\ge 1$, such that $u_i\in(V\cup\Sigma\cup\{\epsilon\})^*$ and $u\Rightarrow u_1\Rightarrow u_2\Rightarrow\cdots\Rightarrow u_k\Rightarrow v$. The language of a CFG is the set of all strings $w\in\Sigma^*$ that can be derived from the start variable $T$, that is, $\{w\in\Sigma^*\ \text{s.t.}\ T\Rightarrow^* w\}$. A CFG is in Chomsky Normal Form (CNF) if every rule is of the form $A\to BC$, $A\to a$, or $T\to\epsilon$, where $A$ is any variable in $V$, $B$ and $C$ can be any pair of variables in $V\setminus\{T\}$, and $a$ can be any terminal in $\Sigma$. Any CFG can be transformed into an equivalent one in CNF through a simple iterative algorithm [41]. We refer to the number of rules $|\mathcal{R}|$ as the size of a CFG. We denote the set of binary rules ($A\to BC$) in a CFG with $\mathcal{R}_B$, the set of terminal rules ($A\to a$) with $\mathcal{R}_T$, and the number of distinct right-hand sides in $\mathcal{R}_B$ with $\Gamma$.

Given a CFG $(V,\mathcal{R},\Sigma,T)$ in CNF and a string $w=w_1w_2\cdots w_n$, the Cocke-Younger-Kasami (CYK) algorithm, henceforth referred to as $\mathcal{A}_{CYK}$, decides whether $w$ is part of the language of the CFG. We refer to the standard implementation provided in Algorithm 5. Let $S(i,j)$ be the set of variables that derive the substring $w_iw_{i+1}\cdots w_j$. CYK uses a DP approach to compute all sets $S(i,j)$ for $1\le i\le j\le n$ using an adaptation of Prototype Algorithm $\mathcal{A}$. $\mathcal{A}_{CYK}$ determines whether $w$ is a member of the language generated by the given CFG by checking whether $T\in S(1,n)$ (i.e., whether $T$ derives $w$).

Algorithm 5 Pseudocode for the Cocke-Younger-Kasami algorithm.
Memory representation model.

We assume that the variables in $V$ can be referenced by index and that such an index can be stored in a single memory word. We represent subsets of $V$ (the results of subproblems $S(i,j)$) using a one-hot encoding with $\lceil|V|/s\rceil$ memory words, where $s$ denotes the number of bits in a memory word. Thus, given the representation of one such subset in memory, to check whether a variable $V_i$ is a member of the subset it is sufficient to access the $(i\bmod s)$-th bit of the $\lfloor i/s\rfloor$-th memory word.

We assume that each grammar rule is represented using at most three memory words: one to represent the index of the left-hand side, and at most two to represent the right-hand side (either the indices of the two variables or the code for a terminal symbol).
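A minimal Python sketch of Algorithm 5 under this representation may be useful (assumed, for illustration only: a Python integer stands in for the $\lceil|V|/s\rceil$-word one-hot encoding, so that membership of variable $A$ is bit $A$ of the integer, and the rule formats are hypothetical):

```python
def cyk(w, binary_rules, terminal_rules, start):
    """Decides membership of w in the language of a CNF grammar.

    binary_rules: list of (A, B, C) for rules A -> BC, with variables
    given as integer indices; terminal_rules: list of (A, a) for rules
    A -> a; start: index of the start variable T. Each S(i, j) is the
    bitset of variables deriving the substring w_i..w_j.
    """
    n = len(w)
    S = {}
    for i in range(1, n + 1):                  # subproblems S(i, i)
        S[(i, i)] = 0
        for A, a in terminal_rules:
            if a == w[i - 1]:
                S[(i, i)] |= 1 << A
    for l in range(1, n):                      # growing lengths
        for i in range(1, n - l + 1):
            j = i + l
            cur = 0
            for k in range(l):                 # split points
                left, right = S[(i, i + k)], S[(i + k + 1, j)]
                for A, B, C in binary_rules:
                    if (left >> B) & 1 and (right >> C) & 1:
                        cur |= 1 << A
            S[(i, j)] = cur
    return (S[(1, n)] >> start) & 1 == 1
```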

5.2 I/O lower bound

While the CYK algorithm shares the underlying structure of relations between subproblems of the prototype algorithm $\mathcal{A}$, it presents several non-trivial complications due to the application of the grammar rules that, in turn, require a refinement of our analysis methods and, in particular, of the CDAG construction. In this section, we assume that the set of grammar rules is such that no two rules have the same right-hand side. While this might not always hold in practice, our analysis can still be applied by considering a subset of the rules of the considered grammar which satisfies this condition.

Figure 3: Each subproblem $S(i,j)$ is represented by multiple variable roots (in black), each of which is the result of the composition of multiple grammar roots (in purple), which in turn are the composition of multiple leaves (in green).
CYK CDAG construction.

The construction of the CDAG representing CYK’s execution for an input string $w$ with $n$ characters and a CFG in CNF $(V,\mathcal{R},\Sigma,T)$, called $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$, follows a recursive structure which uses as a basis $G=(V,E)\in\mathcal{G}(n)$. We describe the necessary modifications focusing on the sub-CDAG for a single $S(i,j)$ subproblem. A visual depiction of the ensuing description is provided in Figure 3:

  • For the base of the construction, for each $S(i,i)$, the CDAG has $\lceil|V|/s\rceil$ vertices, each corresponding to part of the encoding of the set of possible variables of the grammar. These vertices serve as the inputs of the CDAG.

  • Each R-vertex of $G$, corresponding to the computation of the solution of one of the subproblems $S(i,j)$, is replaced by $\lceil|V|/s\rceil$ vertices, each representing a part of the encoding of the set of variables $V$. We refer to this set of vertices as “variable roots” ($\mathit{VR}$).

  • Consider a variable root $v$ corresponding to subproblem $S(i,j)$ and variable $A\in V$: $v$ forms the root of a binary tree, which has as its leaves vertices for each rule having $A$ as its left-hand side. We denote the set of these vertices as “grammar roots” ($\mathit{GR}$). Note that, as only one variable can appear on the left-hand side of a rule, the trees composed in this way are vertex-disjoint. Further, for each $S(i,j)$ there will be at least $\Gamma/s$ such vertices.

  • For each subproblem $S(i,j)$, consider one of its grammar roots $u$, corresponding to a rule $\gamma\in\mathcal{R}_B$: $u$ forms the root of a binary tree combining the results of $j-i$ “leaf vertices” (L-vertices). Each leaf vertex corresponds to a combination of the subproblems $S(i,k)$ and $S(k+1,j)$, for $i\le k<j$. Such a vertex has as predecessors the variable-root vertices of subproblems $S(i,k)$ and $S(k+1,j)$ encoding the variables appearing in the right-hand side of rule $\gamma$.

Given the CDAG $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$, we generalize the notion of rows (resp., columns) introduced in Section 3 by having row $r_i$ (resp., column $c_j$) include all vertices in $\mathit{VR}$ corresponding to the encoding of the results of subproblems $S(i,j)$ for $1\le j\le n$ (resp., for $1\le i\le n$). The definition of the W-cover generalizes to $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$ as well. We observe that a property analogous to that outlined in Lemma 1 holds for $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$:

Lemma 10.

Given a CDAG $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$, let $W_1,\ldots,W_{n-1}$ denote its W-cover. Any set of $x$ $\mathit{VR}$-vertices in $W_i$ contains at most $x^2/4$ interacting pairs, each of which forms the predecessors of a unique L-vertex belonging to a unique $\mathit{GR}$-vertex.

Proof.

Consider any set of $x$ $\mathit{VR}$-vertices in $W_i$ containing $x_c$ vertices in column $i$ and $x-x_c$ vertices in row $i+1$. Since vertices in the same row (or column) do not interact, there are at most $x_c(x-x_c)\le x^2/4$ interacting pairs.

Consider any two such distinct interacting pairs, $(v^A_{k,i},v^B_{i+1,l})$ and $(v^C_{m,i},v^D_{i+1,n})$ (where $v^A_{k,i}$ is the $\mathit{VR}$-vertex corresponding to variable $A$ in row $k$ and column $i$). Any leaf produced by $(v^A_{k,i},v^B_{i+1,l})$ must correspond to a rule with right-hand side “$AB$”. From the assumption that every rule in $\mathcal{R}$ has a unique right-hand side, it follows that there is only one such rule and therefore $(v^A_{k,i},v^B_{i+1,l})$ must produce a single leaf.

To show that the leaves produced by $(v^A_{k,i},v^B_{i+1,l})$ and $(v^C_{m,i},v^D_{i+1,n})$ belong to distinct $\mathit{GR}$-vertices, consider the following two cases:

  1. If $k\ne m$ or $l\ne n$, it follows from Lemma 1 that the leaves produced belong to different roots, and therefore must also belong to different grammar roots.

  2. Otherwise, it must be the case that $A\ne C$ or $B\ne D$ (since the two interacting pairs are assumed to be distinct). It follows that the two leaves produced correspond to different rules in $\mathcal{R}$ and therefore belong to distinct grammar roots.

I/O lower bound proof.

We analyze the I/O complexity of any sequential computation of the CYK algorithm by analyzing the CDAG corresponding to the given CFG and the input string to be considered. We first state two results, corresponding, respectively, to modifications of Lemma 3 and Lemma 6 for the family of CDAGs representing executions of the CYK algorithm:

Lemma 11.

Given a CDAG $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$, let $L'\subseteq L$ be such that the vertices in $L'$ belong to the tree sub-DAGs of at most $\rho$ distinct $\mathit{GR}$-vertices. The set $R'\subseteq\mathit{VR}$ that includes all $\mathit{VR}$-vertices that are predecessors of at least one vertex in $L'$ has cardinality at least $|L'|/\sqrt{\rho}$.

Proof.

Consider the W-cover of $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$, with the adapted definition for the CYK CDAG discussed in Section 5.2. As any $\mathit{VR}$-vertex can be a member of at most two distinct $W_i$’s, we have:

$$|R'|\ \ge\ \frac{1}{2}\sum_{i=1}^{n-1}|R'\cap W_i|. \qquad (7)$$

By definition, each vertex in $L'$ has a distinct pair of $\mathit{VR}$-vertex predecessors which, by construction, are both included in exactly one set $W_i$. Let $a_i$ be the number of vertices from $L'$ with both predecessors in $W_i$. By the assumptions, we have:

$$\sum_{i=1}^{n-1}a_i=|L'|. \qquad (8)$$

By Lemma 10, for any $W_i$, the interacting pairs of $\mathit{VR}$-vertices in it are the predecessors of distinct L-vertices, each belonging to the tree sub-CDAG of a distinct $\mathit{GR}$-vertex. From the assumption that the vertices in $L'$ belong to at most $\rho$ distinct grammar roots, we have:

$$\max_{i=1,\ldots,n-1}a_i\ \le\ \rho. \qquad (9)$$

Furthermore, from Lemma 10, we know that the vertices in $R'\cap W_i$ can interact to produce at most $|R'\cap W_i|^2/4$ vertices in $L'$. Thus, $|R'\cap W_i|^2/4\ge a_i$ for each $i$, and hence:

$$\sum_{i=1}^{n-1}|R'\cap W_i|\ \ge\ \sum_{i=1}^{n-1}2\sqrt{a_i}. \qquad (10)$$

By (7), (10), (8), (9), and Lemma 2 (in this order), we have:

$$|R'|\ \ge\ \frac{1}{2}\sum_{i=1}^{n-1}|R'\cap W_i|\ \ge\ \sum_{i=1}^{n-1}\sqrt{a_i}\ \ge\ \frac{|L'|}{\sqrt{\rho}}.$$

The proof follows a reasoning analogous to that used to prove Lemma 3, adapted for the CDAG $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$ by using the property of the L-vertices of the CYK CDAG stated in Lemma 10 in place of Lemma 1.

Lemma 12.

Given $G_{CYK}(n,(V,\mathcal{R},\Sigma,T))$, let $A=\{v_{i_1,j_1},v_{i_2,j_2},\ldots\}$ be a subset of $\mathit{GR}$-vertices, where $v_{i_k,j_k}$ is a $\mathit{GR}$-vertex corresponding to subproblem $S(i_k,j_k)$. Any dominator set $D$ of $A$ has cardinality $|D|\ge\min\left(|A|,\min_k(j_k-i_k)\right)$.

The proof corresponds to that of Lemma 6, where the tree sub-CDAGs rooted in $\mathit{GR}$-vertices are considered instead of those rooted in the R-vertices of CDAGs in $\mathcal{G}(n)$.

A lower bound to the I/O complexity of the CYK algorithm can be obtained by following steps analogous to those used in the proof of Theorem 7 with opportune adjustments matching the modifications in the enhanced CDAG described in this section.

Theorem 13.

Consider a string $w$ of length $n$ and a CFG $(V,\mathcal{R},\Sigma,T)$ in CNF. The number of I/O operations executed by $\mathcal{A}_{CYK}$ when deciding whether $w$ is a member of the language of the given grammar, using a machine equipped with a cache memory of size $M$ and such that $B$ contiguous memory words can be moved with each I/O operation, is bounded as:

$$IO_{G_{CYK}}(n,M,B)\ \ge\ \max\left\{\frac{(n-6M-1)^3}{18\sqrt{M}}\Gamma-2M\Gamma,\ n\right\}\frac{1}{B}$$

where $\Gamma$ denotes the number of distinct right-hand sides in $\mathcal{R}_B$.

By Theorem 13, we have that for $M\le c_1n$, where $c_1$ is a sufficiently small positive constant, computations of the CYK algorithm require $\Omega\left(\frac{n^3\Gamma}{\sqrt{M}B}\right)$ I/O operations. Note that if the considered CFG is such that $\Gamma=\Theta(|\mathcal{R}_B|)$, our lower bound shows a direct dependence of the I/O complexity of executions of the CYK algorithm on the size of the considered grammar; indeed, there are many grammars for which this is the case. Finally, while the result of Theorem 13 holds for computations with recomputation, it is also possible to obtain an I/O lower bound for schedules in which recomputation is not allowed by modifying the analysis given in Theorem 4.

5.3 On the tightness of the bounds

We refine Valiant’s DP-Closure Algorithm [13] to obtain an I/O-efficient recursive divide-and-conquer implementation of the CYK algorithm, henceforth referred to as $\mathcal{A}^*_{CYK}$. While the CYK algorithm follows the general structure of Prototype Algorithm 1, and is hence amenable to the approach in [13], the analysis in [13] does not directly apply to the case in which the size of the given grammar cannot be considered constant with respect to the size of the cache. Our algorithm $\mathcal{A}^*_{CYK}$ manages I/O efficiently by opportunely iterating over the grammar rules and by removing redundant computations due to grammar rules with the same right-hand side.

Given an input string $w$ of length $n\ge 1$ and a CFG $(V,\mathcal{R},\Sigma,T)$ in CNF, we use an $n\times n$ matrix $\mathbf{X}$ to represent the values of the subproblems computed by the CYK algorithm (corresponding to the $\mathit{VR}$-vertices defined in Section 5.2): $\mathbf{X}[i,j+1]$ corresponds to the solution of subproblem $S(i,j)$, i.e., a sequence of $|V|$ binary values, where the entry corresponding to variable $A\in V$ indicates whether $A$ can derive the substring $w_iw_{i+1}\cdots w_j$ of the input.

First, $\mathcal{A}^*_{CYK}$ computes the input values corresponding to subproblems $S(i,i)$ for $1\le i\le n$ (i.e., the values of $\mathbf{X}[i,i+1]$): each character in the input string $w$ is compared with the right-hand side of each terminal rule in $\mathcal{R}_T$, and the appropriate values are stored. The rest of the computation involves executing Valiant’s DP-Closure Algorithm (Algorithm 2) using a modified version of Valiant’s Star Algorithm (Algorithm 3), called “Valiant’s Star Algorithm for CYK” ($\mathrm{VSTAR}_{CYK}$), which is presented in Algorithm 6.

Algorithm 6 $\mathrm{VSTAR}_{CYK}$, Valiant’s Star Algorithm for CYK.

$\mathrm{VSTAR}_{CYK}$ introduces two modifications over Valiant’s Star Algorithm. First, the algorithm iterates over the rules of the grammar with distinct right-hand sides in lines 8, 11, and 14, calling the subroutine Matrix Multiply and Accumulate with the variable roots corresponding to each rule in the grammar. Second, to avoid computing rules with identical right-hand sides more than once, let $\mathcal{B}_\Gamma$ denote a maximum-cardinality set of binary rules from $\mathcal{R}_B$, each having a distinct right-hand side; in this set, the corresponding left-hand sides are replaced with placeholder variables. For any subproblem, once the value corresponding to a rule in $\mathcal{B}_\Gamma$ is fully computed, it can be used to update the values corresponding to all variable roots producing the same right-hand side through a rule in $\mathcal{R}_B$. This is done in lines 4 and 5, as reaching the base case implies that all leaves for the subproblem have been computed.
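A hypothetical sketch of the preprocessing implied by the second modification (the names `dedup_rhs`, `B_gamma`, and `lhs_for` are illustrative and not taken from the paper):

```python
from collections import defaultdict

def dedup_rhs(binary_rules):
    """Group the binary rules R_B by right-hand side, keeping one
    placeholder rule per distinct RHS (the set B_Gamma, of cardinality
    Gamma) together with the original left-hand sides it stands for."""
    lhs_for = defaultdict(list)
    for A, B, C in binary_rules:
        lhs_for[(B, C)].append(A)
    # placeholder variable p indexes the Gamma distinct right-hand sides
    B_gamma = [(p, B, C) for p, (B, C) in enumerate(sorted(lhs_for))]
    return B_gamma, lhs_for

# Once the value of placeholder p is fully computed for a subproblem,
# it is propagated to every variable in lhs_for[(B, C)] (cf. lines 4-5
# of Algorithm 6), so each distinct RHS is evaluated only once.
```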

Theorem 14.

Given an input string $w$ of length $n$ and a CFG $(V,\mathcal{R},\Sigma,T)$ in CNF, algorithm $\mathcal{A}^*_{CYK}$ decides whether $w$ is a member of the language of the given grammar. When run on a machine equipped with a cache of size $M$, the number of I/O operations executed by $\mathcal{A}^*_{CYK}$ can be bounded as:

$$IO_{\mathcal{A}^*_{CYK}}(n,M)\in\mathcal{O}\left(\left(\frac{n^3\Gamma}{\sqrt{M}}+n^2\Gamma\log M+n^2|\mathcal{R}_B|+n|\mathcal{R}_T|\right)\frac{1}{B}\right).$$
Proof.

When computing $S(i,i)$ for $1\le i\le n$, $\mathcal{A}^*_{CYK}$ compares each character of the input string $w$ with the right-hand side of each terminal rule in $\mathcal{R}_T$. Doing so incurs at most $\mathcal{O}(n|\mathcal{R}_T|/B)$ I/O operations.

In the following, we use $n$ to denote the dimension of the input (i.e., $n=\dim(\mathbf{X})$). The number of I/O operations executed by Valiant’s Star Algorithm for CYK is at most:

$$IO_{\mathrm{VSTAR}_{CYK}}(n)\le\begin{cases}\mathcal{O}\left(\frac{|\mathcal{R}_B|}{B}\right)&\text{if } n\le 2,\\ 4\,IO_{\mathrm{VSTAR}_{CYK}}\left(\frac{n}{2}\right)+4\Gamma\left(IO_{MMA}\left(\frac{n}{4}\right)+\mathcal{O}(1)\right)&\text{otherwise.}\end{cases} \qquad (11)$$

By [13, Lemma 2.1], the number of I/O operations executed by Matrix Multiply and Accumulate is:

$$IO_{MMA}(n)\le\begin{cases}\mathcal{O}\left(\frac{n^2}{B}\right)&\text{if } n\le 2\sqrt{M},\\ \mathcal{O}\left(\frac{n^3}{B\sqrt{M}}\right)&\text{otherwise.}\end{cases} \qquad (12)$$

Solving the recursion in (11) using (12) yields:

$$IO_{\mathrm{VSTAR}_{CYK}}(n)\ \le\ \mathcal{O}\left(\frac{n^3\Gamma}{B\sqrt{M}}\right)+4^{\log(n/\sqrt{M})}\,\mathcal{O}\left(\frac{M\Gamma}{B}\log M+\frac{M|\mathcal{R}_B|}{B}\right)\ \le\ \mathcal{O}\left(\frac{n^3\Gamma}{B\sqrt{M}}+\frac{n^2\Gamma}{B}\log M+\frac{n^2|\mathcal{R}_B|}{B}\right) \qquad (13)$$

Finally, by adapting the analysis in [13], we can bound the number of I/O operations executed by the proposed algorithm $\mathcal{A}^*_{CYK}$ as:

$$IO_{\mathcal{A}^*_{CYK}}(n)\le\begin{cases}\mathcal{O}(1)&\text{if } n\le 2,\\ 2\,IO_{\mathcal{A}^*_{CYK}}(n/2)+IO_{\mathrm{VSTAR}_{CYK}}(n)+\mathcal{O}(1)&\text{otherwise.}\end{cases}$$

Expanding the recurrence, using (13), and adding the cost of processing the input, we have:

$$IO_{\mathcal{A}^*_{CYK}}(n)\in\mathcal{O}\left(\frac{n^3\Gamma}{B\sqrt{M}}+\frac{n^2\Gamma}{B}\log M+\frac{n^2|\mathcal{R}_B|}{B}+\frac{n|\mathcal{R}_T|}{B}\right).$$

When $M<c_1n$, where $c_1$ is a sufficiently small positive constant, the lower bound in Theorem 13 simplifies to $\Omega\left(\frac{n^3\Gamma}{B\sqrt{M}}\right)$, while the upper bound in Theorem 14 becomes $\mathcal{O}\left(\frac{n^3\Gamma}{B\sqrt{M}}+\frac{n^2|\mathcal{R}_B|}{B}+\frac{n|\mathcal{R}_T|}{B}\right)$. Hence, if $(n|\mathcal{R}_B|+|\mathcal{R}_T|)/\Gamma\in\mathcal{O}(n^2/\sqrt{M})$, the lower bound asymptotically matches the upper bound. When disallowing recomputation, a similar analysis shows that our upper and lower bounds match under the same condition for $M<c_2n$, where $c_2$ is a sufficiently small positive constant.

6 Conclusion

This work contributed to the characterization of the I/O complexity of Dynamic Programming algorithms by establishing asymptotically tight lower bounds for a general class of DP algorithms sharing a common structure of subproblem dependence. Our technique exploits common properties of the CDAGs corresponding to said algorithms, which makes it promising for the analysis of other families of recursive algorithms, in particular other families of DP algorithms of interest. The generality of our technique is further showcased by the ability to extend it to more complex algorithms, such as the Cocke-Younger-Kasami algorithm, for which we provide an (almost) asymptotically tight I/O lower bound and a matching algorithm.

Our analysis yields lower bounds both for computations in which no intermediate value is computed more than once and for general computations allowing recomputation. By doing so, we reveal that when the size of the available cache is at least $2n$ and $o(n^2)$, schedules using recomputation can achieve an asymptotic reduction of the number of required I/O operations by a factor $\Theta(n^2/\sqrt{M})$. This is particularly significant as in many cases of interest (e.g., Matrix Multiplication, Fast Fourier Transform) recomputation has been shown to enable a reduction of the I/O cost by at most a constant multiplicative factor.

Although it is known that recomputation can decrease the space and I/O complexity of certain CDAGs, we are still far from characterizing those CDAGs for which recomputation proves effective. This overarching objective remains a challenge for any efforts aimed at developing a general theory of the communication requirements of computations.

References

  • [1] Alok Aggarwal and Jeffrey S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, 1988. doi:10.1145/48529.48535.
  • [2] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Communication-optimal parallel and sequential Cholesky decomposition. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, pages 245–252, 2009.
  • [3] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Graph expansion and communication costs of fast matrix multiplication. Journal of the ACM (JACM), 59(6):1–23, 2013. doi:10.1145/2395116.2395121.
  • [4] Richard Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–515, 1954.
  • [5] Gianfranco Bilardi and Lorenzo De Stefani. The I/O complexity of Strassen’s matrix multiplication with recomputation. In Workshop on Algorithms and Data Structures, pages 181–192. Springer, 2017. doi:10.1007/978-3-319-62127-2_16.
  • [6] Gianfranco Bilardi and Lorenzo De Stefani. The I/O complexity of Toom-Cook integer multiplication. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2034–2052. SIAM, 2019. doi:10.1137/1.9781611975482.123.
  • [7] Gianfranco Bilardi and Lorenzo De Stefani. The DAG Visit Approach for Pebbling and I/O Lower Bounds. In Proc. FSTTCS 2022, volume 250, pages 7:1–7:23, Dagstuhl, Germany, 2022. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.FSTTCS.2022.7.
  • [8] Gianfranco Bilardi and Enoch Peserico. A characterization of temporal locality and its portability across memory hierarchies. In International Colloquium on Automata, Languages, and Programming, pages 128–139. Springer, 2001. doi:10.1007/3-540-48224-5_11.
  • [9] Gianfranco Bilardi, Andrea Pietracaprina, and Paolo D’Alberto. On the space and access complexity of computation DAGs. In Graph-Theoretic Concepts in Computer Science: 26th International Workshop, WG 2000 Konstanz, Germany, June 15–17, 2000 Proceedings 26, pages 47–58. Springer, 2000. doi:10.1007/3-540-40064-8_6.
  • [10] Gianfranco Bilardi and Franco P Preparata. Horizons of parallel computation. Journal of Parallel and Distributed Computing, 27(2):172–182, 1995. doi:10.1006/JPDC.1995.1080.
  • [11] Gianfranco Bilardi and Franco P Preparata. Processor—time tradeoffs under bounded-speed message propagation: Part II, lower bounds. Theory of Computing Systems, 32(5):531–559, 1999. doi:10.1007/S002240000131.
  • [12] Guy E. Blelloch and Yan Gu. Improved parallel cache-oblivious algorithms for dynamic programming and linear algebra. arXiv preprint, 2018. arXiv:1809.09330.
  • [13] Cary Cherng and Richard E Ladner. Cache efficient simple dynamic programming. Discrete Mathematics & Theoretical Computer Science, 2005.
  • [14] Rezaul Chowdhury, Pramod Ganapathi, Yuan Tang, and Jesmin Jahan Tithi. Provably efficient scheduling of cache-oblivious wavefront algorithms. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 339–350, 2017. doi:10.1145/3087556.3087586.
  • [15] Rezaul Chowdhury, Pramod Ganapathi, Stephen Tschudi, Jesmin Jahan Tithi, Charles Bachmeier, Charles E Leiserson, Armando Solar-Lezama, Bradley C Kuszmaul, and Yuan Tang. Autogen: Automatic discovery of efficient recursive divide-and-conquer algorithms for solving dynamic programming problems. ACM Transactions on Parallel Computing (TOPC), 4(1):1–30, 2017. doi:10.1145/3125632.
  • [16] Rezaul Alam Chowdhury and Vijaya Ramachandran. Cache-oblivious dynamic programming. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 591–600. Citeseer, 2006. URL: http://dl.acm.org/citation.cfm?id=1109557.1109622.
  • [17] Rezaul Alam Chowdhury and Vijaya Ramachandran. The cache-oblivious gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation. In Proceedings of the nineteenth annual ACM Symposium on Parallel Algorithms and Architectures, pages 71–80, 2007. doi:10.1145/1248377.1248392.
  • [18] Rezaul Alam Chowdhury and Vijaya Ramachandran. Cache-efficient dynamic programming algorithms for multicores. In Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures, pages 207–216, 2008. doi:10.1145/1378533.1378574.
  • [19] Rezaul Alam Chowdhury, Hai-Son Le, and Vijaya Ramachandran. Cache-oblivious dynamic programming for bioinformatics. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3):495–510, 2008.
  • [20] John Cocke. Programming languages and their compilers: Preliminary notes. New York University, 1969.
  • [21] Lorenzo De Stefani and Vedant Gupta. On the I/O complexity of the CYK algorithm and of a family of related DP algorithms. arXiv preprint arXiv:2410.20337, 2024. doi:10.48550/arXiv.2410.20337.
  • [22] Erik D Demaine, Andrea Lincoln, Quanquan C Liu, Jayson Lynch, and Virginia Vassilevska Williams. Fine-grained I/O complexity via reductions: New lower bounds, faster algorithms, and a time hierarchy. arXiv preprint arXiv:1711.07960, 2017. arXiv:1711.07960.
  • [23] Xiangyun Ding, Yan Gu, and Yihan Sun. Parallel and (nearly) work-efficient dynamic programming. In Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, pages 219–232, 2024. doi:10.1145/3626183.3659958.
  • [24] Aaron Dunlop, Nathan Bodenstab, and Brian Roark. Reducing the grammar constant: an analysis of CYK parsing efficiency. Technical report, Technical report CSLU-2010-02, OHSU, 2010.
  • [25] Reza Farivar, Harshit Kharbanda, Shivaram Venkataraman, and Roy H Campbell. An algorithm for fast edit distance computation on GPUs. In 2012 Innovative Parallel Computing (InPar), pages 1–9. IEEE, 2012.
  • [26] Matteo Frigo, Charles E Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-oblivious algorithms. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pages 285–297. IEEE, 1999. doi:10.1109/SFFCS.1999.814600.
  • [27] Zvi Galil and Kunsoo Park. Dynamic programming with convexity, concavity and sparsity. Theoretical Computer Science, 92(1):49–76, 1992. doi:10.1016/0304-3975(92)90135-3.
  • [28] Te Chiang Hu and Man Tak Shing. Computation of matrix chain products. Part I. SIAM Journal on Computing, 11(2):362–373, 1982. doi:10.1137/0211028.
  • [29] Shachar Itzhaky, Rohit Singh, Armando Solar-Lezama, Kuat Yessenov, Yongquan Lu, Charles Leiserson, and Rezaul Chowdhury. Deriving divide-and-conquer dynamic programming algorithms using solver-aided transformations. ACM SIGPLAN Notices, 51(10):145–164, 2016.
  • [30] Mohammad Mahdi Javanmard, Pramod Ganapathi, Rathish Das, Zafar Ahmad, Stephen Tschudi, and Rezaul Chowdhury. Toward efficient architecture-independent algorithms for dynamic programs. In High Performance Computing: 34th International Conference, ISC High Performance 2019, Frankfurt/Main, Germany, June 16–20, 2019, Proceedings 34, pages 143–164. Springer, 2019. doi:10.1007/978-3-030-20656-7_8.
  • [31] Jia-Wei Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 326–333, 1981.
  • [32] Tadao Kasami. An efficient recognition and syntax-analysis algorithm for context-free languages. Coordinated Science Laboratory Report no. R-257, 1966.
  • [33] Donald E. Knuth. Optimum binary search trees. Acta informatica, 1:14–25, 1971. doi:10.1007/BF00264289.
  • [34] Richard Koch, Tom Leighton, Bruce Maggs, and Satish Rao. Work-preserving emulations of fixed-connection networks. In Proceedings of the twenty-first annual ACM symposium on Theory of computing, pages 227–240, 1989.
  • [35] Martin Lange and Hans Leiß. To CNF or not to CNF? An efficient yet presentable version of the CYK algorithm. Informatica Didactica, 8(2009):1–21, 2009. URL: http://www.informatica-didactica.de/cmsmadesimple/index.php?page=LangeLeiss2009.
  • [36] J-S Park, Michael Penner, and Viktor K Prasanna. Optimizing graph algorithms for improved cache performance. IEEE Transactions on parallel and distributed systems, 15(9):769–782, 2004. doi:10.1109/TPDS.2004.44.
  • [37] Cynthia A Patterson, Marc Snir, and Susan L Graham. Getting up to speed: The future of supercomputing. National Academies Press, 2005.
  • [38] Vijaya Ramachandran. Cache-oblivious computation: Algorithms and experimental evaluation. In 2007 International Conference on Computing: Theory and Applications (ICCTA’07), pages 20–26. IEEE, 2007. doi:10.1109/ICCTA.2007.34.
  • [39] John E Savage. Extending the Hong-Kung model to memory hierarchies. In International Computing and Combinatorics Conference, pages 270–281. Springer, 1995. doi:10.1007/BFB0030842.
  • [40] Jacob Scott, Olga Holtz, and Oded Schwartz. Matrix multiplication I/O-complexity by path routing. In Proceedings of the 27th ACM symposium on Parallelism in Algorithms and Architectures, pages 35–45, 2015. doi:10.1145/2755573.2755594.
  • [41] Michael Sipser. Introduction to the Theory of Computation. ACM SIGACT News, 27(1):27–29, 1996. doi:10.1145/230514.571645.
  • [42] Xinying Song, Shilin Ding, and Chin-Yew Lin. Better binarization for the CYK parsing. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 167–176, 2008. URL: https://aclanthology.org/D08-1018/.
  • [43] Shanjiang Tang, Ce Yu, Jizhou Sun, Bu-Sung Lee, Tao Zhang, Zhen Xu, and Huabei Wu. Easypdp: An efficient parallel dynamic programming runtime system for computational biology. IEEE Transactions on Parallel and Distributed Systems, 23(5):862–872, 2011. doi:10.1109/TPDS.2011.218.
  • [44] Yuan Tang, Ronghui You, Haibin Kan, Jesmin Jahan Tithi, Pramod Ganapathi, and Rezaul A Chowdhury. Cache-oblivious wavefront: Improving parallelism of recursive dynamic programming algorithms without losing cache-efficiency. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 205–214, 2015.
  • [45] Leslie G. Valiant. General context-free recognition in less than cubic time. Journal of Computer and System Sciences, 10(2):308–315, 1975.
  • [46] F Frances Yao. Efficient dynamic programming using quadrangle inequalities. In Proceedings of the twelfth annual ACM symposium on Theory of computing, pages 429–435, 1980. doi:10.1145/800141.804691.
  • [47] Daniel H Younger. Recognition and parsing of context-free languages in time n3. Information and control, 10(2):189–208, 1967. doi:10.1016/S0019-9958(67)80007-X.