Higher-order Demand-driven Program Analysis

Developing accurate and efficient program analyses for languages with higher-order functions is known to be difficult. Here we define a new higher-order program analysis, Demand-Driven Program Analysis (DDPA), which extends well-known demand-driven lookup techniques found in first-order program analyses to higher-order programs. Obtaining good accuracy in this setting presents several unique challenges, including the need for a new method for demand-driven lookup of non-local variable values. DDPA is flow- and context-sensitive and provably polynomial-time. To efficiently implement DDPA, we develop a novel pushdown automaton metaprogramming framework, the Pushdown Reachability (PDR) automaton. The analysis is formalized and proved sound, and an implementation is described.


INTRODUCTION
Developing an accurate but efficient higher-order program analysis is hard. Building on older first-order abstract interpretations [8], a large array of higher-order analyses has been developed over the past 20 years [11, 22, 24, 32-35, 44, 51, 52]. Unfortunately, in spite of all the technical advances, these analyses are not commonly employed in compilers for higher-order functional languages today: MLton [54] and Stalin [45] are among the few whole-program optimizing compilers that use these tools. Even in those cases, the analyses are often used in their less-expressive versions. The primary reason is that higher-order functions, being significantly more complex to analyze than first-order code, make it difficult to obtain both expressiveness and efficiency.
[…] analysis in Section 6. The decidability of the analysis is presented in Section 7. In Section 8, we present the theory of PDR systems as an efficient technique for implementing the analysis.
In the theoretical presentation, we focus on the pure call-by-value λ-calculus to avoid being overwhelmed by details; in Section 9, we outline how additional language features may be incorporated. In Section 10, we evaluate our implementation of this more featureful language in terms of running time and precision.
We discuss related work in Section 11 and conclude in Section 12. This article extends the original conference paper on DDPA [36] with a PDR-based reference implementation. Sections 3, 8, and 10 are new. Extended explanations and a more complete theoretical analysis, including a proof that the unbounded-stack ωDDPAc analysis is a full and faithful λ-calculus interpreter, are also included.

ANALYSIS OVERVIEW
This section informally presents DDPA by example. We begin with a high-level description of the analysis and roughly sketch how it works on a small program. This description is incomplete: We will show how the analysis as described is too imprecise, and we will then describe another feature of the analysis that in fact addresses this imprecision. This process will be repeated several times until the entire DDPA algorithm has been described. This section does not touch on implementation; Section 3 will describe at a high level how DDPA may be efficiently implemented.

A Simple Language
We use a simple functional language defined in Figure 1. It is a call-by-value λ-calculus extended with conditionals (via pattern matching), records, and state (via ML-like reference cells). Conditionals are written $x \sim p\ ?\ f_1 : f_2$: either $f_1$ or $f_2$ is called with x as argument, depending on whether x's value matches the pattern p. We require that programs are closed and that variable identifiers are unique ("alphatized"). We use an A-normal form (ANF) [17] intermediate representation, so a clause c denotes a program point. The operational semantics for this language is straightforward and deferred to Section 5.1.

The Basic Analysis
Consider the program in Figure 2 and the steps to construct its Control-Flow Graph (CFG) in Figure 3.
[…] program know that actual execution would never output the latter, because the execution path leading to it does not make sense: It includes (the entrance node for ) and (the exit node for ), which means a function is called at and returns to . The lookup described thus far does not rule out this misalignment and non-deterministically explores all possible paths, over-approximating the answer and losing precision. To resolve this, we introduce another component to the lookup process: the context stack. The lookup question changes to the form "What are the possible values for variable x at with context stack [a, b, c]?" We push the call site name onto the context stack when we visit an exit node for a function call (in reverse), and we only visit an entrance node for which we can pop the corresponding call site. For example, when we visit , we push n onto the context stack, and, when we are at , we only follow the path to and discard the path to , because n is on the top of the stack. Then the final result will be exact: "only is a possible value for n by the end of the program."
Numerous higher-order analyses have been developed that show how higher-order functions can be analyzed, and some of those analyses also include call-return alignment mechanisms [11,19,24,37,51]. But the call-return alignment in these analyses does not fully solve the higher-order call-return alignment problem: These analyses essentially port the first-order notion of call-return alignment and, since first-order programs have no non-locals, they only align local variables. For example, PDCFA [24] and CFA2 [51] only align the so-called stack references (locals), not the so-called heap references (non-locals). As a result, forward higher-order analyses cannot rely solely upon stack alignment to achieve the effect of polymorphism; an additional polymorphism mechanism must be employed.
This additional polymorphism machinery often generates redundant work (e.g., in cases where no non-local variables were used and call-return alignment would be sufficient) but must be conservatively applied to maintain precision. A key contribution of our previous work [36] is a method for aligning non-locals via the context stack discipline, which also makes DDPA a context-sensitive (polyvariant) analysis. This feature is commonly achieved by copying function bodies (kCFA being the canonical example [35]), but, in DDPA, context-sensitivity relies solely on call-return alignment. The subtleties of how this works are covered next.
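The context stack discipline can be illustrated with a small Python sketch. It is not the paper's implementation: the tuple encodings of clauses and the `lookup` function are hypothetical, only local variables are handled (no lookup stack yet), and the CFG is assumed to be acyclic. Context stacks are tuples with the top first, and the empty tuple plays the role of "no context information," so any call site may be popped from it. The example wires one shared body, r = x, into two call sites n = f a and m = f b.

```python
# Hypothetical clause encodings (not the paper's syntax):
#   ("val", x, v)          x = v   (v is an opaque abstract value)
#   ("alias", x, y)        x = y
#   ("call", x, site)      x = f a, the original call-site clause
#   ("enter", p, a, site)  wiring node: formal p = actual a, for call site `site`
#   ("exit", x, r, site)   wiring node: x = r (the body's return variable)

def lookup(clauses, preds, var, node, ctx=()):
    """Values of `var` just before `node`, searching backwards with a context stack.

    `ctx` is a tuple, top first; () means "no context information". Sketch only:
    locals only, and the CFG must be acyclic (there is no cycle detection)."""
    results = set()
    for p in preds.get(node, ()):
        kind, *args = clauses[p]
        if kind == "val" and args[0] == var:
            results.add(args[1])                                # definition found
        elif kind == "alias" and args[0] == var:
            results |= lookup(clauses, preds, args[1], p, ctx)  # switch subject
        elif kind == "exit":
            x, ret, site = args
            if x == var:  # enter the body in reverse: push the call site
                results |= lookup(clauses, preds, ret, p, (site,) + ctx)
            # an exit node that does not assign `var` is a dead end here
        elif kind == "enter":
            param, arg, site = args
            # leave the body toward the call site only if the pop is aligned
            if param == var and (not ctx or ctx[0] == site):
                results |= lookup(clauses, preds, arg, p, ctx[1:])
            # non-local variables are not handled in this sketch
        elif kind == "call" and args[0] == var:
            pass  # the value must be explained by the wired exit node instead
        else:
            results |= lookup(clauses, preds, var, p, ctx)      # skip unrelated clause
    return results

# a = "A"; n = f a; b = "B"; m = f b, where f = fun x -> (r = x)
clauses = {
    "a": ("val", "a", "A"),
    "n": ("call", "n", "n"),
    "n_in": ("enter", "x", "a", "n"),
    "r": ("alias", "r", "x"),            # the single, shared body of f
    "n_out": ("exit", "n", "r", "n"),
    "b": ("val", "b", "B"),
    "m": ("call", "m", "m"),
    "m_in": ("enter", "x", "b", "m"),
    "m_out": ("exit", "m", "r", "m"),
}
preds = {
    "n": ["a"], "n_in": ["a"], "r": ["n_in", "m_in"], "n_out": ["r"],
    "b": ["n", "n_out"], "m": ["b"], "m_in": ["b"], "m_out": ["r"],
    "end": ["m", "m_out"],
}
```

Removing the `ctx[0] == site` test makes the lookup of m from the end of the program return both "A" and "B", which is exactly the misaligned path described above.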

Non-Local Variable Lookup
The lookup process described so far only included local variables (stack references). We now address non-local variables (heap references). Consider the program in Figure 4. We build its CFG as described in the previous section (Figure 5) and start looking up the values of r from the end of the program. There is only one call to each function in the program, so we ignore the context stack and the call-return alignment issue for the rest of this example. We first move to , changing the lookup subject to b; then we go to , changing the lookup subject to x. At this point, the lookup process described so far would fail to capture lexical scoping and would instead look for the non-local variable in the context of the call site, as opposed to the scope in which the function was defined: It would skip , , and , because they do not immediately affect the value of our subject x, reach the beginning of the program without finding a value, and fail. The search process needs to be modified to take the lexical scoping rules of non-local variables into account.
The solution to this issue is inspired by access links used in compilers for higher-order functions. When visiting in the lookup above, we notice that x is not the function argument, so it must be a non-local variable that was in scope at the function definition, not its use. To properly abstract the notion of lexical scoping, we must defer the current lookup, find the program point defining the  function we are exiting (in reverse), and then resume from there. In our example, DDPA executes this by suspending the lookup for x and starting a subordinate lookup for the function we are exiting, p.
The subordinate lookup for p follows the rules we discussed thus far, entering . The subject of the lookup changes to a, which we find on the next step at . This is the definition of the function we were looking for, so we can resume the deferred lookup of x from there: We move to , change the subject to e, move to , and, finally, find the answer. The function definition in the subordinate lookup could be a non-local itself, so, in general, it is necessary to apply the above idea repeatedly, in a chain, akin to access links. This requires the introduction of a second stack, besides the context stack discussed above, which we call the lookup stack. The lookup stack contains the deferred lookups, and the lookup process is complete only when the lookup stack is empty. There are other uses for the lookup stack in DDPA; for example, to look up a record projection, the analysis has to look up the record and then the value under the projected key. DDPA keeps the projected key in the lookup stack from the program point containing the projection until the record is found. Because of this generality, we also refer to the lookup stack as a continuation stack.
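The lookup stack for non-locals can be sketched in the same style. This is again a hypothetical encoding, not DDPA's formal definition: context stacks are omitted (as in the example above, each function is called once), the CFG is assumed acyclic, and each entry node is assumed to record the variable f that names the called function. The example program is v = "V"; g = fun x -> (f = fun y -> (t = x); gr = f); h = g v; w = "W"; q = h w, in which t = x refers to the non-local x.

```python
# Hypothetical clause encodings for this sketch:
#   ("val", x, v)         x = v (functions are represented by opaque names)
#   ("alias", x, y)       x = y
#   ("call", x)           x = f a, the original call-site clause
#   ("enter", p, a, f)    wiring node: formal p = actual a, body of the function bound to f
#   ("exit", x, r)        wiring node: x = r, the body's return variable

def lookup(clauses, preds, stack, node):
    """Values of stack[0] before `node`; stack[1:] holds deferred lookups (top first).

    Context stacks are ignored and the CFG is assumed acyclic (sketch only)."""
    results = set()
    top, rest = stack[0], stack[1:]
    for p in preds.get(node, ()):
        kind, *args = clauses[p]
        if kind == "val" and args[0] == top:
            if rest:  # found the defining function: resume the deferred lookup here
                results |= lookup(clauses, preds, rest, p)
            else:     # nothing deferred: this is an answer
                results.add(args[1])
        elif kind == "alias" and args[0] == top:
            results |= lookup(clauses, preds, (args[1],) + rest, p)
        elif kind == "enter":
            param, arg, fvar = args
            if param == top:  # formal parameter: continue with the actual argument
                results |= lookup(clauses, preds, (arg,) + rest, p)
            else:             # non-local: defer `top` and first find the function
                results |= lookup(clauses, preds, (fvar, top) + rest, p)
        elif kind == "exit":
            x, ret = args
            if x == top:
                results |= lookup(clauses, preds, (ret,) + rest, p)
        elif kind == "call" and args[0] == top:
            pass  # the value must come through the wired exit node instead
        else:
            results |= lookup(clauses, preds, stack, p)  # unrelated clause: skip
    return results

clauses = {
    "v": ("val", "v", "V"),
    "g": ("val", "g", "fun_g"),
    "h_call": ("call", "h"),
    "g_in": ("enter", "x", "v", "g"),
    "f": ("val", "f", "fun_f"),
    "gr": ("alias", "gr", "f"),
    "g_out": ("exit", "h", "gr"),
    "w": ("val", "w", "W"),
    "q_call": ("call", "q"),
    "f_in": ("enter", "y", "w", "h"),
    "t": ("alias", "t", "x"),          # x is non-local inside f's body
    "f_out": ("exit", "q", "t"),
}
preds = {
    "g": ["v"], "h_call": ["g"], "g_in": ["g"], "f": ["g_in"], "gr": ["f"],
    "g_out": ["gr"], "w": ["h_call", "g_out"], "q_call": ["w"], "f_in": ["w"],
    "t": ["f_in"], "f_out": ["t"], "end": ["q_call", "f_out"],
}
```

Without the deferred lookup in the non-local branch, the search for x would skip past the call site of h and reach the start of the program with no answer, which is the failure described for the naive lookup.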

Function Call Lookup
The lookup process described thus far still loses precision on function calls. Consider the program in Figure 6, the corresponding CFG in Figure 7, and a lookup for q from the end of the program with an empty context stack. On the first step, DDPA visits , changes the lookup subject to b and finds three possible paths to continue. The first is , which it discards, because it would mean skipping over a call site that contains information about the lookup subject. On the second path, it visits , , and finds the expected answer: . But, on the third path, it visits , , and finds an imprecise answer: . This third path does not make sense during runtime: It requires the call site to pass h as the value of f but then expects f to be g at . The call-return alignment discipline we described thus far is not enough to eliminate this path, because we reached the answer at by visiting only immediate and exit nodes, so we pushed call sites onto the context stack but never visited entrance nodes to try (and fail) to pop them.
The solution here is to start a subordinate lookup for the operator at the call site before visiting an exit node and only proceed if the result includes the pertinent function. This subprocess shows that our call-return alignment discipline aligns not only the entrance and exit nodes but also arguments, even those that are higher-order functions. In our example, before we visit we have to verify whether g can flow into f. This is not the case, because q is on the top of the context stack, and the call-return alignment discipline fails when trying to pop p at , so we discard this entire path. But the condition is satisfied for , part of the path leading to the more exact answer.

Recursion and Decidability
We now address how the analysis handles recursive functions, which induce cycles in the CFG. A lookup based on graph traversal as we have described it thus far would never end, following the cycle indefinitely. Consider the program in Figure 8, the corresponding CFG in Figure 9, and a lookup for the values of r by the end of the program with an empty context stack. DDPA starts by visiting and then . From there, it can repeatedly follow the cycle back around and never reach an answer, which is unsurprising as the runtime for this program also does not terminate. Moreover, the lookup process described thus far includes two stacks, the context stack and the lookup stack, and with them we could simulate the tape of a Turing Machine, so lookup with two unbounded stacks is very likely to be undecidable. We concretely prove an undecidability result in Section 5: We define ωDDPAc, a two-stack-unbounded analysis, and prove it is a complete interpreter for the call-by-value λ-calculus.
To make lookup decidable, we have to start by artificially bounding one of these stacks: If we do it for the lookup stack, then we impact the precision on non-local variables, record projection, and so forth; and a bounded context stack would impact context-sensitivity. We choose the latter. Any systematic finitization technique of the context stack would work, and in DDPA we use one of the simplest possible: We truncate the stack to a fixed maximum size of k, in the spirit of kCFA [44]. This characterizes kDDPA as a family of program analyses, parameterized over the choice of k. Other finitization techniques lead to different tradeoffs with respect to precision and performance, but they are orthogonal to the demand-driven aspects that we are exploring in this work.
In our example, kDDPA would follow the loop around a maximum of k times, at which point it would exhaust the context stack. When this happens, a subordinate lookup question will be identical to a lookup question already underway: "What are the possible values for a at with context stack [a, a, ...]?" In this case, the subordinate lookup immediately returns the empty set, because there are no other paths contributing to the answer.
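The effect of truncation plus the cut on repeated questions can be sketched generically in Python. The names `push`, `solve`, and `make_step` are hypothetical. `solve` returns the empty set for a question that is already underway on the current path and memoizes nothing else, so it is simple but may repeat work; it still finds every answer, because any answer reachable through a repeated question is also reachable without the repetition. The step function models a recursive function whose body either yields a base-case value or re-enters itself through call site a.

```python
def push(site, ctx, k):
    """Push a call site and keep only the top k frames (k=None: unbounded)."""
    return ((site,) + ctx)[:k]

def solve(question, step, on_path=frozenset()):
    """Union of all answers reachable from `question`.

    `step(question)` returns (answers, subquestions). A question already being
    answered further up the current path contributes nothing new, so the
    recursion stops there; with finitely many questions this always terminates."""
    if question in on_path:
        return set()
    answers, subquestions = step(question)
    result = set(answers)
    for sub in subquestions:
        result |= solve(sub, step, on_path | {question})
    return result

def make_step(k):
    """Hypothetical recursive body: a base-case value, or re-entry via call site "a"."""
    def step(question):
        node, ctx = question
        return {"base"}, [(node, push("a", ctx, k))]
    return step
```

With k = 2 the context stack saturates at ("a", "a"), the question ("body", ("a", "a")) repeats, and the search stops with {"base"}; with k=None the stack would grow forever and no question would ever repeat.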

Reachability
We described the lookup process in terms of a traversal of the CFG, but this is not how the algorithm is actually implemented. To realize the analysis in a way that promotes the reuse of previously computed answers, lookup is encoded in terms of pushdown system (PDS) reachability questions [6]. The answers to these questions, in turn, are lazily calculated with a novel construction called the PDR automaton. The next section describes these automata and gives an overview of how DDPA can be efficiently implemented.

IMPLEMENTATION OVERVIEW
We now give a high-level overview of how lookup in the previous section is efficiently implemented.

The Basic Analysis-Revisited
In the previous section, we introduced lookups as (reverse) traversals on the CFG, but the cycles induced by recursive functions could be traversed indefinitely, rendering this intuition unrealizable. DDPA's solution to this issue comes in three parts: (1) encode the CFG traversals in terms of a two-stack pushdown automaton; (2) derive a one-stack pushdown automaton that approximates it; and (3) interpret lookup queries as reachability questions on this automaton, a task known to be decidable [6].
To begin with, we construct a two-stack pushdown automaton (two-stack PDA) corresponding to the CFG traversals from Section 2: Nodes in this PDA correspond to nodes in the CFG, edges correspond to transitions allowed by following control flow in reverse, and edge annotations correspond to stack manipulations: thick arrows (⇓/⇑) are related to the context stack and thin arrows (↓/↑) to the lookup stack. Since we are searching backwards through the program, we encounter function returns before the calls, and so the context stack operations are pushing returns and popping calls. (This can take some getting used to, as it is the exact opposite of the program runtime stack operations.) The edge to the start state in the PDA represents the start of the traversal in the CFG by pushing the query subject onto the lookup stack, and immediate nodes are accepting states. The first notable characteristic of this PDA, which is also true of all other automata we will encounter in this article, is that its input alphabet is empty, because our purpose is not to recognize a string but to reason about the stack discipline. Automata of this kind are called pushdown systems (PDSs) [6].
As an example, consider the PDS in Figure 10, which illustrates lookups of n and m from the end of the given program in the CFG from Figure 3. The edge labeled ↓n on the upper-right corner starts the traversal looking for n from the end of the program, the node represents visiting that node in the CFG traversal, and the node is an accepting node. The main row on the top of the PDS in the figure corresponds to the complete traversal caused by looking up n from the end of the program, and the row below corresponds to the traversal caused by looking up m from the end of the program. The automaton permits sharing the nodes that are identical in both traversals, for example, . The paths from ↓n to and from ↓m to are realizable, but those from ↓n to and from ↓m to are not, because they make incompatible choices when popping from the context stack, preserving the call-return alignment described in Section 2.2.
Fig. 10. A two-stack PDS encoding traversals for lookups of m and n from the end of the program in Figure 3. Thick arrows represent pushes (⇓) and pops (⇑) to the context stack, and thin arrows represent pushes (↓) and pops (↑) to the lookup stack.
Fig. 11. A one-stack PDS that approximates the two-stack PDS by finitely abstracting the context stack and embedding it in the nodes.
It is well known that a two-stack PDA is equivalent in power to a Turing Machine, so we are clearly flirting with danger here, and concretely we show in Section 5 that ωDDPAc, the two-stack-unbounded analysis, is Turing-complete and thus undecidable. So we must convert the two-stack PDS into something strictly weaker to make the analysis computable. DDPA's approach is to finitely abstract the context stack configurations and embed them in the PDS nodes and, if multiple context stack configurations can occur at the same CFG node, create multiple versions of that node in the PDS to distinguish them. Any abstraction that bounds the context stack into a finite space suffices, and different choices have different effects on the analysis' capacity to align calls and returns. For simplicity, DDPA uses a very basic abstraction technique, similar in spirit to the contour treatment in kCFA [44]: Truncate the context stack beyond length k to create a family of increasingly precise kDDPA analyses. See Figure 11 for the one-stack PDS for our running example, in which the node is duplicated and differentiated with two context stacks, because we choose k = 1. As illustrated by , for example, node sharing between paths can still occur in this PDS, but it is rarer, because the traversals have to visit the common node with the same context stack configuration.
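The step from the two-stack PDS to the one-stack PDS can be sketched as a product construction. In this Python sketch the edge encoding and the name `one_stack_pds` are hypothetical: each CFG-level edge carries a context stack operation and an opaque lookup-stack label, the context stack is truncated to k frames and folded into the state, and a pop is kept only when the call site may be on top (an empty context stands for unknown context). Only states reachable from the start are built, and termination follows from the finiteness of the truncated contexts.

```python
from collections import deque

def one_stack_pds(cfg_edges, start, k):
    """Embed k-truncated context stacks into PDS states (sketch).

    cfg_edges: (src, ctx_op, label, dst), with ctx_op None, ("push", site), or
    ("pop", site); `label` is the lookup-stack action and is copied unchanged.
    Returns the reachable states (node, ctx) and the one-stack PDS edges."""
    out = {}
    for src, op, label, dst in cfg_edges:
        out.setdefault(src, []).append((op, label, dst))
    init = (start, ())
    states, edges = {init}, set()
    work = deque([init])
    while work:
        node, ctx = work.popleft()
        for op, label, dst in out.get(node, ()):
            if op is None:
                new_ctx = ctx
            elif op[0] == "push":
                new_ctx = ((op[1],) + ctx)[:k]   # push, then truncate to k frames
            elif not ctx or ctx[0] == op[1]:     # pop only if the site may be on top
                new_ctx = ctx[1:]
            else:
                continue                         # misaligned return: no edge
            target = (dst, new_ctx)
            edges.add(((node, ctx), label, target))
            if target not in states:
                states.add(target)
                work.append(target)
    return states, edges

# Reverse traversal shape of the running example: two call sites share one body.
cfg_edges = [
    ("end", ("push", "n"), "nop", "exit_n"),
    ("end", ("push", "m"), "nop", "exit_m"),
    ("exit_n", None, "nop", "body"),
    ("exit_m", None, "nop", "body"),
    ("body", ("pop", "n"), "nop", "entry_n"),
    ("body", ("pop", "m"), "nop", "entry_m"),
]
```

With k = 1 the shared body node is split into ("body", ("n",)) and ("body", ("m",)), mirroring the duplicated node in Figure 11, and the misaligned edge to entry_m disappears; with k = 0 there is a single body state and both entries remain reachable from it.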
Both the CFG and PDS constructions can be incremental, responding to the demands of the query, because the information discovered is never invalidated at later steps. Moreover, cycles in the CFG induce cycles in the PDSs, but their construction is finite and decidable, because both the CFG and the PDSs grow monotonically and in tandem, and no node or edge is ever added twice.
Finally, after constructing the PDS, we can use it to answer lookup questions by encoding them in terms of reachability questions of the form "which accepting states (immediate nodes) can we reach from this given start state with an empty lookup stack (the context stack embedded in the node might be non-empty)?" In our example, the accepting state is reachable when starting from ↓n, but is not.
In the literature, several different approaches have been used to refine program analyses via aligning calls and returns: context-free language (CFL) reachability [40], set constraint solving [25], PDA reachability [3], and logic programming specifications [42] can all be used. Unlike these analyses, the DDPA implementation does not aim for perfect alignment of calls and returns; it instead focuses on perfect access of non-local variables and uses the PDS stack for that purpose. The call stack is always a finite approximation in the nodes of the PDS and so can be viewed as having a stack-free, finite-automaton representation in DDPA. This follows the dominant stream of higher-order analyses, including kCFA and its many refinements, where k represents the fixed finite length of the call stack. As mentioned above, there are many other possibilities for abstraction besides the simple one used here; for example, in code with no recursive non-locals we could in principle swap the two, making the call stack unbounded and the non-locals stack of bounded size. Some higher-order forward analyses opt for perfect precision on the call stack but then lose precision on non-locals [19,24,51]. The notion of finding good tradeoffs between data and control precision is studied in Reference [55]. The above analysis aligns both local and non-local variables, and it achieves the full effect of kCFA-style polymorphism; this is apparent in the precision of the above example.
So while the analysis algorithm is closely related to the CFL-reachability analyses, the effect in the higher-order space is more closely related to polyvariance than stack alignment.
Finitization of the call stack is not an optimal solution: The number of finitized call stacks of length k is exponential in k, so the algorithm is exponential as k grows. Also, this approach will never be able to keep perfect call-return alignment for recursive programs for any k. Future work is to improve on this approximation method.

Pushdown Reachability Automata
kDDPA depends on solving pushdown reachability, for which the standard algorithm is to perform an edge closure [6]: continually replace adjacent matching push and pop edges with a single transitive no-op edge. For any valid path in the original automaton, after closure there will be a path consisting solely of no-op edges. Pushdown reachability is decidable in polynomial time [6], which is a pleasant property, and the proof-of-concept implementation of DDPA [36] used this algorithm to perform variable lookup. But, while straightforward, this approach can be slow in practice when the number of states and edges is large. Moreover, most of these states and edges do not affect the lookup result, so we define a novel PDR automaton, partly inspired by PDCFA [24], to accelerate reachability queries.
Observe that, over the course of kDDPA, the CFG grows monotonically. As a result, the variable lookup PDS (and its closure) exhibits similar monotonic growth. As long as we grow the PDS in tandem with the CFG throughout the course of the analysis, we can use the same data structure for all variable lookups and avoid duplicating PDA closure work. Moreover, we can identify structural similarities between edges and compress them to reduce the size of the automaton representation. These monotonic properties and this compression technique will be crucial in our implementation of a tractable analysis.
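The edge closure described above can be maintained incrementally, which is what the monotonic growth argument relies on. The following Python sketch (the class `PushdownClosure` and its methods are hypothetical) keeps the closure up to date under three rules: a push followed by a balanced path is still a push, a push followed by a matching pop is a balanced (no-op) path, and two balanced paths compose. It is the standard saturation idea, not the PDR construction itself.

```python
from collections import defaultdict, deque

class PushdownClosure:
    """Incremental edge closure for pushdown reachability (a sketch).

    Edges are (src, kind, dst, sym) with kind in {"push", "pop", "nop"}. The
    closure adds a "nop" edge a -> b whenever some path from a to b has a
    balanced stack effect. Edges may be added at any time; nothing is removed."""

    def __init__(self):
        self.edges = set()
        self.nop_out = defaultdict(set)  # a -> {b}: balanced path a ~> b
        self.nop_in = defaultdict(set)   # b -> {a}
        self.push_in = defaultdict(set)  # b -> {(a, sym)}: push sym at a, then balanced to b
        self.pop_out = defaultdict(set)  # a -> {(b, sym)}: pop sym at a to reach b

    def add_edge(self, src, kind, dst, sym=None):
        work = deque([(src, kind, dst, sym)])
        while work:
            edge = work.popleft()
            if edge in self.edges:
                continue
            self.edges.add(edge)
            a, kind, b, sym = edge
            if kind == "nop":
                self.nop_out[a].add(b)
                self.nop_in[b].add(a)
                work.extend((x, "nop", b, None) for x in self.nop_in[a])    # nop ; nop
                work.extend((a, "nop", c, None) for c in self.nop_out[b])   # nop ; nop
                work.extend((x, "push", b, g) for x, g in self.push_in[a])  # push ; nop
            elif kind == "push":
                self.push_in[b].add((a, sym))
                work.extend((a, "push", c, sym) for c in self.nop_out[b])   # push ; nop
                work.extend((a, "nop", c, None)                             # push ; pop
                            for c, g in self.pop_out[b] if g == sym)
            elif kind == "pop":
                self.pop_out[a].add((b, sym))
                work.extend((x, "nop", b, None)                             # push ; pop
                            for x, g in self.push_in[a] if g == sym)

    def reachable(self, src, dst):
        """Is there a path from src to dst that leaves the stack unchanged?"""
        return src == dst or dst in self.nop_out[src]
```

Because edges are only ever added, a positive reachability answer obtained early in the analysis remains true, and later edges can only extend the set of no-op edges.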
We demonstrate this different form of PDS using the code in Figure 12. Consider the task of looking up the values of x and y from the end of the program, using the same PDS each time. The PDS for this program appears in Figure 13, and the PDR appears in Figure 14. The only difference between these automata is that the PDR (Figure 14) […] instance, the edge labeled "↑↓¬y". The PDR treats this edge as a no-op as long as the top of the stack is any variable other than y. This means, for example, that the path is valid in this PDR, but the path is not. These so-called dynamic pop edges come in a variety of forms and represent PDS transition schemas; the single ↑↓¬z edge, for instance, replaces an appropriate pop/push edge pair for every variable in the program (excepting z): the pop immediately followed by a push of the same non-z subject amounts to a no-op.
In general, the PDS that embodies the above lookup algorithm contains patterns of states and edges that are dictated by the execution semantics of the language. We use this and other forms of dynamic pop edges to represent these patterns directly rather than encode them, leading to a much smaller automaton and therefore faster lookup process. In Section 8, we generalize this notion of dynamic pop edge and use them to efficiently implement a form of primitive computation for the more complex rules of the analysis.
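A dynamic pop edge can be compared with the static edges it stands for. In this hypothetical Python sketch (stacks are tuples with the top last), `dynamic_not` interprets one ↑↓¬z edge directly, while `expand_not` builds the pop/push pairs, with one fresh intermediate state per symbol, that a plain PDS would need; `run_static` follows such a pair from one state to another.

```python
def dynamic_not(stack, excluded):
    """Dynamic pop edge ↑↓¬excluded: a no-op when the top symbol differs from
    `excluded`; otherwise the edge cannot be taken (None). Stack top is last."""
    if stack and stack[-1] != excluded:
        return stack
    return None

def expand_not(src, dst, excluded, alphabet):
    """Static PDS edges simulating one ↑↓¬excluded edge: for every other symbol
    v, pop v into a fresh intermediate state, then push v back."""
    edges = []
    for v in sorted(alphabet - {excluded}):
        mid = ("mid", src, dst, v)          # hypothetical fresh state for symbol v
        edges.append((src, "pop", v, mid))
        edges.append((mid, "push", v, dst))
    return edges

def run_static(edges, src, dst, stack):
    """Stacks obtainable by following a pop edge and then a push edge from src to dst."""
    results = []
    for a, kind, v, mid in edges:
        if a == src and kind == "pop" and stack and stack[-1] == v:
            for m, kind2, w, b in edges:
                if m == mid and kind2 == "push" and b == dst:
                    results.append(stack[:-1] + (w,))
    return results
```

For a program with |V| variables, one dynamic edge replaces 2(|V| - 1) static edges and |V| - 1 intermediate states, which is where the size reduction comes from.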

THE ANALYSIS
In this section, we formalize the DDPA analysis. To simplify presentation, we restrict ourselves to an A-normalized [17] lambda calculus; we outline how additional language features can be introduced in Section 9. The operational semantics of the language is eager and standard, and we postpone it and the soundness proof to Section 6.
The grammar constructs needed for the analysis appear in Figure 15. Items on the left are just the hatted versions of the corresponding program syntax. Functions are the only data type in the simplified language. In the analysis, closure environments are subsumed by our treatment of nonlocal variables; function values are then represented in the abstract by their bodies alone. Recall from above that we require variables to be bound uniquely (so-called "alphatised" or "uniquized" variables). We also have the common requirement that analyzed expressions are closed: A variable is not used until after a clause in which it is bound.
Edges $\hat{g}$ in a control flow graph $\hat{G}$ are written $\hat{a} \ll \hat{a}'$ and mean that clause $\hat{a}$ happens right before clause $\hat{a}'$. New clause annotations on $\hat{c}$ mark the entry and exit points of functions. The Start node is a special node placed at the very start of the program, and similarly End is placed at the very end. These nodes are needed if any wirings are placed around the first or last program clause.

Lookup
As was described in Section 2, the analysis will search back along $\ll$ edges in the graph $\hat{G}$ to find the definitions of variables it needs. We now define this lookup function.

Context Stacks.
The definition of lookup proceeds with respect to a current context stack C. The context stack is used to align calls and returns to rule out cases of looking up a variable based on a nonsensical call stack and was described in Section 2.2.
The proof of decidability relies upon bounding the depth of the call stack. We first define a general call stack model for DDPA, and in Section 7 below we instantiate the general model with a fixed k-depth call stack version notated kDDPA; this is a simple bounding strategy, and our model can in principle work with other strategies.
(1) $\hat{\mathbb{C}}$ is a set. We use $\hat{C}$ to range over elements of $\hat{\mathbb{C}}$ and refer to such $\hat{C}$ as context stacks.
(2) $\epsilon \in \hat{\mathbb{C}}$.
Generally, the context stack is an approximation of the program's runtime call stack. The Push and Pop functions derive new context stacks upon calls and returns, and the MaybeTop predicate determines whether the top of the runtime call stack may be a call from site $\hat{c}$. Models err on the side of overapproximating MaybeTop for soundness. The distinguished context stack $\epsilon$ signifies a lack of any context information (and not an empty call stack): Popping from $\epsilon$ yields $\epsilon$, and $\mathrm{MaybeTop}(\hat{c}, \epsilon)$ is always true for any call site $\hat{c}$.
A natural family of context stacks is one that retains up to k top stack frames; to also admit the unbounded case, we let k range over $\mathrm{Nat} \cup \{\omega\}$, for $\omega$ the first limit ordinal. We let $[\hat{c}_1, \ldots, \hat{c}_n]_k$ denote $[\hat{c}_1, \ldots, \hat{c}_m]$ for $m = \min(k, n)$.
Definition 4.5. For every $k \in \mathrm{Nat} \cup \{\omega\}$, we define the context stack model $\Sigma_k$ to have $\hat{\mathbb{C}}$ contain the set of all lists of length up to k of clauses $\hat{c}$ occurring in the program. We define the remainder of $\Sigma_k$ as follows:
• $\mathrm{MaybeTop}(\hat{c}, [\hat{c}_1, \ldots, \hat{c}_n])$ is true if $\hat{c} = \hat{c}_1$ or if $n = 0$; it is false otherwise.
We use the term "kDDPA" to refer to DDPA with context stack model Σ k .
As above, note that ϵ = [] reflects a lack of knowledge of the call stack and not necessarily a lack of stack frames.
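The context stack model $\Sigma_k$ can be sketched in Python as follows. The text above fixes MaybeTop, the truncation $[\hat{c}_1, \ldots, \hat{c}_n]_k$, and the rule that popping $\epsilon$ yields $\epsilon$; defining Push as "push, then truncate to k" and Pop as "drop the top frame" is an assumption in the spirit of kCFA. The class name `KContextModel` is hypothetical, and k=None stands for ω.

```python
from typing import Optional, Tuple

Ctx = Tuple[str, ...]  # a context stack, top of stack first; () plays the role of ε

class KContextModel:
    """Sketch of Σ_k; k=None stands for ω (no truncation)."""

    def __init__(self, k: Optional[int]):
        self.k = k

    def truncate(self, ctx: Ctx) -> Ctx:
        # [c1, ..., cn]_k = [c1, ..., cm] with m = min(k, n); ctx[:None] keeps all
        return ctx[: self.k]

    def push(self, site: str, ctx: Ctx) -> Ctx:
        return self.truncate((site,) + ctx)   # assumption: push, then truncate

    def pop(self, ctx: Ctx) -> Ctx:
        return ctx[1:]                        # assumption: drop top; pop of () is ()

    def maybe_top(self, site: str, ctx: Ctx) -> bool:
        return not ctx or ctx[0] == site      # Definition 4.5: top matches, or n = 0
```

With k = 0 every context collapses to the empty stack and MaybeTop always holds, so call-return alignment is lost entirely; larger k trades more distinct contexts for more precise alignment.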

Lookup Stacks.
Lookup also proceeds with respect to a lookup stack $\hat{X}$. The topmost variable of this stack is the variable currently being looked up. The rest of the stack is used to remember non-local variable(s) we are in the process of looking up while searching for the lexically enclosing context where they were defined.
Unlike the context stack above, the lookup stack is unbounded: The process of looking up a non-local could trigger another non-local lookup of a non-lexically enclosing function, so there is no lexical upper bound on the depth of this stack in the general case. Though no finite bound exists on this stack's depth, every lookup stack is still finite in size.
Also unlike the context stack, there is no graceful way to approximate when lookup stack information is lost. So, we must preserve the whole stack in the analysis. Section 2.3 gave motivation and examples for non-local variable lookup.

Defining the Lookup Function.
Lookup finds the value of a variable starting from a given graph node. Given a control flow graph $\hat{G}$, we write $\hat{G}(\hat{X}, \hat{a}_0, \hat{C})$ to denote a lookup using stack $\hat{X}$ in $\hat{G}$ relative to graph node $\hat{a}_0$ with context $\hat{C}$. For instance, a lookup of variable $\hat{x}$ from program point $\hat{a}$ with unknown context would be written $\hat{G}([\hat{x}], \hat{a}, \epsilon)$. Note that this refers to looking for values of $\hat{x}$ upon reaching program point $\hat{a}$ but before that point is executed (much like the convention of interactive debuggers); we are looking for a definition of $\hat{x}$ in the predecessors of $\hat{a}$ but not within $\hat{a}$ itself.
Note this is a well-formed inductive definition by inspection. Each of the clauses above represents a different case in the reverse search for a variable. We now give clause-by-clause intuitions.
(1) We finally arrived at a definition of the variable $\hat{x}$, and so it must be in the result set.
(2) The variable $\hat{x}_1$ we are searching for has a function value and, unlike clause (1), there are more variables on the stack. This occurs because clause (5), described below, needed to look up the next variable, $\hat{x}_2$, in the place where $\hat{x}_1$ was defined (as in non-local lookup). Now that we have found $\hat{x}_1$, we remove it from the lookup stack and resume the lookup of $\hat{x}_2$.
(3) We have found a definition of $\hat{x}$, but it is defined to be another variable $\hat{x}'$. We transitively switch to looking for $\hat{x}'$.
(4) We have reached the start of the function body, and the variable $\hat{x}$ we are searching for was the formal argument. So continue by searching for the actual argument $\hat{x}'$ from the call site. The MaybeTop clause constrains this stack frame exit to align with the frame we had last entered (in reverse).
(5) We have reached the beginning of a function body and did not find a definition for the variable $\hat{x}$. In this case, we switch to searching for the clause that defined this function body, which leads us to push $\hat{x}_f$ onto the lookup stack. Once the defining point of $\hat{x}_f$ is found, we will pop it and resume looking for $\hat{x}$ (see clause (2)). The MaybeTop clause constrains the stack frame being exited to align with the frame we had last entered (in reverse).
(6) We have reached a return copy that is assigning our variable $\hat{x}$, so to look for $\hat{x}$ we need to continue by looking for the returned variable $\hat{x}'$ inside this function. Push $\hat{c}$ onto the context stack, since we are now entering the body (in reverse) via that call site. For a more accurate analysis, the "provided" line additionally requires that we only "walk back" into function(s) that could have reached this call site; so, we launch a subordinate lookup of $\hat{x}_f$ and constrain $\hat{a}_1$ accordingly.
(7) Here the examined clause is not a match, so the search continues at any predecessor node.
Note that this will chain past function call sites that did not return the variable $\hat{x}$ we are looking for. This is sound in a pure functional language; when we address state in Section 9.4, we will enter such a function to verify that an alias to our variable was not assigned.

Abstract Evaluation
We are now ready to present the single-step abstract evaluation relation that incrementally adds edges to the control flow graph. This system has some parallels with a graph-based notion of evaluation [27,53], but in our system function bodies are never copied: a single body is shared.

Active Nodes.
While evaluation is abstract and graph based, it shares some features with standard evaluation: There is an evaluation context [16] of the already-evaluated "expression" (here a graph), and we need to next evaluate the current "redex," which here we call the active node. In particular, only nodes with all previous nodes wired in can fire.
Definition 4.7. $\mathrm{Active?}(\hat{a}, \hat{G})$ iff a path $\mathrm{Start} \ll \hat{a}_1 \ll \cdots \ll \hat{a}_n \ll \hat{a}$ appears in $\hat{G}$ such that no $\hat{a}_i$ is of the form $\hat{x} = \hat{x}'\,\hat{x}''$. We write $\mathrm{Active?}(\hat{a})$ when $\hat{G}$ is understood from context.
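Definition 4.7 amounts to a graph search from Start that may stop at an application clause but never pass through one. A minimal Python sketch, assuming the graph is given as a successor map and the application clauses are known by name (both hypothetical encodings; `active_nodes` is not from the paper):

```python
from collections import deque

def active_nodes(succs, call_sites, start="Start"):
    """All nodes a with a path Start << a1 << ... << an << a in which no ai is an
    application clause. A call site can itself be active, but the search does
    not continue through it, since its body may not be wired in yet."""
    active = {start}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        for nxt in succs.get(node, ()):
            if nxt not in active:
                active.add(nxt)
                if nxt not in call_sites:   # only non-applications extend paths
                    frontier.append(nxt)
    return active
```

In the usage below, b only becomes active once a wired path through the function body bypasses the call site c.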

Wiring.
Recall from Section 2 how function application required the concrete function body to be "wired" directly in to the call site node and how additional nodes were added to copy in the argument and out the result. The following definition accomplishes this.
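Here $\hat{c}'$ is the call site, and $\hat{c}_1 \lessdot \cdots \lessdot \hat{c}_n$ is the function body together with the nodes that copy the argument in and the result out. The Preds and Succs functions reflect how we simply wire to the existing predecessor(s) and successor(s) of the call site.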
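The exact definition of Wire appears in the figure rather than here. The following Python sketch therefore assumes a simplified shape: the body is a chain of node labels, bracketed by an entry node (copying the argument in) and an exit node (copying the result out), and spliced between the call site's existing predecessors and successors. All names are hypothetical.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def wire(edges: Set[Edge], call_site: str, body: List[str],
         entry: str, exit_node: str) -> Set[Edge]:
    """Splice entry -> body... -> exit_node around `call_site`.

    `entry` stands for the node copying the argument into the parameter,
    and `exit_node` for the node copying the body's return variable to
    the call site's variable. The call site itself is left in place."""
    preds = {a for (a, b) in edges if b == call_site}
    succs = {b for (a, b) in edges if a == call_site}
    chain = [entry, *body, exit_node]
    new_edges = {(p, entry) for p in preds}          # Preds -> entry
    new_edges |= set(zip(chain, chain[1:]))          # entry -> body -> exit
    new_edges |= {(exit_node, s) for s in succs}     # exit -> Succs
    return edges | new_edges
```

Because edges form a set, wiring the same body again for another call site reuses its internal edges rather than copying the body, in line with the single shared body described above. Repeating an identical wiring leaves the graph unchanged.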
Next, we define the abstract small-step relation $\longrightarrow^1$ on graphs. With the above preliminaries, this is easy to define: for each reachable function application with a particular function and argument, we add wiring nodes to copy the argument into the function and copy its return value to the call site. We define the small-step relation $\longrightarrow^1$ to hold if a proof exists in the system in Figure 16. We write $\hat{G}_0 \longrightarrow^{*} \hat{G}_n$ iff $\hat{G}_0 \longrightarrow^1 \cdots \longrightarrow^1 \hat{G}_n$. The A-normalized λ-calculus requires only an application rule; languages with additional control flow constructs (such as the language of Section 9) will need additional rules.
The next sections show the formal properties of the above analysis. Section 5 proves undecidability with a non-finite context stack model. Section 6 demonstrates the soundness of DDPA with respect to a standard small-step operational semantics, while Section 7 proves DDPA to be decidable by reducing the lookup procedure to a PDS reachability problem.

A GRAPH-BASED OPERATIONAL SEMANTICS
In this and the following sections, we show the soundness of DDPA. We do so in a fashion common to higher-order program analyses [30,44,51]: We prove a standard operational semantics equivalent to a non-standard one and then show the analysis is a sound abstract interpretation [8] of the non-standard semantics. For clarity of presentation, our translation of the operational semantics moves through two intermediate systems as illustrated in Figure 17.
The operational semantics we start with is closure-based; we take this as ground truth as it is well known to be equivalent to other operational semantics for the call-by-value λ-calculus. The target non-standard operational semantics, which we term ωDDPAc, is a graph-based operational semantics that is nearly identical in form to DDPA as presented in Section 4 but is still a full and faithful interpreter for the CBV λ-calculus.
We believe ωDDPAc is a unique and interesting presentation of operational semantics in its own right that may independently have other applications. It uses no closures, substitution, fresh variables, or environment, and its reductions are all polynomially bounded in length, a very surprising feature for a Turing-complete interpreter. (This bound may initially seem paradoxical, but the reduction steps themselves are undecidable. We discuss this further in Section 5.5.) We dedicate this section of the article to demonstrating the equivalence of the ground-truth semantics and ωDDPAc; soundness of DDPA with respect to ωDDPAc is then shown in Section 6.

Closure-based Operational Semantics
We begin by defining an environment/stack/closure-based operational semantics for the λ-calculus. This is not far in spirit from a CEK machine [15]. The grammar of our language appears in Figure 18; this is simply the grammar from Figure 15 with hats removed (or the grammar from Figure 1 without records, pattern matching, or state). We additionally define the grammar in Figure 19 for use in our operational semantics: environments E are mappings from variables to closures, closures κ are pairs of functions and environments, and evaluation states ϕ are stacks of pairs, each consisting of an environment and the instructions still to be executed.
As in Section 2, we restrict all variable bindings throughout a particular program to be unique for convenience. We define RV(e) similarly to Definition 4.2 to return the last variable defined by an expression.
We define the closure-based small-step operational semantics as a relation $\phi \longrightarrow^1 \phi'$ as follows. The rules in Figure 20 are largely straightforward. Each rule acts upon the topmost element of the ϕ stack. The Definition rule, upon encountering an assignment of function $f$ to variable $x$, adds the binding $x \mapsto \langle f, E \rangle$ to the environment, where $E$ is a copy of the previous environment representing the non-local variables of $f$. The Alias rule is similar, copying the current value of $x_2$ from the environment into a new binding for $x_1$. The Call rule pushes a new frame onto the ϕ stack in which to execute the function's body; the Return rule pops a completed frame and updates the caller's environment with the function's return value, which, as in Definition 4.2, we take to be the last assignment in the function's body.
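Below is a minimal Python sketch of this closure-based machine for A-normalized programs. The encoding is our own: a clause is a `(variable, body)` pair whose body is a `Fun`, a `Var` (alias), or an `App`. The sketch follows the four rules described above, but it is not the formal system of Figure 20.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Var:            # alias clause body:     x1 = x2
    name: str


@dataclass(frozen=True)
class App:            # call clause body:      x = f a
    fn: str
    arg: str


@dataclass(frozen=True)
class Fun:            # function clause body:  x = fun param -> body
    param: str
    body: tuple       # tuple of (variable, Var | App | Fun) clauses


@dataclass
class Closure:        # a function paired with its defining environment
    fun: Fun
    env: Dict[str, "Closure"]


Clause = Tuple[str, Union[Var, App, Fun]]
Env = Dict[str, Closure]


def rv(expr: tuple) -> str:
    """RV(e): the last variable defined by an expression."""
    return expr[-1][0]


def run(program: tuple) -> Closure:
    # Each frame: (environment, remaining clauses, expression being run).
    stack: List[Tuple[Env, List[Clause], tuple]] = [({}, list(program), program)]
    while True:
        env, todo, expr = stack[-1]
        if not todo:
            result = env[rv(expr)]
            if len(stack) == 1:
                return result
            stack.pop()                                   # Return
            caller_env, caller_todo, _ = stack[-1]
            x, _ = caller_todo.pop(0)                     # finished call clause
            caller_env[x] = result
            continue
        x, body = todo[0]
        if isinstance(body, Fun):                         # Definition
            env[x] = Closure(body, dict(env))
            todo.pop(0)
        elif isinstance(body, Var):                       # Alias
            env[x] = env[body.name]
            todo.pop(0)
        else:                                             # Call
            clo = env[body.fn]
            callee_env = dict(clo.env)
            callee_env[clo.fun.param] = env[body.arg]
            stack.append((callee_env, list(clo.fun.body), clo.fun.body))
```

The call clause stays at the head of the caller's instruction list until the callee's frame is popped, which is when its variable is bound. A divergent program simply loops forever.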
Note that, throughout evaluation, the only variables added to an environment E on the stack are those from the function's closure and from the function body. Due to the unique variable requirement of expressions, no variable can be defined in both; as a result, each binding $x \mapsto \kappa$ within an environment maps a distinct variable.

A Stackless Operational Semantics
We now begin taking the steps toward ωDDPAc as outlined in Figure 17. Each step toward ωDDPAc makes some aspect of the system more demand driven (while still maintaining call-by-value evaluation semantics). In this first step, we define an operational semantics that stores binding information in a flat historical log rather than in a stack. In addition to bindings, this log stores events in which stack frames are pushed and popped, so it can fully replace the environments E of the previous system.
The grammar for the stackless system appears in Figure 21. An evaluation state is a pairing between a log L and a stackless expression I; note that every e is of form I. In addition to clauses, stackless expressions may include annotated assignments indicating when functions return. When a function is called, a start-of-call marker is added to the log to record the event; when a function returns, this is recorded with an end-of-call marker. As a result, the log L stores all bindings made throughout the execution of the program (even those that are no longer in scope); we then define a function to extract an environment from a provided log that skips over any variables in functions that have already returned.
This function traces the log backwards, building the environment E by finding each binding that occurred during the call to this function or one of its calling ancestors. The second argument is a number counting intermediate function calls to ensure that non-closure-captured bindings from within previously called functions are not included in the resulting environment.
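The counter-based backward traversal can be sketched as follows. The sketch assumes a hypothetical log made of `("bind", x, v)`, `("call",)`, and `("ret",)` entries. Following the later comparison with Definition 5.6, it stops at the start-of-call marker of the current call and relies on the closure having been copied into the log just after that marker. It illustrates the counting idea only and is not Definition 5.2 itself.

```python
from typing import Dict, List, Tuple

# Hypothetical log entries:
#   ("bind", x, v)  a binding x -> v was made
#   ("call",)       start-of-call marker
#   ("ret",)        end-of-call marker
Entry = Tuple[object, ...]


def extract_env(log: List[Entry]) -> Dict[str, object]:
    """Walk the log backward, keeping bindings of the current call and
    skipping bindings made inside calls that have already returned."""
    env: Dict[str, object] = {}
    depth = 0  # completed calls we are currently inside (walking backward)
    for entry in reversed(log):
        if entry[0] == "ret":
            depth += 1                 # entering a completed call, backward
        elif entry[0] == "call":
            if depth == 0:
                break                  # start of the current call: stop
            depth -= 1                 # leaving that completed call
        elif depth == 0:
            _, x, v = entry
            env.setdefault(x, v)       # first seen backward = most recent
    return env
```

In the log `bind a, call, bind p, bind r, ret, bind out`, the bindings `p` and `r` belong to a call that has already returned. They are skipped, leaving `a` and `out`.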
With this extraction function, we can define the stackless operational semantics. We overload the $\longrightarrow^1$ operator as follows: Definition 5.3. $L; I \longrightarrow^1 L'; I'$ holds if a proof exists in the system of Figure 22. We write $L_0; I_0 \longrightarrow^{*} L_n; I_n$ iff $L_0; I_0 \longrightarrow^1 \cdots \longrightarrow^1 L_n; I_n$.

Proof of Equivalence.
We now demonstrate the equivalence of the above operational semantics to the one in the previous section. This is accomplished by establishing a bisimulation between the original program states ϕ and the stackless program states L; I . We formalize this bisimulation as follows: Proof. By case analysis on the rule used. In particular, each rule in Figure 20 aligns with each rule in Figure 22 such that the premises of a rule can be proven by the premises of its counterpart and the properties of the bisimulation.
In the above proof, the only non-trivial steps are those involving the Call and Return rules. First, the systems differ in how they represent a call in progress (an existing call site on the stack in the original system and an annotation in the stackless system). Second, we must be able to demonstrate that ExtractEnv correctly describes the environment E both when a new function is called and when a running function returns. The latter relies upon the start-of-call and end-of-call markers appearing in the binding log; that these markers are present is a consequence of the inductive property of the bisimulation described above.

An Operational Semantics with Lazy Lookup
Our next step toward ωDDPAc is to define an operational semantics that looks up the value of variables on demand rather than eagerly constructing bindings. This lazy lookup operation is starting to get close to DDPA's lookup operation (Definition 4.6): like DDPA, it traces backward through the program to reconstruct bindings as needed, including reconstruction of the context of a function's closure when needed.
We define the grammar we require for this system in Figure 23. Unlike the previous systems, our program state for this operational semantics will simply be an expression W with no explicit environment. Although we provide a grammar of environments Z , this is primarily used to describe the first unevaluated call site of the expression W .
Lazy lookup is defined in function Z (X , n) as follows. We use X as a stack of variables in a fashion similar to DDPA's Definition 4.6. The integer n serves a purpose similar to Definition 5.2: to skip over bindings no longer in scope.
Definition 5.6. For a given environment Z , we define the lookup function Z (X , n) as follows: We write Z (X ) to mean Z (X , 0).
As with the environment extraction function in Definition 5.2, the integer here is used to disregard variables that were bound in calls that have since completed. One key difference between these definitions is that Definition 5.2 stops its work upon reaching a start-of-call marker with n = 0 (which indicates that we have left local scope), whereas Definition 5.6 continues past such a marker with n = 0. In the eager stackless system of Section 5.2, we copied the closure of each function into place immediately after the start-of-call marker. In this lazy system, we do not; instead, for non-local variables captured in a closure, clause 5 first identifies the point in time at which the closure was captured and then continues lookup from that point.
This lookup definition is similar to Definition 4.6 of DDPA. The most notable differences are the absence of a context parameter and the presence of the n parameter. The former is not necessary here, as context can be established from a traversal of Z. The latter is required here but not in the analysis, because the program point immediately following a function call in DDPA has at least two predecessors (the call's wiring nodes and the call node itself), while the list Z may only have one. The next section defines an operational semantics to bridge this gap.
Given the above lazy lookup function, we can define an operational semantics as follows: Definition 5.7. $W \longrightarrow^1 W'$ holds if a proof exists in the system of Figure 24. We write $W_0 \longrightarrow^{*} W_n$ iff $W_0 \longrightarrow^1 \cdots \longrightarrow^1 W_n$. Note that Figure 24 contains only one evaluation rule, lining up closely with DDPA (Figure 16) and considerably simplifying the four rules of the previous systems. The previous Definition and Alias rules are obsolete here due to the lazy manner in which lookup occurs and the fact that we no longer construct explicit closures. The Call and Return rules have been merged into a single Application rule; this is also possible due to lazy lookup, as we no longer need to process the exit annotation when the function returns.

Proof of Equivalence.
We now demonstrate that the operational semantics just defined is equivalent to the stackless semantics defined in Section 5.2. As in Section 5.2.1, we demonstrate this via a bisimulation between states of the two systems. This bisimulation is somewhat more subtle, however, as we must first align the eager and lazy environments and then align evaluation states.
To describe the relationship between the systems' environments, we overload the notation to describe an alignment between eager environments E (which are generated in the stackless system by ExtractEnv) and pairs of lazy lookup environment Z and variable stack X . It is necessary but not sufficient to require each binding in E to match the results of lookup on Z and vice versa; to correctly handle higher-order functions, we must also ensure that closures are correctly represented. The variable stack X in this bisimulation describes the sequence of lookups necessary to reach the point where a particular closure is defined. We thus write this bisimulation as follows: Definition 5.8. We write E Z ; X to mean: That is, an eager function lookup aligns with a lazy function lookup if the alignment property applies recursively to the eager function's closure. The additional x is used to continue to describe a path through Z to the point at which the function's closure is defined. This definition is well founded, because the first defined function will always have an empty closure, making the bisimulation property for that function trivial.
Given a means by which environments can be aligned, we can then define the bisimulation between the stackless and lazy lookup systems: Definition 5.9. We write L; I W to mean: • W = Z || W for the largest possible Z ; that is, the first element of W is the first application in W (or W is empty). • I = W ; that is, the list of unperformed work is the same.  This lemma displays an asymmetry that hints at a difference between the two systems: Intuitively, the stackless system takes smaller steps than the lazy lookup system. The only step in the lazy system is application; the definition clauses that appear as a result, for instance, are implicitly processed by lazy lookup later and upon demand. In essence, the eager system may need to take many steps to "catch up" to the lazy system's state. Likewise, a single step in the eager system may not align directly with the lazy system, but the eager system will eventually catch up to the single step taken by the lazy system by processing any definitions, aliases, and so on, which the lazy system deferred.
The initial bisimulation is not immediate but is relatively easy to prove using the above reasoning: The starting expression may have to take a few steps to catch up to the initial state of the lazy system, but a bisimulation is provable upon reaching the first application (or the end of the program). In this lemma note that all e are of form W . The equivalence of the stackless and lazy systems then follows directly by induction on computation length using the above two lemmas.

ωDDPAc: A Graph-Based Operational Semantics
We now present ωDDPAc, our final operational semantics, and prove it equivalent to the lazy lookup system just defined. ωDDPAc is a graph-based operational semantics that represents expressions as concrete (runtime) control flow graphs rather than lists with a fixed point of execution. It only differs from the analysis of Section 4 in two ways: The calling contexts are fixed to be the full call stack without approximation (the ω in the name), and the wiring rule is refined compared to DDPA to take the current context into account (the additional "c" at the end of ωDDPAc).
We define the grammar of ωDDPAc in Figure 25. This grammar is structurally very similar to that of the analysis; the only difference is that contexts are concrete lists C rather than elements of a general context stack model Σ. We use notation similar to Definition 4.1 for these graphs; for instance, we sometimes write $a_1 < a_0$ to mean $(a_1 \lessdot a_0) \in G$ for some G understood from context.
In comparison to the lazy lookup system of Section 5.3, the key difference is that, rather than representing execution as a list W, this system uses a graph G. The nodes of the graph are individual program clauses and the special nodes Start and End, representing the start and end of the overall program. Edges in the graph represent control flow that occurs at least once during execution. Intuitively, if we evaluate the same program in each system in lock-step and pick a moment during evaluation, then the lazy lookup evaluation state W will correspond to a path in the ωDDPAc state G.
A consequence of this model is that, for any program, the set of control flow decisions made (here, graph edges) is finite and grows monotonically. (We discuss this further in Section 5.5.) The number of program states (here, paths in the graph), however, need not be finite. A loop, for instance, manifests as a repeated pattern in W but as a cycle in G. As a result, a particular node in the graph is insufficient to describe the current state of execution: we must also be able to identify how we reached that point. This is achieved by the use of a stack of clauses C that, for the simple language in this section, corresponds to the runtime call stack of the program.
To define ωDDPAc, we require a lazy lookup function similar to that of the system of Section 5.3. Here, the graph lets us skip over out-of-scope bindings without an integer counter, by relying on a wiring process similar to the Wire function of Section 4. This leads to a definition in near-perfect alignment with the lookup function of Section 4.1: Definition 5.12. Given control flow graph G, $G(X, a_0, C)$ is the function returning the least set of values V satisfying the following conditions given some $a_1 < a_0$:
Given the above lookup function, we define ωDDPAc in the same fashion as CFG construction in the analysis: an incremental construction of the graph structure based upon the values discovered by demand-driven lookup. See Figure 26 for the (sole) rule, in analogy with Figure 16 of the analysis. Comparing these two figures, there is one non-trivial change that makes ωDDPAc slightly more refined (and is the source of the appended "c" in the name): variables f and v are looked up relative to the current context C, whereas DDPA's wiring rule drops this context information for simplicity.
The definition of Active is also different in ωDDPAc, since this context is needed: It returns the set of possible contexts that could be active at this point in the graph. In Section 4's definition of DDPA, the context is not needed and the corresponding Active? function is a predicate rather than a function onto sets of contexts. The precise definition of Active is as follows.
Definition 5.13. Let Active(G, a) be the least set C conforming to the following conditions: Note that C may not be finite. We define Wire in analogy with $\widehat{\mathit{Wire}}$ (Definition 4.8) simply by removing the hats from each term. We then define ωDDPAc itself as follows:

Proof of Equivalence.
We now show that ωDDPAc as defined in Figure 28 is equivalent to the lazy lookup system of Section 5.3. This is the final operational semantics equivalence proof: It allows us to connect the standard operational semantics to this graph-based form.
Considering Alignment. Alignment between the lazy lookup system and ωDDPAc is somewhat more involved than the previous alignments. Non-recursive programs always evaluate in lock-step between the two systems (for each list substitution in the lazy lookup system, the same nodes and edges are added in ωDDPAc), but in recursive programs the ωDDPAc semantics may "get ahead" of the lazy lookup semantics.
For instance, recall the ω-combinator example from Section 2.5 that we reproduce here in Figure 27. The lazy lookup system will continuously grow a list W by substituting the call site a = x x (underlined in the figure) with itself surrounded by wiring annotations. In ωDDPAc, however, the graph stops growing after the Application rule acts once on each call site. The cycles formed by that rule effectively describe all such lists W , since those lists will grow according to a pattern in this divergent case.
This does not mean that ωDDPAc is sub-Turing or that we claim to predict halting (via whether the End node is Active). While the size of the concrete control flow graph for any program is finite, computing the next graph edge to be added is a recursively enumerable but not recursive property: The Application rule must consider unboundedly many contexts, any one of which may lead to additional control flow edges.
Non-divergent recursive programs exhibit behavior similar to the above. In a program that recurses a fixed number of times (e.g., via a counter), the count-down step would be processed exactly once by the graph-based semantics. In such cases, a finite number of steps in the lazy lookup system is sufficient to catch up to the graph-based system, at which point the systems continue operating in step.
Historical Evaluation. Defining the bisimulation between these two systems is not straightforward, because ωDDPAc leaves application nodes in the graph. This means there are nodes in G that do not appear in W. At the same time, the bisimulation must be two-directional: we cannot allow arbitrary extra content in G. That is, we must allow "good junk" (old call sites) without allowing arbitrary "bad junk." We address this by defining a form of historical evaluation for the lazy lookup system. This system mimics the lazy lookup system but retains every step of evaluation and so implicitly retains the original application nodes. We use W to represent a list of W. We formally define that system as follows: The lazy lookup system and the historical lazy lookup system are trivially equivalent: we can use the historical information to drive our bisimulation.
Bisimulation and Equivalence. Define bisimulation between the historical system and ωDDPAc as follows: Definition 5.17. We write W G to mean the following: As in the previous proofs, this proof relies upon a bisimulation preservation lemma: Much like Lemma 5.10, the inductive case between the stackless and lazy systems, this lemma is asymmetric, because ωDDPAc takes larger steps than the historical system. In the case of a terminating recursive function call, for instance, the historical system may need to take many steps to catch up to a single wiring in ωDDPAc. Nonetheless, this number of steps is always finite. In cases of divergence, the historical system never catches up but the graph is unable to grow any further.

Overall Equivalence
The previous equivalences can be combined to produce the desired overall equivalence: the closure-based λ-calculus of Section 5.1 is equivalent to ωDDPAc. For notational convenience, we write ↓ to indicate a sequence of $\longrightarrow^1$ steps (in all overloadings) that cannot make further progress; for instance, $G \downarrow G'$ iff $G \longrightarrow^{*} G'$ and no $G''$ exists such that $G' \longrightarrow^1 G''$. We then phrase equivalence as follows: This equivalence is key to our soundness proof, which we present in the next section. Before proceeding, however, we take a moment to consider some unusual features of ωDDPAc.
Reflecting on ωDDPAc. Every operational semantics for the λ-calculus we are aware of relies upon one or more of the following mechanisms:
• Substitution (as in classic presentations of λ-calculus)
• An environment and closures (as in the CEK machine [15])
• Variable freshening (as in our previous work [36])
None of these three mechanisms appears in the definition of ωDDPAc, yet Theorem 5.19 shows it is a full and faithful implementation of the CBV λ-calculus. This shows that the DDPA analysis, which is clearly very close in spirit to ωDDPAc, emerges from a fundamentally different operational basis. Optimal λ-reduction [27] is perhaps the closest to ωDDPAc of the existing λ-semantics; it is a substitution-based model but maximally shares syntax subtrees in a graph structure. The sharing graphs in that semantics grow unboundedly, unlike the ωDDPAc graph.
Section 5.4 pointed out that the graph constructed by ωDDPAc is finite and monotonically increasing. From this, we can demonstrate a polynomial bound on ωDDPAc execution steps: This lemma may at first appear to be a blatant contradiction: We have a full and faithful interpreter for the λ-calculus, but it is tightly bounded in the number of steps it can take. On further inspection, we have only shifted the work-the global state space is small, but searching through all the possible contexts is a potentially unbounded search.
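One way to see where a quadratic bound can come from is sketched below. It assumes (our reading, not a restatement of the proof of the lemma) that every productive step wires a (call site, function) pair not wired before, and that a program e of size n contains at most n call sites and at most n functions. Re-wiring a pair already wired would add no new nodes or edges, so it cannot count as progress.

$$
\#\mathrm{steps}(e) \;\le\; \bigl|\{(c, f) \mid c \text{ a call site in } e,\ f \text{ a function in } e\}\bigr| \;\le\; n \cdot n \;=\; O(n^2)
$$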
To understand this, it may also help to informally consider an alternative presentation of ωDDPAc in which the global state includes not just G but also the current stack context, that is, $\langle G, C \rangle$. Here there is no need to search for a viable stack C, since it has been explicitly recorded, but the stack can grow arbitrarily, so the overall state space is unbounded, unlike that of ωDDPAc.
In summary, any program evaluated using the ωDDPAc operational semantics requires at most $O(n^2)$ steps, but the operational semantics relation must be undecidable by Turing-completeness of the ωDDPAc interpreter, and even the single-step relation of ωDDPAc has to be undecidable.
(1) $\{\mathit{Start} \lessdot e \lessdot \mathit{End}\} \downarrow G$ is undecidable.
Proof. For (1), this follows directly from Theorem 5.19 and the Turing completeness, and thus undecidability, of the call-by-value λ-calculus computation relation that computes to a value. For (2), suppose lookup were decidable. Since there are only finitely many states G possible for any program by Lemma 5.20, there are also finitely many non-repeating sequences $G_1, \ldots, G_n$. If lookup were decidable, then it would be possible to enumerate all these sequences and, for each one, verify whether each of its steps is legal using lookup and the single-step rule of Figure 26 (which would itself trivially be decidable if lookup were decidable). But then it would also be possible to decide (1), as we could consider all sequences with $G_1 = \{\mathit{Start} \lessdot e \lessdot \mathit{End}\}$ and $G_n = G$ and check whether any of them constitutes a valid n-step computation, a contradiction.
The following section demonstrates how DDPA soundly approximates this operational semantics.

SOUNDNESS
We now show the soundness of the analysis with respect to ωDDPAc and therefore, by Theorem 5.19, with respect to the λ-calculus of Section 5.1. Comparing DDPA with ωDDPAc, there are two key differences, as outlined above:
• Call stacks C in ωDDPAc correspond to context stacks $\hat{C}$ in DDPA.
• The Active function returns a set of possible call stacks, while the Active? function is a predicate.
In brief, the soundness proof proceeds by showing that DDPA is an abstract interpreter for ωDDPAc. The abstraction function α is lossless on all components of the language grammar, mapping each term to its hatted counterpart (e.g., $x$ maps to $\hat{x}$, $f$ maps to $\hat{f}$, etc.). Call stacks $[c_1, \ldots, c_n]$ are mapped to context stacks $\mathit{Push}(\alpha(c_n), \ldots \mathit{Push}(\alpha(c_1), \epsilon) \ldots)$, and Active? holds when Active returns a non-empty set. No other information is lost. We now expand this outline.
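To make the abstraction of call stacks concrete, the following Python sketch assumes a k-CFA-style context stack model (keep only the k most recent call sites, as in the model used in Section 7). It represents α(c) by the call site's own name. `push`, `pop`, and `alpha` are hypothetical names for illustration.

```python
from typing import Sequence, Tuple

Context = Tuple[str, ...]  # most recent call site first


def push(k: int, call_site: str, ctx: Context) -> Context:
    """k-limited Push: add the call site and keep only the k most recent."""
    return ((call_site,) + ctx)[:k]


def pop(ctx: Context) -> Context:
    """Pop the most recent call site; popping the empty context yields it."""
    return ctx[1:]


def alpha(k: int, call_stack: Sequence[str]) -> Context:
    """Map [c1, ..., cn] (cn most recent) to Push(cn, ... Push(c1, eps)).
    Here alpha(c) is taken to be c itself."""
    ctx: Context = ()
    for c in call_stack:  # c1 is pushed first, so it ends up deepest
        ctx = push(k, c, ctx)
    return ctx
```

With k = 2, the call stack `[c1, c2, c3]` abstracts to `(c3, c2)`. With k = 0, every call stack collapses to ϵ, the context-insensitive case.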

DDPAc
Our first step is to define an analysis DDPAc and then to weaken that analysis to DDPA (in accordance with Figure 17). The DDPAc analysis is a midpoint between ωDDPAc and DDPA; in fact, DDPAc can be seen as a generalization of the ωDDPAc defined in Section 5.4. DDPAc uses the context stack models and lookup function from Section 4.1 but relies upon a function Active (the abstract version of Definition 5.13) to produce a set of valid contexts for each lookup. For each Σ, we define that abstract function as follows: Definition 6.1. Let $\mathit{Active}(\hat{G}, \hat{a})$ be the least set $\hat{C}$ conforming to the following conditions:
• If $\hat{c} < \hat{a}$, then $\mathit{Active}(\hat{G}, \hat{c}) \subseteq \hat{C}$.
Note thatĈ may not be finite.
In the interest of showing DDPAc to be an analysis in its own right, we observe that Active can be computed for finite context stack models Σ. One simple algorithm is to store a set of context stacks for each node inĜ. Initially, Start has the only non-empty set (containing ϵ); the result of Active is calculated by propagating the context stacks throughout the graph to saturation.
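The propagation algorithm just described can be sketched in Python for a k-limited model. The node tags and the rule for exit nodes are assumptions made for illustration. Entering a wiring entry node pushes its call site. Passing an exit node pops it, but only when the top names that call site or the context is empty (in the spirit of MaybeTop). Call-site nodes receive contexts but are not propagated through, as with Active?.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Context = Tuple[str, ...]  # most recent call site first


def active_contexts(succs: Dict[str, List[str]],
                    tags: Dict[str, Tuple[str, ...]],
                    k: int) -> Dict[str, Set[Context]]:
    """Propagate k-limited context stacks from Start to saturation.

    Assumed node tags (untagged nodes are plain clauses):
      ("call",)      a call site: reachable, but not propagated through
      ("enter", c)   entry wiring node for call site c: push c
      ("exit", c)    exit wiring node for call site c: pop c
    """
    contexts: Dict[str, Set[Context]] = defaultdict(set)
    contexts["Start"].add(())
    work: List[Tuple[str, Context]] = [("Start", ())]
    while work:
        node, ctx = work.pop()
        if tags.get(node, ("plain",))[0] == "call":
            continue
        for nxt in succs.get(node, []):
            tag = tags.get(nxt, ("plain",))
            if tag[0] == "enter":
                new = ((tag[1],) + ctx)[:k]
            elif tag[0] == "exit":
                if ctx and ctx[0] != tag[1]:
                    continue  # top of stack names a different call site
                new = ctx[1:]
            else:
                new = ctx
            if new not in contexts[nxt]:
                contexts[nxt].add(new)
                work.append((nxt, new))
    return contexts
```

Because k-limited contexts over a finite program form a finite set, the worklist empties and the result is a saturated assignment. A shared function body reached from two call sites receives both contexts.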
DDPAc can reuse most other definitions from DDPA, including lookup. Only the ωDDPAc-inspired evaluation rule needs to be incorporated to complete the definition. If we choose Σ to be $\Sigma_\omega$, then this system is identical to ωDDPAc. In that case, ϵ corresponds to the concrete call stack []. This correspondence does not hold for call stacks in general (Pop(ϵ) = ϵ, whereas popping [] does not yield []), but it holds in DDPAc due to the definition of Active. Soundness of DDPAc can be demonstrated by abstract interpretation. The abstraction function α here is the identity except on call stacks: each call stack $[c_1, \ldots, c_n]$ is mapped to the context stack $\mathit{Push}(\alpha(c_n), \ldots \mathit{Push}(\alpha(c_1), \epsilon) \ldots)$. We first establish soundness of lookup: Proof. By induction on the size of the proof of $G([x] \,\|\, X, a, C)$ and case analysis on the rule used at each proof step.
We then state soundness of DDPAc as follows: Proof. By matching each premise of the concrete Application rule to its abstract counterpart. In particular, lookup is monotonic in the size of the graph.

DDPA
Finally, we demonstrate that DDPAc is conservatively approximated by DDPA from Section 4. The only difference between these two analyses is in the handling of active nodes: DDPAc calculates possible contexts using Active, whereas DDPA simply determines whether any contexts exist using Active?. This weakening is motivated by performance, as computing all possible contexts for a given program point is generally expensive. While DDPAc is more precise than DDPA in this respect, this increased precision is rarely necessary.
We show DDPA approximates DDPAc by observing the differences in the lookups of Figures 16 and 29. In DDPA, the context provided to lookup is always ϵ; no other context stack is ever provided to lookup. However, Definition 4.4 requires that Pop(ϵ) = ϵ. With respect to lookup, this imposes a form of subsumption on finite context stack models, with ϵ at the top of the lattice. Formally, we state this property as follows: As a consequence, each lookup and CFG construction step in DDPAc is approximated by DDPA. We formalize soundness between these systems as follows: Proof. By Lemma 6.5 and matching each premise of the rule in Figure 29 to its counterpart in Figure 16.
We now can assert the soundness of DDPA as follows: Proof. By Lemmas 6.1 and 6.6 and the monotonicity of Definition 4.6.

DECIDABILITY
We now show the analysis defined in Section 4 is decidable. We in fact show a stronger result: for a kCFA-like stack model (see Definition 4.4) that retains the k most recent call sites for a fixed k, we show that the control flow graph can be constructed in polynomial time.
Much of the decidability argument is immediately evident: functions like Active? are computable by inspection and, for any particular program and finite Σ, CFG size can be bounded by simple counting arguments. The core of the proof is showing the decidability of variable lookup, Definition 4.6. So we first show that lookup is decidable, and then formally prove the above decidability property for DDPA.

Decidable Lookup
In Section 3, we informally outlined how we can implement lookup in terms of a pushdown reachability question. Recall from that discussion that a PDS state is created for each (program point, variable lookup) question, with the (approximate) context stack also included in the state. PDS transitions then represent recursive calls to lookup. As written, however, Definition 4.6 is not directly implementable as a pushdown reachability problem. The particular catch is clause 6, which invokes lookup twice. If this were modeled as two edges in the PDS, then either path alone could succeed, but clause 6 requires a conjunction of two lookups, and that is not directly encodable.
The solution for two invocations of lookup is not all that difficult: the PDS performs the first lookup directly, and a frame is pushed onto the PDS stack to trigger the second invocation afterward. To formalize this idea, we first develop an equivalent version of lookup that replaces clause 6 with clauses that each invoke lookup only once, pushing the remaining lookup onto the stack; various other clauses are also needed to implement these continuations. We then demonstrate that the two lookup functions are equivalent. Since the second definition of lookup is directly implementable as a PDS, this shows lookup is computable, as pushdown reachability is computable in time polynomial in the size of the PDS [6].
Our alternate definition of lookup requires generalizing the lookup stack to a continuation stack, which keeps track of other lookups that must eventually be performed and of how to combine the results of multiple lookups; the stack grammar appears in Figure 30. This continuation stack maps directly to the stack of the pushdown system. We present the definition of the lookup function and discuss the role of these continuations below.
This definition is very similar to the original lookup of Definition 4.6, but one key difference is critical for decidability: clause 6 of the original Definition 4.6 is here divided into clauses 6a and 6b. This solves the problem, alluded to above, of the two invocations of lookup in that clause: each clause must invoke lookup only once to be encodable as a PDS, so the two lookups in that clause are now expressed as two different clauses, with continuation frames pushed onto the stack to connect them. In this alternative formulation, lookup of a value that was returned from a function call proceeds as follows: (1) Clause 6a reacts to a function exit wiring node by pushing [x̂_f, Capture(2), Jump(â_0, Ĉ), RealFlow?] onto the stack. (While this might look like gibberish now, let us proceed with how the lookup function works; the purpose of these new "continuation frames" will be made clear.)
Proof. By induction on the size of the proof of Ĝ(X, â, Ĉ) and then by case analysis on the clause used from Definition 4.6.

Proving Lookup Is Decidable via PDS Reachability.
Now that we have removed multiple invocations of lookup from the original clause 6, each lookup clause examines a finite prefix of K̂, applies some number of computable restrictions (such as limiting the form of â_1 or checking the predicate MaybeTop), and then specifies a lower bound on V̂. Given this normal form, it is possible to directly encode lookup as a PDS reachability problem. We let each pair of a program point â in Ĝ and a context Ĉ from the (finite) set of contexts define a state in the PDS. For every â_1 ≪ â_0, each clause of Definition 7.1 dictates a set of transitions in the PDS. For example, for any Ĉ, clause 3 transitions from ⟨â_0, Ĉ⟩ to ⟨â_1, Ĉ⟩, popping x̂ and pushing x̂′. Accepting states are dictated by clause 1. Given this encoding, each value v̂ in Ĝ(K̂, â, Ĉ) corresponds to a node â′ = (x̂ = v̂) in the set of nodes reachable in the PDS from initial state ⟨â, Ĉ⟩ with initial stack K̂. This leads us to the following lemma:
Proof. By reduction to the problem of reachability in a pushdown system accepting by empty stack. Pushdown reachability is computable in time polynomial in the size of the automaton [6, 11], so it suffices to bound the number of states and transitions polynomially in the product of the sizes of Ĝ and of the set of contexts. States are bounded by this product by definition. Transitions are polynomially bounded in this product, because the number of stack elements k̂ is bounded by this product and each clause of Definition 7.1 pushes and pops a constant number of stack elements.
The proof strategy in the rest of this section can generally be applied to any finite context stack model, but we primarily concern ourselves with the kDDPA analyses described in Section 4.1.1. For that reason, it is helpful to simplify the statement of the lookup decidability lemma in those cases:
Note that if k were not fixed but instead grew with the size of the program, the number of contexts, and hence lookup, could become exponential.
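The effect of fixing k can be made concrete with a little arithmetic. The Python sketch below counts candidate contexts under the assumption, made for illustration only, that a kDDPA context is any sequence of at most k call sites drawn from c call sites, with repetition allowed; `count_k_contexts` is a hypothetical helper, not part of the analysis.

```python
def count_k_contexts(call_sites: int, k: int) -> int:
    """Count call strings of length 0..k over `call_sites` symbols.

    Assumption (illustrative): every sequence of at most k call sites,
    repetitions allowed, is a distinct context.
    """
    # Sum of c^0 + c^1 + ... + c^k: polynomial in c when k is a constant.
    return sum(call_sites ** length for length in range(k + 1))
```

With k = 2, going from 50 to 100 call sites raises the count from 2,551 to 10,101, a roughly quadratic increase; if k instead grew with c, the c^k term would dominate and the count would grow exponentially.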

Proof of Decidability
Having shown lookup to be polynomial, it is now possible to show a similar result for the overall analysis. Lemma 7.5. Variable lookup is monotonic; that is, for any x̂ and â, if Ĝ_1 ⊆ Ĝ_2, then every value found by looking up x̂ from â in Ĝ_1 is also found in Ĝ_2.
Proof. Variable lookup is encodable as a PDS reachability problem (see Lemma 7.3), and the PDS grows monotonically with the graph Ĝ. PDS reachability grows monotonically with the PDS. Therefore, the set of results from variable lookup grows monotonically with the graph Ĝ.
Lemma 7.6. The evaluation relation −→* is confluent.
Proof. By inspection of Figure 16, single-step evaluation only adds to the graph Ĝ. The Active? relation is also clearly monotone: an enabled redex is never disabled. Confluence follows trivially from these two facts.
x ĉ =x , and only one of each of those nodes can exist for each call site / function body pair in the source program: ĉ is the call site, and x /x are variables in that call site and function body source, respectively. So the number of nodes that can be added is always less than twice the square of the size of the original program. A similar argument holds for added edges.
We let Ĝ_0 ↓ Ĝ abbreviate Ĝ_0 −→* Ĝ such that Ĝ −→¹ Ĝ. We write e ↓ Ĝ to abbreviate Embed(e) ↓ Ĝ; this means the analysis of e returns graph Ĝ. Given the pieces assembled above, it is now easy to prove that the analysis is polynomial-time.
Theorem 7.8. Fixing Σ to be some Σ_k and fixing some expression e, the analysis result Ĝ, where e ↓ Ĝ, is computable in time polynomial in the size of e.
Proof. By Lemma 7.4, each lookup operation takes poly-time. The evaluation rules are trivial computations besides the required lookups and, by Lemma 7.7, there are polynomially many evaluation steps before termination. Thus e ↓Ĝ is computable in poly-time.
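The proof of Theorem 7.8 rests on evaluation being a monotone, additive process that reaches a fixed point after polynomially many productive steps. The Python sketch below shows that shape of computation with a generic saturation loop; the `compose` step (a transitive-edge rule) is a hypothetical stand-in for the evaluation rules of Figure 16, chosen only because it is likewise additive.

```python
from typing import Callable, FrozenSet, Tuple

Edge = Tuple[str, str]
Graph = FrozenSet[Edge]


def saturate(step: Callable[[Graph], Graph], graph: Graph) -> Tuple[Graph, int]:
    """Apply an additive step until it adds nothing new (a fixed point).

    Returns the final graph and the number of steps that changed it.
    """
    productive = 0
    while True:
        new_graph = graph | step(graph)  # steps only ever add to the graph
        if new_graph == graph:           # one more step changes nothing: done
            return graph, productive
        graph, productive = new_graph, productive + 1


def compose(graph: Graph) -> Graph:
    """Toy additive rule: add a -> d whenever a -> b and b -> d are present."""
    return frozenset((a, d) for (a, b) in graph for (c, d) in graph if b == c)


result, steps = saturate(compose, frozenset({("a", "b"), ("b", "c"), ("c", "d")}))
```

Because each productive step adds at least one element and the universe of possible elements is bounded (by Lemma 7.7 in DDPA's case), the loop can run only that many productive iterations before it stops.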

IMPLEMENTING LOOKUP
We showed in the previous section how lookup may be encoded as a PDS reachability question. Although lookup on the PDS described in Section 7.1.1 is polynomial time (Lemma 7.4), the PDS is quite large and its naive construction is slow in practice.
In this section, we formally describe PDRs, which can be viewed as syntax for schematically defining collections of PDS transitions with a single piece of syntax, collapsing the state-space blowup alluded to above. We do not prove theoretical bounds for the PDR (the worst-case time complexity of reachability remains the same), but the evaluation of this approach in Section 10 demonstrates that a DDPA implementation using it is comparable in performance to other recent higher-order program analyses.

Pushdown Reachability
The standard algorithm for solving pushdown reachability is to perform an edge closure [6]: for each pair of adjacent matching push and pop edges, add a single transitive no-op edge. For any path between two states in the original automaton whose pushes and pops are balanced, closure will ensure a path between those same states that consists solely of no-op edges. Pushdown reachability is decidable in polynomial time [6] but, if the number of states and edges is large, it may still be slow in practice.
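The closure just described can be sketched in a few lines of Python for a single-action PDS, in which each edge pushes one symbol, pops one symbol, or does nothing. Besides canceling a push against an adjacent matching pop, the sketch absorbs trailing no-ops into push and no-op edges, so that a push and a pop separated by no-ops still meet. The edge encoding and the names `close` and `empty_stack_reachable` are illustrative choices, and the naive quadratic loop makes no attempt at efficiency.

```python
NOP = ("nop",)


def close(edges):
    """Saturate a single-action PDS edge set with summary edges.

    Each edge is (src, action, dst), where action is NOP, ("push", g), or
    ("pop", g). Returns a new set containing the closed edges.
    """
    edges = set(edges)
    while True:
        new = set()
        for (a, act1, b) in edges:
            for (b2, act2, c) in edges:
                if b != b2:
                    continue
                if act2 == NOP and act1[0] in ("push", "nop"):
                    new.add((a, act1, c))  # a following no-op keeps the net action
                elif act1[0] == "push" and act2 == ("pop", act1[1]):
                    new.add((a, NOP, c))   # matching push and pop cancel out
        if new <= edges:                   # nothing new: closure is complete
            return edges
        edges |= new


def empty_stack_reachable(edges, start):
    """States reachable from `start` that begin and end with an empty stack."""
    closed = close(edges)
    return {start} | {c for (a, act, c) in closed if a == start and act == NOP}


# push x, a no-op, then either the matching pop x or a mismatched pop y
demo = {(0, ("push", "x"), 1), (1, NOP, 2), (2, ("pop", "x"), 3), (2, ("pop", "y"), 4)}
reachable = empty_stack_reachable(demo, 0)
```

In the demo, state 3 is reachable because the pop of x matches the earlier push, while state 4 is not, since popping y cannot cancel a push of x.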
To illustrate the size of the PDS described in Section 7.1.1, let us consider rule 7 of the alternative lookup function (Definition 7.1). This rule dictates that the PDS should contain an edge for each pairing between a variable (x̂) and an alias clause not defining that variable (x̂′ = b̂ such that x̂′ ≠ x̂); that is, this one rule increases the size of the PDS at least quadratically in the size of the program, and reachability is a polynomial of that size. In practice, most of those edges will not be used: not every variable is looked up from every position in the CFG, if only for reasons of scope.
Previous work [24] describes an efficient algorithm for reachability on pushdown systems that have a large number of states; that algorithm, too, was developed for program analysis. It expands the automaton lazily, adding states (with their outgoing edges) as necessary to solve the reachability query; this avoids adding states that are not involved in computing reachability.
The PDS of Section 7.1.1 not only has a large number of states (one for each program point â in each context Ĉ) but, as illustrated above, also a large number of edges at each state. Our algorithm extends the work of PDCFA [24] by representing collections of edges schematically and adding them to the automaton on demand.
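A minimal Python sketch of this on-demand idea follows, assuming the edge encoding NOP, ("push", g), ("pop", g). It is not the PDR algorithm itself: there are no dynamic pops, and closure is a naive fixed-point loop rather than a work-list. It only shows that a transition function consulted lazily is never asked about states (such as 99 in the demo) that no reachability path can touch. The names `lazy_close` and `demo_transitions` are hypothetical.

```python
NOP = ("nop",)


def lazy_close(start, transitions):
    """Pushdown closure that requests outgoing edges only for states it reaches.

    `transitions(state)` returns (action, dst) pairs. It is called only for
    states reachable from `start` via push and no-op edges of the closed graph.
    Returns (expanded_states, edges).
    """
    expanded, edges, frontier = set(), set(), [start]
    while frontier:
        # Expand newly reached states on demand.
        for s in frontier:
            if s not in expanded:
                expanded.add(s)
                edges |= {(s, act, dst) for act, dst in transitions(s)}
        # Close the current edge set.
        while True:
            new = set()
            for (a, act1, b) in edges:
                for (b2, act2, c) in edges:
                    if b != b2:
                        continue
                    if act2 == NOP and act1[0] in ("push", "nop"):
                        new.add((a, act1, c))
                    elif act1[0] == "push" and act2 == ("pop", act1[1]):
                        new.add((a, NOP, c))
            if new <= edges:
                break
            edges |= new
        # Targets of push/no-op edges from expanded states are now reachable.
        frontier = [b for (a, act, b) in edges
                    if a in expanded and act[0] in ("push", "nop") and b not in expanded]
    return expanded, edges


calls = []


def demo_transitions(s):
    """Hypothetical schematic spec over integer states; logs each expansion."""
    calls.append(s)
    if s == 0:
        return [(("push", "x"), 1)]
    if s == 1:
        return [(("pop", "x"), 2), (("pop", "y"), 99)]
    if s == 99:
        return [(NOP, 100)]  # never expanded: pop y cannot match push x
    return []


expanded, edges = lazy_close(0, demo_transitions)
```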

Formalizing Pushdown Systems.
Our objective is to construct an automaton to facilitate a reachability algorithm that produces results equivalent to those of naive pushdown reachability. For this reason, it is useful to formalize our notion of a pushdown system. We begin by defining common notation: Definition 8.1. Given a finite set of stack symbols Γ, we use Γ̃ to denote the set {ϵ} ∪ {γ↓ | γ ∈ Γ} ∪ {γ↑ | γ ∈ Γ} of actions (no-ops, pushes, and pops) over Γ. We use γ̃ to denote individual elements of Γ̃. We denote the set of lists of these actions as Γ̃*.
Note that this definition of a PDS has no marked start or finish states. We select a start state based upon the reachability question. As mentioned in the proof of Lemma 7.3, the finish states are those at which the PDS stack is empty.
It is also important to note that Definition 8.2 describes a "single action" pushdown system: Each transition may push, pop, or do nothing. Other PDS definitions, such as the canonical "single pop, multi push" formulation (in which each transition must pop one stack element and may then push any fixed string of stack elements), may be encoded in "single action" form by adding intermediate states.

Pushdown Reachability Automata
We now define PDRs. We will give a general PDR algebraic signature and will also point out how that signature is instantiated for our analysis. We begin with the domain, which is fixed throughout closure. This defines the states and stack elements of a PDR. Pre-states S are just the core state information. In DDPA, they are defined as contextualized states: pairs of â and Ĉ, which are represented by the small round nodes in Figure 11. Full PDR states Q include a pre-state element in S as well as an action stack from Γ̃*. Intuitively, the PDR state s[γ̃, ...] expresses "I will be in state s once I complete the actions [γ̃, ...]." These are the small red intermediate states in Figure 14. Finally, the PDR domain includes a finite set of dynamic pop actions Ψ to label actions having schematic target states; the purpose of dynamic pop actions will be clarified below.
Next, we define the general transition function signature for a PDR. Targeted transitions t have a concrete state target whereas untargeted transitions u have a schematic target that is made concrete by p. We show how Definition 7.1 is given a PDR transition specification in Section 8.4 below. Together, the PDR domain and PDR transition specification define the structure of a PDR separate from a particular reachability question.

Pushdown Reachability Closure
We now give the algorithm for the PDR reachability closure process. The objective of this process is to yield the same results as the closure of a corresponding pushdown system. Unlike PDS closure, which starts with all nodes and edges present, PDR closure adds states and transitions lazily, based on the transition specification. We call this lazily constructed automaton the PDR graph: Definition 8.5. Given a PDR domain ⟨S, Q, Γ, Ψ⟩, a pushdown reachability graph is a 3-tuple ⟨Q_C, δ, η⟩ where
In this definition, the set of lazily constructed current states Q_C contains the states in Q that are reachable from the starting point of a query and should be explored. The set of stack action transitions δ has the same meaning as in a PDS. Additionally, a PDR graph has a set of dynamic pop transitions η; the only transitions in this set are those attached to states that have already appeared in Q_C. Recall from Section 3 that our goal is to determine reachability on schematic pushdown automata; the dynamic pop transitions η act as transition generators defined by the schema, and we use them to add to the graph only those transitions that may affect the result of a particular lookup question.
With this data structure, we now begin defining the closure process for computing reachability. For motivation, consider a PDR graph construction for our analysis: the lookup of a variable x from a program point p in the empty (i.e., unknown) context []. To compute this closure, it is sufficient to close over a PDR based upon the initial PDR graph ⟨{s[x↓]}, ∅, ∅⟩ for s = ⟨p, []⟩. This graph contains a single state s[x↓] (read: "I will be in state s once I push x onto the lookup stack") and no transitions. The closure process will perform this push and then apply the PDR transition spec to the resulting states, gradually expanding the graph to discover all states reachable from this starting location.
We formally define PDR closure as follows: Definition 8.6. For a fixed PDR domain ⟨S, Q, Γ, Ψ⟩ and PDR transition spec ⟨t, u, p⟩, we define =⇒ as the least relation between PDR graphs that obeys the rules in Figure 31.
The first three rules in Figure 31 are formal presentations of the basic PDS closure algorithm informally described in Section 8.2: when a push can reach a matching pop, the two cancel and a no-op results. The Push+Dynamic Pop to State rule performs closure over the dynamic pops as described in Section 3.2: as new push elements reach dynamic pops, the appropriate specification function is invoked to determine the resulting destination. Note that this destination is an element of Q with a list of pending stack actions; the Pending Action rule ensures that such states are expanded to eventually reach their destinations. The Push+Dynamic Pop to Dynamic Pop rule performs a similar dynamic closure: the result is another dynamic pop, which is attached to the source of the push in a form of continuation processing discussed in Section 8.4 below. Finally, the Transition Expansion rule ensures that we add appropriate transitions for each state; although our graph initially contains no transitions at all, we draw transitions from functions t and u to add to the graph as closure proceeds. Note that the other rules add states reachable via pushes and no-ops to the Q_C set, ensuring that transitions from those states will be explored.
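The following Python sketch gives a minimal reading of the Pending Action and Push+Dynamic Pop to State rules; the exact rules of Figure 31 are not reproduced here. It assumes a hypothetical encoding in which a PDR state is a pair (pre-state, pending actions) and a dynamic pop is a pair (state, ψ). A push of γ that reaches a dynamic pop ψ is canceled against it, so each destination chosen by p(ψ, γ) is linked to the push's source by a no-op edge, and the Pending Action rule then unrolls that destination's pending actions one edge at a time.

```python
NOP = ("nop",)

# Hypothetical encoding: a PDR state is (pre_state, pending), where `pending`
# is a tuple of actions such as ("push", "x") still to be performed.


def pending_action_edges(q):
    """Pending Action (sketch): perform the first pending action of q."""
    pre, pending = q
    if not pending:
        return []
    return [(q, pending[0], (pre, pending[1:]))]


def push_meets_dynamic_pop(push_edge, dyn_pop, p):
    """Push+Dynamic Pop to State (sketch).

    push_edge = (src, ("push", g), mid); dyn_pop = (mid, psi).
    The pushed g is consumed by the dynamic pop, so each target returned by
    p(psi, g) is reached from src by a no-op edge.
    """
    src, action, mid = push_edge
    dyn_src, psi = dyn_pop
    if action[0] != "push" or mid != dyn_src:
        return []
    return [(src, NOP, target) for target in p(psi, action[1])]


def p_repush(psi, g):
    """Hypothetical spec function: send the popped symbol on to state "s1"."""
    return [("s1", (("push", g),))]


s0, m = ("s0", ()), ("m", ())
new_edges = push_meets_dynamic_pop((s0, ("push", "x"), m), (m, "psi"), p_repush)
followups = pending_action_edges(new_edges[0][2])
```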
In our analysis implementation, a variety of common optimizations are applied to prevent duplicate work and realize these rules efficiently, including an algorithm similar to the "work-list" algorithm of PDCFA [24] (which prevents two edges from being closed with each other more than once). Throughout closure, this algorithm ensures that only states reachable from an initial query state via a series of push and no-op transitions are "alive," meaning that they are the source of transitions in the δ and η sets.
An important feature of DDPA is that the same PDR structure can be used throughout the analysis of a particular program. This is a consequence of the monotonicity property proven in Lemma 7.5: when performing a second variable lookup, one need only add an appropriate lookup state to Q_C and perform the new incremental closure steps. This means that, if the proofs of two lookup operations in Definition 7.1 share a derivation subtree, the closure work to establish that portion of the proof is performed only once. In practice, this work sharing occurs, for instance, whenever the analysis looks up two non-local variables from the same function; it has a significant effect, contributing to a 1,000× speedup relative to the proof-of-concept implementation, which does not feature work sharing (see Section 10).

From Inductive Definition to PDR Specification
Finally, we show how Definition 7.1 is expressed as a PDR. For brevity, we focus on a select few cases of that definition. We will incrementally define the transition functions t, u, p by adding mappings to these transition relations as the discussion proceeds; we will similarly incrementally define the dynamic pop actions Ψ.

Using the Transition Functions.
We begin by considering rule 1. This rule is invoked when we discover the definition of the variable we are currently seeking, which is based upon the edges contained within our CFG. This could be encoded in a PDS as follows: for each CFG edge with a source clause of the form â_1 = (x̂ = b̂) and for each calling context Ĉ, add an edge to pop x̂ and then push b̂, with the state ⟨â_1, Ĉ⟩ as the ultimate destination. Since there may be many possible stack contexts Ĉ, many edges could be added, illustrating the inefficiency of using a PDS directly.
Rather than enumerating these edges, we can define in our PDR a function t_{â_1 ≪ â_0} for each CFG edge â_1 ≪ â_0 of this form:
That is, whenever we can reach state ⟨â_0, Ĉ⟩, we should add a pop of x̂ from it to an intermediate state, which will then push b̂ to reach state ⟨â_1, Ĉ⟩. This function will be called during PDR closure with each currently reachable state, ensuring both that ⟨â_1, Ĉ⟩ is reachable for the lookup of x̂ (correctness) and that these edges are only added to the PDR if a lookup for x̂ occurs with context stack Ĉ (efficiency). Rule 1 does not require dynamic transitions, so no additions to u or p are needed here.

Dynamic Operations.
Some clauses base their behavior upon the contents of the stack. For instance, rule 7 addresses clauses that do not define the variable for which we are looking. To represent a clause x̂′ = x̂″ in the PDS, we must be able to pop and push any variable other than x̂′.
As stated above, encoding this rule in a simple PDS would require a pop-then-push transition for each variable in the program, but most of these transitions would never be used. Many unused transitions can be statically eliminated with simple reasoning; we can, for instance, omit transitions for out-of-scope variables. Static elimination is limited, however, as some transitions remain unused during analysis simply because they are not necessary to solve the question at hand; that is, the transition may not be necessary to look up one variable but may be necessary for another. Static elimination is also insufficient when modeling the stack-based continuations we discuss below.
We instead represent groups of transitions succinctly using a dynamic pop action. We include elements of the form ClauseSkip(s, x̂′) in our set of dynamic pop actions Ψ. We then define a function u_{â_1 ≪ â_0} for CFG edges of the form â_1 = (x̂′ = b̂) as follows: u_{â_1 ≪ â_0}(q) = {ClauseSkip(⟨â_1, Ĉ⟩, x̂′)} if q = ⟨â_0, Ĉ⟩ for some Ĉ, and ∅ otherwise.
This ensures that a ClauseSkip is added to each PDR node representing this clause (in any calling context). We then write a function p_Skip to ensure that, when an appropriate lookup variable arrives, it is handled correctly.
Recall that this function is invoked by the Push+Dynamic Pop to State rule when a push for variable x̂ arrives at the state where the ClauseSkip was added by u_{â_1 ≪ â_0} above. This function will then examine x̂. If it matches the x̂′ of the clause, then we add no transitions to the PDR: we have discovered the variable we want, and the t_{â_1 ≪ â_0} function above will handle this case. Otherwise, we add edges that allow the lookup to proceed by ignoring this clause and moving to the one before it.
Again, PDR closure will ensure that this only occurs for lookup variables that actually arrive at this program point. We can safely avoid adding to our PDR the vast majority of PDS edges that this clause would naively imply.
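A Python sketch of p_Skip under a hypothetical encoding: a PDR state is a pair (pre-state, pending actions), a pre-state is a pair (program point, context), and ClauseSkip records the predecessor pre-state ⟨â_1, Ĉ⟩ together with the variable x̂′ that the clause defines. The names and data layout are our own assumptions, not the implementation's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClauseSkip:
    """Dynamic pop for a clause x' = b that may not define the sought variable.

    pred: the pre-state (program point, context) to continue from.
    defined: the variable x' defined by the clause.
    """
    pred: tuple
    defined: str


def p_skip(psi: ClauseSkip, looked_up: str):
    """Targets for a lookup variable that has been popped by a ClauseSkip."""
    if looked_up == psi.defined:
        return []  # the clause defines it: the targeted transition t handles this
    # Otherwise skip the clause: continue at pred with the variable pushed back.
    return [(psi.pred, (("push", looked_up),))]


skip = ClauseSkip(pred=("a1", ()), defined="y")
```

Only variables that actually arrive at the clause ever reach `p_skip`, which is what keeps the unused PDS edges out of the graph.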

Continuations.
In addition to the above-described dynamism, PDR closure can model a rudimentary form of computation over multiple stack elements. This is best illustrated by implementing rule 9 (the "capture" rule) of Definition 7.1. This rule is instrumental in the restructuring of the lookup function to be amenable to embedding as a pushdown reachability problem: It allows the first subordinate lookup in Definition 4.6's rule 6 to be written as Definition 7.1's rule 6a and allows the result of that subordinate lookup to be stored in the stack for use by rule 6b.
The PDR is further motivated by this use case: statically constructing a pushdown system to support capture is prohibitively expensive, as the rule must apply to any state in the PDS. Specializing capture to the aforementioned rules 6a and 6b would require a number of PDS transitions quadratic in the number of states, while a more general form of capture applicable to, e.g., binary operators would require exponentially many transitions. As most of these transitions would be unused and it is not clear how to preemptively identify them, we rely upon the lazy nature of the PDR to solve this problem.
In a walk of the pushdown system, the capture behavior of rule 6a would require four pops: the value, the capture symbol, and two arbitrary elements. As our closure algorithm only admits one pop per transition, we encode this process by introducing four dynamic pop forms to the Ψ set. Each dynamic pop form represents a continuation in the process of capturing a value. Figure 32 illustrates this process with a sample PDR, showing the closure process operating on the transition functions we now describe.
We initially suppose that the PDR fragment contains only the nodes and solid edges in the middle of the diagram. Consider a naive walk of the automaton: the sequence of push operations in the chain of solid edges suggests that the node can be reached with [v̂, Capture(2), k̂_1, k̂_2] as the top four elements of the stack (with v̂ topmost). Our goal is to reorganize those stack elements in accordance with rule 9.
We begin by adding our first new dynamic pop form, TryCapture1, to every new state. This is accomplished by the following function:
Once this step is complete, TryCapture1 is added to the set of dynamic pop transitions by the Transition Expansion rule (Figure 31). This is denoted in the diagram by the unlabeled edge from that state to the rectangular node below it. This dynamic pop is used to begin the capture process whenever a value arrives. Next, we define a function that reacts to an element on top of the stack by storing it in a dynamic pop of the form TryCapture2(v̂) for safekeeping:
Once this step is complete, TryCapture2(v̂) is introduced to the graph. Note that this occurs regardless of whether a Capture(2) appears on the stack. This is a consequence of the single-action model of PDR closure and is an intentional tradeoff: the single-action model may lead to the addition of spurious TryCapture2 elements to the PDR graph, but it ensures that the work-list mentioned in Section 8.3 is bounded quadratically by the size of the graph.
Next, we wish to pop a Capture(2) element, continuing to keep the v̂ value so we may insert it lower in the stack. We define a function to introduce a dynamic pop DoCapture1(v̂) to reflect this:
We likewise must pop k̂_1 and, like v̂, store it until we have enough elements to reorganize the stack. A function introducing a dynamic pop DoCapture2(v̂, k̂_1) addresses this:
This function performs the corresponding step of the process in Figure 32. By this point, we have popped three elements from the stack; once another element is popped, we will be ready to introduce a series of push operations that effectively reorders the stack. We complete the process by introducing a function to do so:
The above function, together with the Push+Dynamic Pop to State and Pending Action rules, creates the dotted path along the top of the diagram that is labeled as the final step. Note that the above functions assume a fixed notion of q that was elided for simplicity. In reality, q would be included as a parameter of each of the above dynamic pop forms.
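The continuation chain can be simulated in Python on a concrete stack, which makes the one-pop-per-step discipline visible. This is a sketch under two assumptions: rule 9 is read as moving the captured value beneath the next two elements (so [v̂, Capture(2), k̂_1, k̂_2] becomes [k̂_1, k̂_2, v̂]), and the state parameter q is omitted, as in the prose above. The classes mirror the four dynamic pop forms named in the text; `step` and `run_capture` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    name: str


CAPTURE2 = "Capture(2)"


@dataclass(frozen=True)
class TryCapture1:
    pass


@dataclass(frozen=True)
class TryCapture2:
    v: Value


@dataclass(frozen=True)
class DoCapture1:
    v: Value


@dataclass(frozen=True)
class DoCapture2:
    v: Value
    k1: object


def step(cont, popped):
    """Apply one dynamic pop; return ("cont", next), ("push", actions), or None."""
    if isinstance(cont, TryCapture1):  # start only when a value arrives
        return ("cont", TryCapture2(popped)) if isinstance(popped, Value) else None
    if isinstance(cont, TryCapture2):  # expect the Capture(2) marker next
        return ("cont", DoCapture1(cont.v)) if popped == CAPTURE2 else None
    if isinstance(cont, DoCapture1):   # remember k1
        return ("cont", DoCapture2(cont.v, popped))
    if isinstance(cont, DoCapture2):   # popped is k2: re-push with v deepest
        return ("push", [cont.v, popped, cont.k1])
    return None


def run_capture(stack):
    """Run the chain on a concrete stack (list, top first); None if it dead-ends."""
    cont = TryCapture1()
    for i, top in enumerate(stack):
        result = step(cont, top)
        if result is None:
            return None
        kind, payload = result
        if kind == "push":
            rest = list(stack[i + 1:])
            for elem in payload:  # pushes are performed in list order
                rest.insert(0, elem)
            return rest
        cont = payload
    return None  # ran out of stack elements


after = run_capture([Value("v"), CAPTURE2, "k1", "k2", "rest"])
```

A stack without Capture(2) in second position still creates a TryCapture2, which then dead-ends; this mirrors the spurious-but-harmless TryCapture2 elements mentioned above.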
The above demonstrates how a finite sequence of arbitrary stack operations may be encoded in a continuation-passing-like form in PDR closure. The full formal specification of DDPA uses this technique extensively to implement features such as binary operators and state. As with other closure operations, our use of a monotone, compact automaton yields significant work-sharing benefits.
To show our analysis is decidable, we only need to show it uses a finite PDR domain. The pre-states S are pairs of program points and contexts, of which there are finitely many; the stack alphabet Γ is the set of continuation stack elements k̂ from Section 7, which is also finite. The set of states Q is finite, because the length of the action lists from Γ̃* appearing in Q is bounded by a constant; this argument is subtle, as some dynamic pop closures can lead to more elements being pushed onto the stack. With a finite Q, the set Ψ is finite. PDR closure thus operates on finitely many possible PDR graphs ⟨Q_C, δ, η⟩ and so will eventually run out of new states and transitions to add.
We show in Section 10 how this algorithm is fast enough to compete with modern higher-order demand-driven analyses.

EXTENSIONS
In this section, we outline four extensions: records, conditional branching, path-sensitivity, and mutable state. Our goal here is to show that there is no fundamental limitation to the model given in the previous sections: DDPA can in principle be extended to the full feature set of a realistic programming language. For the first three extensions, we use the Overview grammar in Figure 1, and we incrementally build a theory with all three extensions, since they overlap. For mutable state, for simplicity, we show only how the core theory is extended.

Records
Here we outline an extension to a standard notion of records and projection as per the grammar of Figure 1. Consider a lookup of variable x̂: if x̂ is defined as x̂ = x̂′.ℓ, then we must first look up x̂′ and then project ℓ from its record value. x̂′ may also be defined as a record projection, and so on: in the general case, there could be a stack of record projections to be performed. This is similar to the non-local lookup stack of our analysis, and not coincidentally: non-locals may be encoded in terms of records via closure conversion.
Fortunately, the continuation stack we defined in Section 7 lends itself to solving this problem. We add to the grammar of continuation actions k̂ a projection form .ℓ. We then augment Definition 7.1 with the following new clauses: Definition 9.1. We extend Definition 7.1 to records by adding the following clauses; assume â_1 ≪ â_0.

The two clauses above are symmetric: clause Record Projection Start introduces the projection action .ℓ when we discover that we will need to project from the variable we find, while clause Record Projection Stop eliminates this projection action when the corresponding record value is found.
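To illustrate how a projection stack behaves, here is a Python sketch of demand-driven lookup on a hypothetical straight-line program with records, aliases, and projections only: no functions, contexts, or CFG, just a backward scan. Finding x̂ = ŷ.ℓ replaces the sought variable with ŷ and leaves a pending projection ℓ beneath it (in the spirit of Record Projection Start); finding a record while a projection is pending pops it and continues with the field's variable (in the spirit of Record Projection Stop). The clause encoding is our own.

```python
def lookup(program, x):
    """Look up x at the end of a straight-line program (hypothetical model).

    program: list of (var, expr) clauses, where expr is ("value", v),
    ("alias", y), ("record", {label: var}), or ("proj", y, label).
    The stack holds the sought variable on top (last element), with
    pending ("proj", label) actions beneath it.
    """
    stack = [x]
    i = len(program) - 1
    while i >= 0:
        var, expr = program[i]
        i -= 1
        if var != stack[-1]:
            continue                     # skip: clause defines another variable
        kind = expr[0]
        if kind == "alias":
            stack[-1] = expr[1]          # continue with the aliased variable
        elif kind == "proj":             # start: look up y, project later
            _, y, label = expr
            stack[-1] = ("proj", label)
            stack.append(y)
        elif len(stack) == 1:
            return expr                  # value discovered, nothing pending
        elif kind == "record":           # stop: resolve the pending projection
            stack.pop()
            _, label = stack.pop()
            if label not in expr[1]:
                return None
            stack.append(expr[1][label])  # fields are defined earlier
        else:
            return None                  # projecting from a non-record
    return None


program = [
    ("a", ("value", 1)),
    ("b", ("value", 2)),
    ("r", ("record", {"l": "a", "m": "b"})),
    ("s", ("alias", "r")),
    ("x", ("proj", "s", "m")),
    ("t", ("record", {"k": "r"})),
    ("u", ("proj", "t", "k")),
    ("y", ("proj", "u", "l")),
]
```

Looking up `y` builds a two-deep projection stack (ℓ beneath k) before any record is found, which is the situation the continuation stack is designed to handle.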

Conditional Branching
In Section 2, we give conditionals the syntax x ~ p ? f : f. The analysis of conditionals is straightforward: the bodies are wired in just like function calls. The following clauses may be added to the records extension above to obtain an analysis for conditionals. Definition 9.2. We extend Definition 4.6 to conditionals by adding the following clauses. Assume â_1 ≪ â_0 and that ĉ below is always a conditional clause.

In this version we do not refine the analysis based on whether the conditional pattern (abstractly) matched or not, so the analysis is not particularly accurate. The filtering extension below gives a much more precise analysis for conditionals.

Filtering for Path-sensitivity
We can formalize path-sensitivity in DDPA by keeping track of sets of accumulated patterns in our lookup function and disallowing matches that do not respect the patterns they passed through. We use Π+ and Π− to range over sets of patterns that a discovered value must or must not match, respectively. Formally, we define path-sensitive DDPA as an extension of the core, records, and conditionals theories.

The Matching Value Discovery clause shows how the filters are used: Any value not matching the positive filters is discarded, and oppositely for the negative filters.
The original Conditional Top clause was the case where we reached the start of a case clause and the search variable x̂ was passed as the parameter; in that case, the clause continued by searching for the argument at the call site. Here, we have separated that clause into two cases. In Conditional Top Positive, the function was the first branch of a conditional, so we know that any discovered value is only relevant if it matches the conditional's pattern. Thus, we add the pattern to the positive filter set Π+ to enforce this constraint. Clause Conditional Top Negative is the opposite case: the pattern is added to Π−.
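The filtering behavior can be sketched in Python with a deliberately tiny, hypothetical pattern language in which values are (tag, payload) pairs and a pattern is just a tag such as "int". `passes_filters` plays the role of the Matching Value Discovery check, and `enter_branch` plays the role of Conditional Top Positive and Conditional Top Negative; the actual pattern grammar and clauses of Definition 9.3 are richer than this.

```python
def matches(value, pattern):
    """Hypothetical pattern language: a pattern is a type tag like "int"."""
    return value[0] == pattern


def passes_filters(value, positive, negative):
    """Keep a value only if it matches every pattern in Pi+ and none in Pi-."""
    return (all(matches(value, p) for p in positive)
            and not any(matches(value, p) for p in negative))


def enter_branch(positive, negative, pattern, first_branch):
    """Extend the filter sets when the search leaves a conditional branch."""
    if first_branch:                       # Conditional Top Positive
        return positive | {pattern}, negative
    return positive, negative | {pattern}  # Conditional Top Negative


candidates = [("int", 1), ("bool", True)]
then_filters = enter_branch(frozenset(), frozenset(), "int", True)
else_filters = enter_branch(frozenset(), frozenset(), "int", False)
then_values = [v for v in candidates if passes_filters(v, *then_filters)]
else_values = [v for v in candidates if passes_filters(v, *else_filters)]
```

For a conditional `x ~ int ? f : f`, a lookup that exits through the first branch keeps only the integer candidate, while one that exits through the second branch keeps only the boolean.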

State
Lookup in the presence of state may also be performed using only a call graph, but there are several subtle issues that must be addressed. We consider here a variation of the presentation language that includes OCaml-style references with ref, !, and :=. First, a simple backward search for the most recent mutation or creation site may not always give the correct answer, as references may be changed through aliases, as illustrated in the following pseudocode:
If we only consider updates to the inner variable, then a search for its contents by the end of the program would incorrectly yield false. The correct answer is 0 due to the last line, which updates the same reference cell through outer. This is the classic problem of alias analysis: when searching for the contents of mutable variables, we must consider the possibility that statements not directly involving the cell we are examining may update it nonetheless. So explicit alias testing is needed to verify that the search does not skip past updates made through potential aliases.
Second, we must be careful not to confuse allocation with an allocation site. Recursive functions that allocate cells illustrate this issue. In the following code, two cells are allocated: Here, the call f true returns two cells, both allocated at the site ref false. Although both ref values are allocated at the same program point, we must recognize that they are not aliases; thus, by the end of the program, a contains false and not 0. Recognizing this distinction requires us to bear calling context in mind when performing alias analysis.
Not only may DDPA be adapted to address state, but the alias analysis itself may be accomplished by the lookup function. The implementation machinery necessary to accomplish this alias analysis is verbose. For legibility and brevity, we present the necessary steps at a high level.
We begin by adding a single bit of information to contexts in the form of a function, IsDirty(Ĉ), which determines whether a context is "dirty." We require that the initial context is clean and that contexts are marked dirty when precision is lost; for instance, Pop(ϵ) is dirty, since the empty context ϵ indicates that we do not know anything about the stack above the current point. Dirty contexts allow us to recognize the recursive case above: if the context has been pruned, we cannot be certain of the answer to an alias question.
With these functions, we present revised clauses for lookup based upon the original Definition 4.6.
Definition 9.4. Definition 4.6 is extended to a stateful language as follows. First, the codomain of the function is modified to a set of pairs ⟨v̂, Ĉ⟩ by replacing the Value Discovery clause with the following:
We write v̂ ∈ V̂ to indicate ⟨v̂, Ĉ⟩ ∈ V̂ for some Ĉ. Next, we add the following clauses:
In the above definition, clause Dereference handles dereferencing. It finds the ref values that may be in x̂ from the current point in the program; it then returns to that point to find all of the values that those variables may contain. This return is necessary, since we want the value at the point where the ! happened.
Clauses May Alias and May Not Alias address cell updates. Clause May Alias determines whether the updated cell in x̂_2 may alias the cell we are looking up; if so, the value assigned by the cell update may be our answer. Clause May Not Alias addresses the case in which the updated cell may be different from the target of our lookup. This happens when the lookups of each variable yield different results or when they result in multiple cells: even if the sets of cells are equal, the orders in which the program modifies the cells might differ, so we take the conservative approach and call them different. Here, we use the IsDirty function to address the recursive allocation case described above. MayAlias and MayNotAlias can be simultaneously satisfiable; when that happens, the analysis explores both clauses May Alias and May Not Alias.
Along with the above modifications, the existing clauses Skip and Function Exit need to be extended to support state. As written, clause Skip allows us to skip call sites and pattern matches whose outputs do not match the variable for which we are searching. This is sound in a pure system, but in the presence of side effects we must explore these clauses to ensure that they did not affect the cell we are attempting to dereference. We thus modify clause Skip by prohibiting b̂ from being a call site or pattern match. We also require a new clause, similar to clause Function Exit, for the case in which the search variable does not match the output variable. In that case, we proceed into the body of the function but in a "side-effect-only" mode: we skip every clause that neither is a cell assignment nor leads to one. We leave side-effect-only mode once we leave the beginning of the function that initiated it.
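The alias tests described above can be sketched in Python. The precise definitions of MayAlias and MayNotAlias are not reproduced in this section, so the predicates below follow the prose only and are assumptions: an abstract cell is a pair (allocation site, context), and a context carries the IsDirty bit as a field. The site labels such as "ref@3" are hypothetical. As the text notes, both predicates can hold at once, in which case both clauses are explored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    sites: tuple  # most recent call sites (hypothetical representation)
    dirty: bool   # IsDirty: set once precision is lost, e.g. by Pop on an empty context


def may_alias(cells_a, cells_b):
    """Some abstract cell is shared, so the update may hit the looked-up cell."""
    return bool(cells_a & cells_b)


def may_not_alias(cells_a, cells_b):
    """The update may target a different concrete cell than the lookup."""
    if cells_a != cells_b or len(cells_a) != 1:
        return True   # different, multiple, or no cells: be conservative
    ((_site, ctx),) = cells_a
    return ctx.dirty  # one abstract cell in a dirty context may be many cells


clean = Context(("c1",), False)
pruned = Context((), True)
cell = ("ref@3", clean)
recursive_cell = ("ref@3", pruned)
other = ("ref@7", clean)
```

With `recursive_cell`, both predicates hold, which corresponds to the recursive-allocation example: the same allocation site reached under a pruned context might denote two distinct cells.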

EVALUATION
We implemented DDPA and conducted a series of experiments to determine whether it is viable. The implementation supports the extensions described in Section 9 and is available along with the test cases, experiment runners, and raw results.²

Goals
We evaluated our implementation of DDPA against the proof-of-concept implementation³ [36] and against a state-of-the-art higher-order forward analysis, P4F [19]. We had three goals: first, our implementation must be correct and produce the same outputs as the proof-of-concept; second, it should outperform the proof-of-concept; and, third, it should perform similarly to or better than P4F.
Our implementation succeeded at the first two goals. It produces the same results as the proof-of-concept implementation on a test suite designed for the latter implementation's language. Further, our implementation (which includes the PDR automata from Section 8) delivers a speedup of at least 1,000× over the proof-of-concept implementation. Evaluating our third goal, performance relative to P4F, is less straightforward and is the subject of the rest of this section.
We chose to compare to P4F because it is a recent analysis similar to ours in expressiveness: It is flow-sensitive and polyvariant. P4F's reference implementation⁴ aligns with our DDPA implementation in several ways: For example, they both lose precision on numbers and arithmetic operations. More interestingly, we chose P4F because it allows us to shed some light on a broader open question: the tradeoffs between context-sensitivity and data dependence [39] in analyses for higher-order languages. DDPA and P4F represent opposite sides of the tradeoff: DDPA approximates the call stack and captures data dependencies exactly, while P4F captures the call stack exactly and approximates data dependencies. In analyses for first-order and object-oriented languages, it is clear that context-sensitivity dominates data dependence [55]. But the results in the rest of this section suggest that this may not hold in higher-order languages, because DDPA outperforms P4F in some cases.

Test Cases
To compare P4F and DDPA, we selected a series of test cases that ran on both analyses' implementations. Our choice of P4F limited our selection of test cases, as we were unable to run the P4F reference implementation on bigger test cases closer to real-world programs (see Section 10.6). Instead, we selected test cases from P4F's reference implementation (including tests that did not appear in the corresponding P4F paper). We also included a test case, flatten, from OAAM's [23] reference implementation;⁵ it is the only test case in OAAM's suite that is not also included in P4F's and that is supported by both implementations.
Unfortunately, as of this writing, no standard benchmark for higher-order program analyses exists; however, as other evaluations in the literature use these benchmarks, they appear to be a workable approximation for such a benchmark suite. We describe these benchmarks below:
eta: Tests spurious function calls that do not affect the lookup subject.
mj09: Tests the alignment of calls and returns.
kcfa-2 and kcfa-3: Worst cases for k-CFA. Test non-local variables in increasingly nested functions.
blur and loop2-1: Test functions with non-local variables created in a loop.
facehugger: Tests recursive functions with control-flow paths that may only cross if precision is lost.
tak and ack: Test recursive functions.
cpstak: Continuation-passing-style version of tak. Stresses the call-return alignment in our analysis.
sat-1, sat-2, and sat-3: Brute-force SAT solver, which takes exponential time. sat-1 solves a formula with four variables; sat-2 and sat-3 solve the same formula with seven variables, defined as a curried function in sat-2 and as an uncurried function in sat-3.
flatten: Flattens deeply nested lists.
map: Maps a function over the elements of a list.
rsa: Encryption and decryption algorithms from the RSA public-key cryptosystem.
primtest: Fermat primality test.
deriv: Symbolic differentiation.
regex: Regular expression matching with derivatives.
The last four test cases are closer to real-world programs: rsa and primtest are numerical programs, and deriv and regex manipulate lists and symbols. The other test cases are microbenchmarks based on common functional programming idioms. Figure 33 contains statistics on these test cases including number of program points, number of function definitions, and so forth.
The test cases are written in Scheme and not in the language presented throughout Sections 4 through 9. In our experiments, we run DDPA on the output of a translator from Scheme to our presented language. This translator preserves the semantics of the abstract interpretation, but it may not preserve the concrete semantics. This compromise simplifies the analysis implementation by reducing the number of features it must support. For example, all arithmetic operations are translated into additions, because DDPA abstracts all numbers to the same value and loses precision on all arithmetic operations the same way. We also encode Scheme features that our implementation does not support: For example, lists are encoded as records that represent cons cells, and functions with multiple arguments are encoded as functions with a record argument.
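The kinds of encoding this translator performs can be illustrated with a toy translator over S-expressions written as nested Python tuples. This is a sketch under stated assumptions: the output forms (`record`, `project`, `apply`, `let*`), the `_args` parameter name, and the positional field labels `arg0`, `arg1`, ... are hypothetical and are not the syntax of the actual translator or of the core language.

```python
from typing import Any

# Arithmetic operators collapsed into addition (DDPA abstracts all numbers
# to one value, so the abstract result is the same).
ARITH = {"+", "-", "*", "/", "quotient", "remainder", "modulo"}


def translate(e: Any) -> Any:
    """Translate a tiny Scheme subset, written as nested tuples, into a
    hypothetical core syntax. Abstract, not concrete, semantics is kept."""
    if not isinstance(e, tuple):
        return e  # numbers and variable names are unchanged
    head, *rest = e
    if head in ARITH:
        return ("+", *(translate(x) for x in rest))
    if head == "cons":
        car, cdr = rest
        return ("record", (("car", translate(car)), ("cdr", translate(cdr))))
    if head in ("car", "cdr"):
        (arg,) = rest
        return ("project", translate(arg), head)
    if head == "if":
        return ("if", *(translate(x) for x in rest))
    if head == "lambda":
        params, body = rest
        if len(params) == 1:
            return ("lambda", params, translate(body))
        # Several parameters: take one record argument and project its fields.
        binds = tuple((p, ("project", "_args", f"arg{i}"))
                      for i, p in enumerate(params))
        return ("lambda", ("_args",), ("let*", binds, translate(body)))
    # Application: several arguments are packed into one record.
    fn, args = head, rest
    if len(args) == 1:
        return ("apply", translate(fn), translate(args[0]))
    fields = tuple((f"arg{i}", translate(a)) for i, a in enumerate(args))
    return ("apply", translate(fn), ("record", fields))
```

For instance, `translate(("*", "x", 2))` yields an addition, which changes the concrete result but not the abstract one; this is the compromise described above.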

Experiments
We conducted two experiments to compare our implementation to P4F's: Monovariant, in which we used the less expressive (and often, but not always, more performant) settings, and Polyvariant, in which we used the more expressive settings. In Monovariant, we chose k = 0 for both analyses, disabling context-sensitivity. In Polyvariant, we chose k = 1 for P4F, because that is the only polyvariance setting supported by the reference implementation. But choosing k for the DDPA implementation was less straightforward. We could not simply choose the same k for both analyses. The variables k in DDPA and k in P4F may share a name and serve a similar purpose (to determine the amount of context to preserve), but they have different implications. In DDPA, k applies both to the top-level queries and to the sub-queries necessary to fulfill them, causing the context stack to be exhausted more rapidly and the analysis to converge sooner. When compared to P4F, the k of DDPA may have a lesser impact on running time but may also achieve less precision.
So, when possible, we were conservative and chose k such that our analysis is at least as precise as P4F, modeling the control flow accurately. In other cases, we defaulted to k = 1, which happened for one of three reasons. First, our analysis models the control flow accurately with k = 0, so we chose k = 1 in the Polyvariant experiment to distinguish it from the Monovariant experiment. Second, no choice of k suffices, because the source of inaccuracy is a factor other than the context-stack finitization, for example, an arithmetic operation. Third, an accurate control flow is hard to determine. This occurs in the bigger test cases, but k = 1 is a reasonable default for them, because they tend not to use convoluted higher-order functions. Our choice of k for each test case appears in Figure 33.
When running the experiments, we measured running times and memory use, but could not measure expressiveness (see Section 10.6). We ran experiments on a machine with an Intel Xeon (3.10GHz) processor and 8GB of RAM, running Debian 9.5. The machine was dedicated to running the experiments and the load average remained at approximately 1.

Results
We ran 10 trials per test case per experiment. The running times appear in Figure 33. Memory use remained low, never exceeding approximately 230MB, and correlated with running times: When an analysis ran for longer, it also consumed more memory. Our analysis used 0.3× as much memory as P4F on average, a difference we believe to be immaterial and attribute to the implementation languages: Our implementation is written in OCaml, while P4F's is written in Scala (see Section 10.6). The dispersion in all measurements was negligible: The coefficient of variation was below approximately 3%.
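For reference, the coefficient of variation is the standard deviation of the trials divided by their mean. The sketch below computes it with Python's `statistics` module; whether the sample or the population standard deviation was used is not stated, so the choice of `statistics.stdev` (sample) is an assumption, and the trial times are made up for illustration.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


# Made-up running times (seconds) for 10 trials; not data from Figure 33.
trials = [2.01, 1.98, 2.03, 2.00, 1.99, 2.02, 2.00, 1.97, 2.01, 2.00]
cv = coefficient_of_variation(trials)
print(f"coefficient of variation: {cv:.2%}")
```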

Analysis
In most cases, DDPA's and P4F's running times were within the same order of magnitude, supporting the goal that our analysis should perform similarly to P4F. But in two test cases, deriv and regex, DDPA is much slower than P4F. These test cases consist of data structure (list) manipulation and demonstrate the effect of our analysis' perfect continuation-stack precision. We conjecture DDPA is slower on those cases because it is also more precise: DDPA does not lose the connection when data flow into and out of a list, but P4F may. We leave for future work measuring how much precision is recovered (see Section 10.6). When structure-transmitted data dependencies are not fundamental to a program, our analysis performs similarly to P4F, as illustrated by the other two test cases closer to real-world programs, rsa and primtest, which consist of numeric operations.
Beyond our analysis and P4F, our results shed light on the tradeoffs between context-sensitivity and data dependence [39] in analyses for higher-order languages. Our results suggest context-sensitivity may not dominate data dependence the same way it does in first-order and object-oriented languages [55]. In test cases kcfa-3 and rsa, for example, our analysis outperforms P4F. We conjecture this may be a consequence of how these languages are used: Closures with non-local variable references in functional languages are created more often and affect the analyses more than the structure-transmitted data dependencies in object-oriented languages.

Threats to Validity
Test Cases. Our test cases represent common functional programming idioms, but they are not at the scale of real-world programs, they do not stress the state extension, and some of them consist of numeric operations, on which both our analysis and P4F are imprecise. Unfortunately, no standard benchmark for higher-order analyses exists; these are the test cases used by many other publications and so serve as an approximation for such a benchmark. This shortcoming reflects a broader issue in the evaluation of higher-order program analyses.
We settled for these test cases because, to compare the implementations of DDPA and P4F, we required cases that run in both implementations. This limited our choices, because P4F's reference implementation does not support features necessary to analyze many bigger programs and, even when it does, it may fail due to what appears to be an implementation bug. P4F terminates with an exception and does not produce a result when a non-function appears to flow into the operator position of a function call. This occurs regardless of whether the test case really contains an ill-formed function call or the analysis overapproximated. The minimal non-trivial Scheme program that triggers this behavior is the following: P4F's reference implementation loses precision on the condition and allows to flow into f. It then throws an exception when it reaches the function call because is not a function. Our implementation loses precision on the condition in a similar fashion, but it succeeds in analyzing the function call and even detects that it may be ill-formed. Our implementation also ran on bigger test cases that are closer to real-world programs, but they were excluded from the experiments for lack of a P4F baseline.
We worked around this problem by selecting test cases available in P4F's reference implementation. We plan to address the broader issue of evaluating higher-order program analyses in future work.
Expressiveness Measurements. A general metric for expressiveness cannot exist, because different analyses may capture different program properties and their outputs may be incomparable. Other evaluations in the literature work around this issue by measuring expressiveness via a proxy, either an intrinsic analysis property or a client. For example, Earl et al. [12] measure an intrinsic property, the number of singletons (abstract value sets containing a single function), and Might et al. [34] use a client, the number of function inlinings justified by the analyses.
Unfortunately, neither approach is practical for comparing DDPA to P4F, because the analyses do not share technical foundations. Intrinsic properties of the analyses are too far apart; for example, our analysis would be at an unfair advantage if we compared the number of singletons, because P4F's abstract functions include an abstract environment while DDPA's abstract functions do not. (Instead, DDPA relies on non-local variable lookups.) These differences also partly explain why designing and implementing a client compatible with both analyses is an engineering problem of its own, which we leave for future work.
When possible, we worked around this issue by conservatively choosing k for our analysis such that it is as precise as, or more precise than, P4F. We will be able to fine-tune this choice when we have a client compatible with both analyses.
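The singleton metric, and why pairing functions with environments affects it, can be sketched as follows. We assume results are a map from a hypothetical call-site label to the set of abstract functions that may appear in operator position; in the P4F-like case each abstract function is paired with an abstract environment, modeled here as a tuple of hypothetical binding labels.

```python
from typing import Dict, FrozenSet, Hashable, Tuple

# Hypothetical results: call-site label -> abstract functions in operator position.
Results = Dict[str, FrozenSet[Hashable]]


def count_singletons(results: Results) -> int:
    """Number of call sites whose abstract value set holds exactly one function."""
    return sum(1 for fns in results.values() if len(fns) == 1)


def drop_environments(results: Dict[str, FrozenSet[Tuple[str, Tuple[str, ...]]]]) -> Results:
    """Forget the abstract environment of each (lambda, environment) pair."""
    return {site: frozenset(lam for lam, _env in fns) for site, fns in results.items()}


# P4F-like result: one lambda, reached with two different abstract environments.
with_envs = {"c1": frozenset({("f", ("n@site1",)), ("f", ("n@site2",))}),
             "c2": frozenset({("g", ())})}
print(count_singletons(with_envs), count_singletons(drop_environments(with_envs)))
```

Dropping environments turns `c1` into a singleton, so an analysis whose abstract functions carry no environment scores higher on this metric without necessarily being more precise.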
Experimental Setup. We implemented DDPA in OCaml. P4F's reference implementation is written in Scala. The general performance difference between these languages is negligible for our purposes.

RELATED WORK
DDPA uses many concepts of first-order demand-driven CFL-reachability analyses [21] to give a precise analysis of higher-order functions: Like demand-driven CFL-reachability analyses, DDPA is centered around using a CFG to look up variable values in a demand-driven fashion, calls and returns are aligned, and lookup is computed lazily. Two issues make a higher-order analysis more challenging: The CFG needs to be computed on-the-fly due to the presence of higher-order functions, and non-local variable lookup is subtle. The demand-driven analyses cited in this article delve further into the tradeoff between active propagation and demand-driven lookup, and this is something we plan to explore in future work. There are many other first-order program analyses with a demand-driven component; several use Datalog-style specification formats [42,46,56].
The challenges we face in precise non-local variable lookup are related to data propagation challenges in first-order languages. Intuitively, one might attempt to address non-local variables via a closure conversion pass [5]: We can explicitly add closure structures to the language syntax, and function values become pairs of the function's code and a list or record containing the non-local values. While this translates the challenge into the first-order analysis space, it is not any easier to solve: The problem of finding the correct binding for a non-local variable is now the problem of accessing the correct field in a list or record. This problem is known to be difficult even in first-order program analyses: Reps [39] proved that these structure-transmitted data dependences are impossible to track perfectly. The tight relationship between the analyses of non-local lookup and of structured data lookup is clear in DDPA: Both non-locals and record accesses use the same continuation stack as per Section 9, and the requirement for a second fully precise (call) stack leads to the undecidability of ωDDPAc. There does not appear to be any mention of this connection between first-order data dependencies and higher-order non-local variable accesses in the literature.
Some first-order analyses that also track structure-transmitted data dependencies (e.g., LCL-reachability analysis [55]) are in a similar design space as DDPA but optimize a different objective. In higher-order analyses, the loss of data dependencies causes imprecision in closure lookup, which rapidly pollutes the CFG and degrades analysis precision. For this reason, DDPA preserves a perfect stack for data dependencies and approximates the call stack only. First-order analyses sometimes sacrifice data-dependence precision to improve call stack precision, as there is a less drastic fall-off in overall precision when data dependencies are imperfect.
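The core idea of demand-driven lookup, walking the CFG backward from the query point instead of propagating facts forward, can be sketched for a first-order program with only constant and copy assignments. This is a simplified illustration: it has no calls, returns, or non-locals, and the clause encoding (`const`, `copy`, `skip`) and point names are hypothetical.

```python
from typing import Dict, Hashable, List, Set, Tuple

# A clause is ("const", x, value), ("copy", x, y) meaning x = y, or ("skip",).
Clause = Tuple


def lookup(var: str, point: str, clauses: Dict[str, Clause],
           preds: Dict[str, List[str]]) -> Set[Hashable]:
    """Values `var` may hold immediately after the clause at `point`."""
    results: Set[Hashable] = set()
    seen: Set[Tuple[str, str]] = set()
    work = [(var, point)]  # (variable being searched for, current point)
    while work:
        x, p = work.pop()
        if (x, p) in seen:
            continue  # already explored this subgoal
        seen.add((x, p))
        clause = clauses[p]
        if clause[0] == "const" and clause[1] == x:
            results.add(clause[2])  # definition found: this path is done
        elif clause[0] == "copy" and clause[1] == x:
            # x = y: continue searching for y before this clause
            work.extend((clause[2], q) for q in preds.get(p, []))
        else:
            # Clause does not define x: skip past it
            work.extend((x, q) for q in preds.get(p, []))
    return results


# p0: skip; then either p1: x = 1 or p2: x = 2; then p3: y = x; p4: z = 3
clauses = {"p0": ("skip",), "p1": ("const", "x", 1), "p2": ("const", "x", 2),
           "p3": ("copy", "y", "x"), "p4": ("const", "z", 3)}
preds = {"p1": ["p0"], "p2": ["p0"], "p3": ["p1", "p2"], "p4": ["p3"]}
```

A query such as `lookup("y", "p4", clauses, preds)` explores only the subgoals it needs (here, `y` and then `x`); no facts are computed for variables that no query asks about.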
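The closure-conversion view can be made concrete with a small Python example. The function and field names are hypothetical; the point is that after conversion, the non-local `n` is no longer a variable reference but a field read from an environment record, so an analysis must track which record reaches `env` to know which `n` is read.

```python
# Direct style: `n` is a non-local variable of the inner lambda.
def make_adder(n):
    return lambda x: x + n


# Closure-converted style: the inner code is closed and reads `n` from an
# explicit environment record; a closure is a (code, environment) pair.
def adder_code(env, x):
    return x + env["n"]


def make_adder_cc(n):
    return (adder_code, {"n": n})


def apply_closure(closure, arg):
    code, env = closure
    return code(env, arg)


add5 = make_adder_cc(5)
add2 = make_adder_cc(2)
# Which "n" is read depends on which record flows into `env`.
print(apply_closure(add5, 1), apply_closure(add2, 1))
```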

DDPA Compared with Higher-order Forward Analyses
Comparing DDPA with the extensive literature of higher-order forward analyses is a difficult task: While there are a great many overlapping concepts, they do not precisely align, and so it is hard to make accurate comparisons. Here it will have to suffice to point out some commonalities and differences.
Higher-order program analyses are generally based on abstract interpretations [8]; such analyses define a finite-state abstraction of the operational semantics transition relation to soundly approximate the program's runtime behavior. The resulting analysis has the same general structure as the operational semantics it was based on: Program points, environments, stacks, stores, and addresses are replaced with abstract counterparts that have finite cardinality, "hobbling" the full operational semantics of the language to guarantee termination of the analysis [31]. A sound analysis will visit the (finitely many) abstract counterparts of all reachable concrete program states, producing a finite automaton representing all potential program runs. Previous abstract-interpretation-based higher-order program analyses are forward analyses [11,24,29,31,34,35,44,52].
Non-local Variables. Dealing properly with non-local variables is a longstanding concern in higher-order program analyses; the classic environment problem [18,30,44] centers around obtaining precision in analyzing non-locals.
Perhaps the biggest contribution of DDPA is how our notion of call-return alignment also aligns non-local variables; this is the purpose of the additional non-locals stack that is not found in any previous work. Other works incorporating call-return alignment lack this non-locals stack and so do not obtain the degree of expressiveness we do with only call-return alignment. The particular advantage of aligning both locals and non-locals is that a full polymorphism model is obtained comparable to kCFA [44], without any explicit machinery for polymorphism.
In previous higher-order demand-driven analysis work, non-local variables are not aligned, and so call-return alignment cannot fully replace polymorphism; explicit let-polymorphism is also included [14,37]. Another analysis in this space, Boomerang [47], targets Java and also does not address call-return alignment for non-local variables. None of this previous work is flow-sensitive.
Some higher-order forward analyses incorporate call-return alignment [19,24,51], but they also do not align non-local variables. One sign of how DDPA is more powerfully aligning calls and returns than these works is that ωDDPAc, DDPA without a pruned call stack, is undecidable, whereas these analyses are decidable with a full call stack. So we are losing a full call stack but gaining accuracy on non-local variables. These analyses incorporate other elements to achieve a fuller effect of polymorphism: CFA2 [51] additionally uses a polymorphism model similar to CPA [1], with a different contour allocated for each different function argument; PDCFA [24] includes an (abstracted) call stack in the program state; and P4F [19] includes an orthogonal polymorphism layer of the kCFA variety. DDPA's improved precision on non-local variables is still insufficient for more sophisticated clients, including environment analysis [32], because it does not preserve information about allocations of concrete vs. abstract bindings, but a variation on DDPA called DRSF addresses this shortcoming [13]. Our PDR automata build on ideas in PDCFA [24] to achieve more efficient reachability results.
Polymorphism and Runtime Complexity. Another important dimension of expressiveness is polymorphism (aka context-sensitivity): whether functions can take on different forms in different contexts of use. The classic higher-order analysis polymorphism model is kCFA [44], which copies contours in analogy to forall-elimination in a polymorphic type system. But there are many routes to behavior that appears as polymorphism, and, as mentioned in the previous paragraph, call-return alignment can provide different contexts for different function calls and so achieves the same effect as polymorphism in DDPA. The example we gave in Figure 3, for instance, needs only call-return alignment to give polyvariant behavior in DDPA. Another example of polymorphism as an emergent phenomenon of other program analysis features is the abstract garbage collection in ΓCFA, which can align calls and returns in tail position, for example, in the example program in Section 2.2 [32, Section 6.6].
Even though we are using a call-stack approximation k levels deep in a similar fashion to kCFA, keeping the most recent k frames, kDDPA polymorphism is not equivalent to kCFA. One sign that it must be different is that 1DDPA is provably polynomial (Theorem 7.8), whereas 1CFA is EXPTIME-complete [49]. The difference is that DDPA must also "spend" stack frames searching for the functions where non-local variables were defined, and so it requires more stack frames for non-local variables to get the same approximation. We conjecture that, for a program with a maximal lexical nesting depth of d, the analysis (k + d)DDPA will be at least as expressive as kCFA. The additional d levels are needed because each lexical level gives the potential for one more level needed to search through to find the original definition of a non-local variable, in analogy with following d access links in a compiler implementation of non-locals. Each lexical level will entail its defining function being added to the call stack, and overall one extra function will appear per lexical level.
Insights into the runtime complexity of kDDPA are also gained when considering the complexity of non-local variable polymorphism. Non-locals are the (only) source of exponential behavior in kCFA [34,50]; in particular, if lexical nesting were assumed to be of some constant depth not tied to the size of the program, then kCFA would not be exponential. The complexity of kDDPA comes from the other direction: For any fixed k, the algorithm kDDPA is polynomial, but k needs to be increased by one for each level of stack alignment we wish to achieve in non-local lookup. Related to this are provably polynomial context-sensitive analyses that, like kDDPA, restrict context-sensitivity in the case of a high degree of lexical nesting [22,34]. mCFA [34] is a polyvariant analysis hierarchy for functional languages that is provably polynomial in complexity. This is achieved by an analysis that "in spirit" is working over closure-converted source programs: By factoring out all non-local variable references, their worst-case behavior has also been removed. But this also affects the precision of the analysis: Non-locals that are distinguished in kCFA are merged in mCFA. In kDDPA, the level of non-local precision is built into the constant k of how deep the runtime stack approximation is, so more precision is achieved as k increases.
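A k-limited call stack of the kind described above, keeping the most recent k frames, can be sketched as follows. The representation (a tuple with the most recent frame last) and the choice to let a pop from an empty stack succeed, treating an empty stack as an unknown context whose pruned frames might match, are assumptions made for this illustration rather than the article's formal definitions.

```python
from typing import Tuple

Stack = Tuple[str, ...]  # call-site frames, most recent last


def push(stack: Stack, frame: str, k: int) -> Stack:
    """Push a frame, then keep only the most recent k frames."""
    new = stack + (frame,)
    return new[-k:] if k > 0 else ()


def pop(stack: Stack, frame: str) -> Tuple[bool, Stack]:
    """Try to return through `frame`.

    A non-empty stack must have `frame` on top. An empty stack is treated
    as an unknown context (its frames may have been pruned), so any frame
    is conservatively allowed; this is an assumption of the sketch.
    """
    if not stack:
        return True, ()
    if stack[-1] == frame:
        return True, stack[:-1]
    return False, stack


s: Stack = ()
for site in ("c1", "c2", "c3"):
    s = push(s, site, k=2)
print(s)  # c1 was pruned
```

Because every sub-query in DDPA pushes onto the same bounded stack, frames spent searching for non-local definitions are pruned just like call frames, which is why extra depth is needed to match kCFA.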
The Need for Call Stack Approximation. DDPA requires the call stack to be finitely approximated to at most k frames in kDDPA. The call stack must in fact be approximated: By Theorem 5.19, the unbounded-stack ωDDPAc is a full and faithful λ-calculus interpreter. Still, kDDPA is currently wasteful on recursion, often unrolling a recursive function k levels only to see them all merge. We plan to address this shortcoming in future work. Note that higher-order analyses run into a similar problem; for example, kCFA keeps a k-depth stack, ΔCFA's [32] Δ-frame strings are finite approximations, and PDCFA [24] must regularize the call stack to incorporate abstract garbage collection in a decidable fashion.
Path Sensitivity and Must Analysis. DDPA only has a weak notion of path-sensitivity via the filters of Section 9.3. It also has only a primitive must-alias analysis in the mutable state extension in Section 9.4. So in these dimensions it is currently short of the state-of-the-art of forward analyses and represents an avenue for future work.
In the end, forward and reverse higher-order program analyses are in parallel universes: Many things appear the same but are subtly different, and some things that appear to be very different are in fact achieving a very similar effect. This shows up in the performance evaluation of Section 10: DDPA does much better on some examples compared to forward analyses and much worse on others. The fact that the performance varies so widely suggests that their theoretical bases are also far apart and points out that demand-driven analyses have the potential to bring new expressiveness and performance advances to higher-order program analyses.

Implementation Techniques
The technique of looking up variables on demand and of aligning calls and returns was first developed in so-called CFL-reachability analyses for first-order languages [10,21,38,40]. To solve the call-return alignment problem, some reduction to grammars, automata, or another formalism is needed, and several different approaches have been used. CFL-reachability reduces to a context-free language question [40], and reductions to pushdown automata formalisms were used for other first-order analyses [41]. In DDPA, we utilize the pushdown stack differently. The unbounded-stack case, ωDDPAc, is undecidable, so we need not align calls and returns with the pushdown stack; however, we still need a continuation stack for non-locals lookup and other actions. This leads us to the use of pushdown systems in the implementation of our analysis and, ultimately, the PDR that was described in Section 8.
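For comparison with DDPA's use of the stack, the first-order CFL-reachability formulation of call-return alignment can be sketched as a worklist computation of balanced paths, where call edges are open parentheses and return edges are the matching close parentheses. This sketch is the classic first-order technique, not DDPA's PDR automaton; the edge encoding is an assumption, and the naive worklist favors clarity over speed.

```python
from collections import defaultdict, deque
from typing import Hashable, Iterable, Optional, Set, Tuple

# An edge is (source, label, target). label is None for an ordinary edge,
# ("(", i) for a call edge at call site i, or (")", i) for the matching return.
Edge = Tuple[Hashable, Optional[Tuple[str, int]], Hashable]


def realizable_pairs(nodes: Iterable[Hashable],
                     edges: Iterable[Edge]) -> Set[Tuple[Hashable, Hashable]]:
    """All (u, v) such that some path from u to v has balanced call/return labels."""
    opens_into = defaultdict(list)   # node n -> [(w, i)] for edges w -(i-> n
    closes_from = defaultdict(list)  # node n -> [(x, i)] for edges n -)i-> x
    reach: Set[Tuple[Hashable, Hashable]] = set()
    succ, pred = defaultdict(set), defaultdict(set)
    work = deque()

    def add(u, v):
        if (u, v) not in reach:
            reach.add((u, v))
            succ[u].add(v)
            pred[v].add(u)
            work.append((u, v))

    for n in nodes:
        add(n, n)  # the empty path is balanced
    for src, label, dst in edges:
        if label is None:
            add(src, dst)
        elif label[0] == "(":
            opens_into[dst].append((src, label[1]))
        else:
            closes_from[src].append((dst, label[1]))

    while work:
        u, v = work.popleft()
        # Concatenate balanced paths: u..v then v..b, or a..u then u..v.
        for b in list(succ[v]):
            add(u, b)
        for a in list(pred[u]):
            add(a, v)
        # Wrap a balanced path in a matching call/return pair.
        for w, i in opens_into[u]:
            for x, j in closes_from[v]:
                if i == j:
                    add(w, x)
    return reach


nodes = ["m0", "m1", "m2", "f0", "f1"]
edges = [("m0", ("(", 1), "f0"),   # call f at site 1
         ("f0", None, "f1"),       # body of f
         ("f1", (")", 1), "m1"),   # return to site 1
         ("m1", ("(", 2), "f0"),   # call f at site 2
         ("f1", (")", 2), "m2")]   # return to site 2
pairs = realizable_pairs(nodes, edges)
```

The mismatched path m0, f0, f1, m2 (call at site 1, return to site 2) is never produced on its own; (m0, m2) is reachable only through the two properly matched calls.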
Considerable study ([4, 7, 26] among numerous others) has been made of pushdown-like automata and their reachability properties. Notably, although reachability on unrestricted two-stack pushdown automata is undecidable, restricted multi-stack pushdown automata have proven to be useful approximations in program analyses. One avenue of future work is the exploration of such automata as a basis for a DDPA-like analysis.
Our PDR automata were partly inspired by PDCFA [24], which computes pushdown reachability by maintaining a "compact" structure: Only states stemming from the start state are analyzed. Like that work, our PDR incrementally introduces transitions according to a general schema as they become relevant to reachability. Unlike PDCFA [24], however, our PDR closure algorithm retains a schematic form of transitions as they are introduced; this leads to smaller automata and less redundant effort, something that is particularly applicable to our domain. We then utilize this mechanism to develop a form of "continuation programming" on the automaton, which can handle the more complex clauses of variable lookup (including deep pattern matching). We are unaware of any work that performs this sort of non-trivial "pushdown reachability programming," and it may be a technique applicable to other domains.
Other researchers have taken a related approach of "compiling an analysis" to logic programming or Datalog DSLs [42,46,56]. We expect the implementation of this article could also be mapped onto Datalog, but we have chosen to take the PDR route because it encapsulates a simpler complexity class (polynomial versus not), and we hope thereby to achieve better performance in the long run. Reduction to set constraints is also polynomial [25,28] and so could be an alternative compilation strategy worthy of study.

Other DDPA Precursors
DDPA in fact was not derived from first-order demand-driven analyses; it emerged as a flow-sensitive extension of subtype constraint theory [2]. Each abstract program clause corresponds one-to-one to a subtype constraint; DDPA's addition is the happens-before relation that temporally relates the subtype flows.
The sharing of the CFG and the labels added to refine lookup was inspired by the sharing graphs of optimal λ-reduction [27]. In both DDPA and sharing graphs, the non-locals are not copied in but rather looked up via a careful trace back to their originating definition, with information added to paths to refine lookup accuracy: fan-in and fan-out nodes in optimal reduction, and z / z in DDPA.
We are not the first to be inspired by sharing graphs for program analyses, but the existing work is closer to a subtype constraint inference system than to DDPA because, while it uses sharing graphs, it wires non-local arguments in directly [49].
We are not the first to propose program analyses for higher-order languages that have a "demand-driven" component [9,48]. Dubé and Feeley [9] propose an analysis that iterates between a standard forward pass and a refinement pass that re-tunes the (forward) analysis based on the results of the previous pass, demanding more precision only where it is needed, and then making another pass. This analysis is thus demand-driven in a different sense than DDPA: at its root it is a forward analysis. DDP [48] is more demand-driven in our sense in that it has a value lookup process that is explicitly goal-directed. Like Dubé and Feeley's analysis, it incorporates an alternating demand-driven and forward dataflow algorithm, at a smaller scale. DDP is focused on object-oriented programs and so does not address the non-locals issue of functional programs, and it is flow-insensitive. Also, DDP queries cannot share intermediary results the same way DDPA lookups do, because the DDP query that runs first may prune a subgoal that is distant from it, giving it an overapproximate solution that is trivially true, and precision on that subgoal may be essential to a later query.

CONCLUSION
In this article, we have developed DDPA for higher-order programs, extending ideas of first-order demand-driven analyses to higher-order functions. The primary novelty is the use of a separate non-locals stack that allows call-return alignment to strictly subsume polymorphism; previous higher-order demand-driven analyses did not align non-locals. DDPA has flow-, context-, and (limited) path-sensitivity as naturally emergent properties.
We believe DDPA shows promise primarily because it represents a significantly different approach compared with the existing large literature of higher-order program analyses. A high-level analogy can be made with eager and lazy programming languages: Which approach to take is a fundamental decision in language design, and there are significant tradeoffs. We believe the demand-driven side of higher-order program analyses deserves further exploration.
We have established soundness of DDPA and proved a polynomial-time bound on kDDPA. The abstract call stack must be restricted to at most k frames, unlike in first-order demand-driven analyses: We show that ωDDPAc, DDPA with an unbounded call stack, fully and faithfully implements the λ-calculus and so is Turing-complete.
We described our implementation of DDPA that uses a novel PDR automaton, a higher-order abstraction of a PDA that significantly improves the efficiency of variable value lookup. We gave a high-level specification of the implementation and presented benchmark results that show demand-driven analyses have very different tradeoffs compared to forward analyses, meaning they show promise for improving analysis expressiveness and performance. The implementation includes a broader feature set than the theoretical treatment, showing the methodology can scale.