
Bottom-Up Synthesis of Memory Mutations
with Separation Logic

Kasra Ferdowsi, University of California San Diego, CA, USA
Hila Peleg, Technion, Haifa, Israel
Abstract

Programming-by-Example (PBE) is the paradigm of program synthesis specified via input-output pairs. It is commonly used because examples are easy to provide and collect from the environment. A popular optimization for enumerative synthesis with examples is Observational Equivalence (OE), which groups programs into equivalence classes according to their evaluation on example inputs. Current formulations of OE, however, are severely limited by the assumption that the synthesizer’s target language contains only pure components with no side-effects, either enforcing this in their target language, or ignoring it, leading to an incorrect enumeration. This limits their ability to use realistic component sets.

We address this limitation by borrowing from Separation Logic, which can compositionally reason about heap mutations. We reformulate PBE using a restricted Separation Logic: Concrete Heap Separation Logic (CHSL), transforming the search for programs into a proof search in CHSL. This lets us perform bottom-up enumerative synthesis without the need for expert-provided annotations or domain-specific inferences, but with three key advantages: we (i) preserve correctness in the presence of memory-mutating operations, (ii) compact the search space by representing many concrete programs as one under CHSL, and (iii) perform a provably correct OE-reduction.

We present SObEq (Side-effects in OBservational EQuivalence), a bottom-up enumerative algorithm that, given a PBE task, searches for its CHSL derivation. The SObEq algorithm is proved correct with no purity assumptions: we show it is guaranteed to lose no solutions. We also evaluate our implementation of SObEq on benchmarks from the literature and online sources, and show that it produces high-quality results quickly.

Keywords and phrases:
Program synthesis, observational equivalence
Copyright and License:
© Kasra Ferdowsi and Hila Peleg; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Software and its engineering → Automatic programming
Related Version:
Full Version: https://zenodo.org/records/15271843
Supplementary Material:
Software  (Source Code): https://github.com/peleghila/sobeq
Acknowledgements:
The authors thank Nadia Polikarpova for her help with formalisms, and Ilya Sergey and Yotam M. Y. Feldman for reading many drafts of this work and asking very good questions.
Funding:
This paper was published with the support of the Israel Science Foundation (ISF) #651/23.
Supplementary Material:
Software  (ECOOP 2025 Artifact Evaluation approved artifact): https://doi.org/10.4230/DARTS.11.2.2
Editors:
Jonathan Aldrich and Alexandra Silva

1 Introduction

Program Synthesis is the task of automatically generating a program that satisfies a given specification. A popular specification modality for synthesizing general-purpose programs is input-output examples, where a solution is defined as a program that, evaluated on each provided input, produces the provided output. This is commonly referred to as Programming-by-Example (PBE) [23, 16, 7, 21, 26]. PBE is notable because, unlike logic-based specifications [49, 56, 29], it does not require specialized knowledge to use, and values for it can be collected from a programming environment. It has been used to develop synthesizers for both end-users [23, 60] and programmers [17, 16, 12].

Consider a synthesis specification for a JavaScript program where, given an array of integers arr of length n, we want to return the sum of the smallest and largest elements of the array. We might specify this problem with the example:

{arr ↦ [10,100,90,-1,2], n ↦ 5} → 99

If we assume the value in arr is disposable, e.g., the program may sort it in place, the following is a solution that we would like to synthesize:

arr.sort()[0] + arr[n-1]
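This candidate can be checked directly against the example. The snippet below is our own sketch of that check, not part of the paper; note that real JavaScript's default sort() compares elements as strings, so we pass a numeric comparator to match the numeric ordering the example assumes.

```javascript
// The target program wrapped as a function; sort((a, b) => a - b) forces
// numeric order, matching the sorted value [-1,2,10,90,100] in the example.
function solution(arr, n) {
  return arr.sort((a, b) => a - b)[0] + arr[n - 1];
}

console.log(solution([10, 100, 90, -1, 2], 5)); // -1 + 100 = 99
```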

To solve this synthesis task, we can select a set of components (operations, library functions, etc.) and search the space of all programs constructed from those components. There are many ways to search, including constructing a representation of the space that can be traversed [15, 39, 23] and encoding the problem for a solver [56, 30]. Such approaches place heavy constraints on which components can be included. But the simplest form of search is to enumerate the space of programs [13, 19, 25, 52] by applying the components, then test resulting programs. Because the space of programs is astronomical, this approach is prohibitively slow when implemented naively. Moreover, most mechanisms to prune the space are domain-specific and restricted by the manual effort for handling each component individually.

Figure 1: (a) Classical OE bank of programs with their observed evaluation results, used to enumerate pure programs, by increasing AST size:
  AST size 1: arr → [10,100,90,-1,2];  n → 5;  0 → 0;  1 → 1
  AST size 2: arr.sort() → [-1,2,10,90,100]
  AST size 3: arr[0] → 10;  arr.slice(n) → [];  n - 1 → 4
  AST size 5: n + arr[0] → 15;  arr[n - 1] → 2
Size 2 contains arr.sort() that sorts in-place, making compositions with it incorrect. (b) AST of the target program, divided into the variable states in which each subexpression is evaluated.

Bottom-up Enumeration.

One technique that does not require any domain-specific effort is bottom-up enumeration with an Observational Equivalence (OE) reduction [59, 2]. OE uses components to combine smaller programs into increasingly larger ones. To prune its search space, it maintains a bank of programs that are representatives of the equivalence classes of observationally equivalent programs, i.e., programs that, on the given inputs, evaluate to the same outputs.

New programs are obtained by applying components from the component set to smaller sub-expressions already in the bank. For example, in Figure 1(a) applying + to the representative of equivalence classes 5 and 10 will yield the program n + arr[0], which will be banked as the representative of equivalence class 15. The bank lets the enumeration memoize the evaluation results of each program, and use them when evaluating larger programs, rather than re-evaluating each full program from scratch. The space is pruned by keeping only one program per equivalence class. Despite discarding many programs, this ensures no solution that existed in the un-reduced space is lost. Previous works use variations of this technique in synthesizers for C [59], Python [17, 16], JavaScript [45], Java [22], Selenium [36], and OCaml [2, 40].

Enumeration with heap mutations.

The standard assumption of OE is that all components are pure and cause no side-effects [40, 17, 60, 59, 2, 36], and any mutations of the state in the target program are performed outside the synthesis task. Other synthesizers make assumptions on the intermediate states [16, 47], reducing the problem to synthesis of pure programs.

This is because OE has a fundamental limitation when handling mutating programs. Notice that JavaScript’s sort() works in-place. In the example above, the call to arr[n-1] evaluates to a different value if evaluated before arr.sort() versus after it. A classical OE enumerator that performs its OE-reduction based on variable values in the inputs would consider arr[n-1] to be in the equivalence class 2. However, unlike its equivalent in languages like Python, arr.sort() also returns a self-reference, allowing the result to be composed into larger expressions, as in our target program. If the enumeration itself does not account for this, it will use the wrong memoized value computed using the unsorted value of arr in its candidates, yielding a result that does not actually satisfy the specifications. Moreover, the enumeration may see another program whose value is really 2, discard arr[n-1], and never find this solution.
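The staleness of memoized values can be seen by evaluating the two orders directly. This is an illustrative snippet of ours, with a numeric comparator standing in for the numeric ordering assumed by the example:

```javascript
const arr = [10, 100, 90, -1, 2];
const n = 5;

const before = arr[n - 1];   // 2: the value a classical enumerator memoizes
arr.sort((a, b) => a - b);   // in-place mutation; also returns arr itself
const after = arr[n - 1];    // 100: the value the real composition observes

console.log(before, after);  // 2 100
```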

In other words, the classical bottom-up enumeration with OE described above is incorrect with memory mutations. This was also a problem in previous work: some synthesizers ignore the issue, leading to an incorrect enumeration [22, 45]. Peleg et al. [45], for example, allow JavaScript’s in-place sort(), breaking their own purity assumption.

The key question addressed by this work, then, is how to correctly enumerate mutating programs without losing solutions while maintaining enough pruning to make the process tractable.

The Problem with Mutations.

Classical OE considers only one memory state, the state both before and after evaluation of any expression. A composition with memory mutations, however, must consider a potentially different after-state. Figure 1(b) shows this: sub-programs after a mutation are evaluated in a modified before-state, the after-state of the mutation. Since composition by the synthesizer simulates evaluation, composition with a mutating expression requires the synthesizer to be aware of modified before-states. The synthesizer will need to generate complex expressions, e.g., arr[n-1] in Figure 1(b), in a modified state.

Naively, we could store alongside each program not only its evaluation results, as in classical OE, but also its before- and after-state. While this would correct the equivalence relation and ensure programs are not erroneously discarded, the enumeration would still need to enumerate programs over and over: the program n-1 that yields the result 4 would be enumerated both when arr is in its initial state and its modified state, and, as we do not know what the sequence of mutations in the target program will be (if any), in many other states as well. This is not only costly, as it creates many copies of the program in different equivalence classes, it is unnecessarily wasteful, since n-1 is independent of the value of arr. We need an alternative equivalence reduction that both composes valid evaluation sequences, and constructs a compact program space.

Our approach.

We introduce SObEq (Side-effects in OBservational EQuivalence), to our knowledge the first bottom-up proof-directed synthesis technique. SObEq is a bottom-up enumerative algorithm for expressions proven correct in the presence of mutating components.

A key insight of SObEq is that mutating operations only mutate a small portion of the state. If the variable state, usually considered as a variable valuation or a stack, is considered as a heap, then Separation Logic’s [51, 44] concept of local reasoning can be used to reason about programs in the synthesizer’s space. The Frame Rule gives us the tools for a compositional application of operations, in much the same way that bottom-up synthesis already does for pure programs. Thanks to the Frame Rule, a compact representation of programs suffices, rather than considering the full variable valuation every time.

We define Concrete Heap Separation Logic (CHSL), a representation of the operational semantics of the language in the style of Separation Logic. For example, in CHSL we describe the mutating program n++ evaluated on the example's input as the triple {n ↦ 5} n++ {n ↦ 6; 5}, where the precondition n ↦ 5 describes n's initial value, and the postcondition n ↦ 6; 5 describes its value after mutation and its result. Instead of searching for a program, SObEq searches for a CHSL derivation from inputs to outputs. Separation Logic has been used in program synthesis before [49, 29], including where information from the environment can be leveraged as an SL specification [20]. But previous work performed the proof search deductively, applying manually-crafted inference rules. The concrete heaps of CHSL allow SObEq to instead search the proof space bottom-up, composing larger programs from smaller ones, and relying only on program evaluation, like a classical OE search does. CHSL gives SObEq two key advantages over classical bottom-up synthesis.

Correct composition via logic.

First, SObEq’s bank stores CHSL triples rather than programs. To build larger programs, SObEq also applies each component to smaller triples in the bank, but unlike simple programs, triples can only be combined according to the rules of the logic. This means programs always compose in a way that forms a valid evaluation, and there is no risk of evaluating a program in the wrong state. Moreover, OE can be determined by simply ignoring the program and comparing the specifications surrounding each triple: two programs with the same pre- and postcondition are guaranteed to behave the same, and will also compose the same. This maintains SObEq’s correctness in the presence of mutations.

Local reasoning.

Second, thanks to the Frame Rule, CHSL can reason about the local effects of programs: triples do not need to store parts of the state that the program does not touch. For instance, {n ↦ 5} n++ {n ↦ 6; 5} does not specify anything about arr, because n++ will behave the same way in all concrete states where the value of n is 5. This representation is compact, with each triple describing the behavior in many, potentially infinitely many, concrete states. This makes it much easier for SObEq to find programs that use variables after a mutation occurs, like our example, compared to the naive solution of enumerating using the full before- and after-states. This local reasoning also allows us to prioritize more general programs: ones that provide the same result and the same mutations for more concrete states. Discarding the less general programs lets us compact the space even further.

Our implementation.

We implemented SObEq as a synthesizer for JavaScript programs that can create non-pure expressions, including sequences of expressions. We evaluated our synthesizer on 63 benchmarks curated from the literature, from StackOverflow and from competitive programming website LeetCode. Compared to state of the art synthesis of general mutations, SObEq returns more concise and higher quality solutions, with less overfitting and without performing spurious operations, on both tasks that require and do not require mutations. Importantly, SObEq is both deterministic, meaning it will more reliably integrate with a larger tool, and proved correct, meaning that it will never discard all solutions to a task while exploring the space.

Contributions.

The main contributions of this paper are as follows:

  • A formulation of the PBE problem with memory mutations as the search for a derivation in a restricted Separation Logic, CHSL.

  • The SObEq algorithm, a bottom-up enumerative algorithm that searches for a derivation with a proven-correct OE-reduction.

  • An implementation of SObEq in a JavaScript synthesizer, and an evaluation on 63 benchmarks curated from the literature, StackOverflow and LeetCode, which shows that SObEq outperforms state of the art on a variety of synthesis tasks with and without mutations, and highlights the class of tasks for which a naive approach would fail.

2 Overview

In this section, we present SObEq through the example introduced in Section 1. First, we present SObEq’s notation for mutating programs, which is compact compared to the real space of programs. Next, we show how this helps us enumerate programs correctly. Finally, we show how ordering programs by generality helps us prune the space even further.

2.1 Mutating programs

Figure 2: The precondition bank used to enumerate the program in Figure 1(b). Each program is additionally labeled with its postcondition comprising an assertion and its result.

We begin with a reminder of programs and their construction in Observational Equivalence [2, 59], then show how we modify them to correctly reason about side-effects.

Classical OE-reduction and the program bank.

Bottom-up enumeration with an OE-reduction uses a component set 𝒞 comprising functions, operators, variables, and literals, to search for a program specified by a vector of input-output examples, where each example is of the form ι → ω. For each program we compute its equivalence class's label. In classical OE, we use its evaluation result, p(ι), for each ι in the example vector. In our example, the expression arr has the observed behavior [10,100,90,-1,2] on the provided input, so its equivalence class is [10,100,90,-1,2].

To enumerate the space of programs, we use a bank to store representatives of equivalence classes, as seen in Figure 1(a). For each function or operator f ∈ 𝒞 with arity k, we collect all k-tuples of programs already in the bank and use them as arguments to apply f. This is usually done with some notion of iterations, e.g., as in Figure 1(a), by increasing number of AST nodes. When enumerating programs of AST size 5, then, the enumerator considers the component array dereference ([]), which has an arity of 2. Among pairs of programs collected from the bank with a total size of 4 will be the program arr, representative of the equivalence class [10,100,90,-1,2], and the program n - 1, representative of 4. Applying array dereference yields arr[n - 1] with the observed behavior 2. Crucially, we do not have to evaluate the full program arr[n - 1] to get this value: we can use the observed values [10,100,90,-1,2] and 4 and only compute the final step.

When the enumeration finds a program from an equivalence class with no representative in the bank – i.e., no program in the bank has the same label – it is added to the bank to be used when constructing larger programs. This is the case with arr[n - 1], which is added to the bank at size 5. Alternatively, if a representative of the equivalence class already exists in the bank, we discard the new program. If a program’s observed values are equal to the provided outputs, the program is a solution. Since we only use discovered equivalence classes (i.e., programs) to construct larger programs, the effect of OE’s pruning compounds.
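The classical loop just described can be sketched as follows. This is our own minimal rendering for pure components only, with a single example, integer literals, and + as the only operator; all names are ours, not the paper's.

```javascript
// Bottom-up enumeration with an OE-reduction, pure components only.
// bank: AST size -> representatives; seen: observed values already banked.
function enumerateOE(literals, target, maxSize) {
  const bank = new Map();
  const seen = new Set();
  const insert = (size, prog, value) => {
    if (seen.has(value)) return;                 // OE-reduction: one program per class
    seen.add(value);
    if (!bank.has(size)) bank.set(size, []);
    bank.get(size).push({ prog, value });
  };
  for (const v of literals) insert(1, String(v), v);
  // A '+' node costs 1, so size = 1 + leftSize + rightSize.
  for (let size = 3; size <= maxSize; size += 1) {
    for (let ls = 1; ls <= size - 2; ls += 1) {
      for (const a of bank.get(ls) ?? []) {
        for (const b of bank.get(size - 1 - ls) ?? []) {
          // Memoization: combine banked values instead of re-evaluating programs.
          insert(size, `${a.prog} + ${b.prog}`, a.value + b.value);
        }
      }
    }
  }
  for (const entries of bank.values())
    for (const e of entries) if (e.value === target) return e.prog;
  return null;
}

console.log(enumerateOE([1, 2], 4, 5)); // "2 + 2"
```

As the paper notes next, this loop is only correct because combining banked values mirrors real evaluation when no component mutates the state.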

The problems with side-effects.

We quickly run into issues when 𝒞 contains functions or operators that can cause side-effects. The next step in solving our example is to enumerate arr.sort()[0] + arr[n - 1]. But, as we are about to see, using the classical approach this program will not be evaluated correctly, and it may even be pruned away.

We already saw how the subprogram p1 = arr[n - 1] was enumerated and labeled 2, and p2 = arr.sort()[0] will be enumerated similarly: at size 4, it will be composed from arr.sort() (size 2) and 0 (size 1), and added to the bank as a representative of -1.

The first source of trouble is that OE is a dynamic programming algorithm: when constructing the larger p2 + p1, the enumerator uses the memoized observed values to compute -1 + 2 = 1. However, since p2 modifies arr, p2 + p1 actually evaluates to 99 on the input ι. Worse, this erroneous value means the synthesizer will consider p2 + p1, our solution, equivalent to p1 + p2, so it may be discarded. While a seemingly simple solution is to just evaluate the program in full each time, ensuring labels are correct, this is not only expensive, depending on the operation, but more importantly, insufficient. If the synthesizer sees the constant value 2, e.g., by enumerating 1 + 1, this would cause the larger – and seemingly equivalent – arr[n - 1] to be discarded. OE hinges on the enumeration only discarding a program when it already has a suitable replacement, but arr[n - 1] evaluated after arr.sort() would still have no equivalent program in the bank.

Rethinking the program representation.

Our problem was that we did not distinguish between instances of the same program evaluated on different values of arr. We also saw in Section 1 the need to distinguish programs that return the same value but whose effects on variables differ, e.g., n and n++ when evaluated on the same value of n.

We replace programs with triples in Concrete Heap Separation Logic (CHSL): the precondition records the variables before its evaluation, and the postcondition comprises an assertion recording the variables after evaluation and the result of the evaluation, like so:

{arr ↦ [10,100,90,-1,2]} arr.sort() {arr ↦ [-1,2,10,90,100]; [-1,2,10,90,100]}

From here on, we name the original array value arr_orig and its sorted value arr_sort, and use them in our examples and figures for brevity. In this representation, we can tell apart these two triples:

{arr ↦ arr_orig} arr {arr ↦ arr_orig; arr_orig}    {arr ↦ arr_sort} arr {arr ↦ arr_sort; arr_sort}

where the first is arr evaluated on the original array, returning the original array value, and the second is evaluated on a state where the array has been sorted, returning the sorted value. We use the second triple to construct larger programs evaluated after sort().

Notice that since all values are concrete, CHSL does not need domain- or library-specific inference rules to connect the pre- and postcondition of the triple, but can rely entirely on evaluation.

Local reasoning.

These pre- and postconditions do not encode fully-concrete states: they leave n unconstrained. This makes our representation of the space more compact: the triple {arr ↦ arr_orig} arr.sort() {arr ↦ arr_sort; arr_sort} takes advantage of the fact that constraining only arr is sufficient to reason about the behavior of this program. We only need to constrain the footprint of the program.

CHSL includes the notion of the separating conjunction ∗ to indicate separate parts of the heap (or, in our case, variable store) that can be reasoned about locally. Thanks to aliasing restrictions of the synthesizer's target language detailed in Section 3.1, any two variables can be separated with ∗. E.g., the assertion arr ↦ arr_orig ∗ n ↦ 5 describes the concrete initial state. Once we replace the programs in our bank with CHSL triples, we can further borrow from Separation Logic when considering evaluation sequences.

2.2 Enumeration and heaps

A bottom-up enumeration composes larger programs from smaller ones by selecting a sequence of arguments to apply functions or operators f to. In a classical enumeration, our bank contains programs, so if we wish to apply + to two programs in it, this is only a matter of creating the new program p1 + p2. When executing p1 + p2 on some variable valuation, the runtime first evaluates p1, then p2, then their addition. The classical OE enumeration mimics this by adding the evaluation result banked for p1 to that banked for p2. Without mutations, this is correct for all orderings of arguments to +, as all the programs evaluate from the initial state back to the initial state, so their results can always be memoized in the bank.

Now that our bank comprises CHSL triples, however, the enumeration needs to apply f to triples, not programs. Here, we have to be more careful. Let us consider two programs in our space, arr[0] and arr.sort()[0] enumerated with classical OE in Figure 1(a). In our new bank, they are now the triples:

tr1 = {arr ↦ arr_orig} arr[0] {arr ↦ arr_orig; 10}    tr2 = {arr ↦ arr_orig} arr.sort()[0] {arr ↦ arr_sort; -1}

First, we consider what applying + to a pair of triples means: applying it to (tr1,tr2) as arguments means trying to create arr[0] + arr.sort()[0], or applying + to the program components of the triples. However, unlike the mutation-free case, not every pair of triples with those two programs can create this application. We expect a sequence of arguments to a component to describe a real evaluation of those arguments.

This is akin to creating a valid composition or sequence in Separation Logic: the postcondition of each argument must be equal to the precondition of the next one in the sequence; this means all mutations are accounted for. For (tr1, tr2), this is not a problem: arr ↦ arr_orig is the after-state of tr1 and the before-state of tr2, which means that their sequence mutates the state from the precondition of tr1 to the postcondition of tr2, and they can be used in the application of any component, e.g., +, with the values memoized in the triple, so we add the two precomputed results 10 + (-1), and get:

{arr ↦ arr_orig} arr[0] + arr.sort()[0] {arr ↦ arr_sort; 9}

However, we cannot compose (tr2, tr1) in the same way: arr.sort() mutated the state, changing the value of arr to arr_sort, so tr1, whose memoized result was computed where arr is arr_orig, cannot follow it. Requiring arguments to a component to form a valid sequence ensures that only valid executions are enumerated by the synthesizer. The enumerator will not try to compose (tr2, tr1) under any binary operator, but can – and will – select (tr1, tr2) as arguments to +. Once a new program is composed, the enumerator invokes the language interpreter to evaluate the composition from the values of its arguments, yielding a triple containing the result and compound effect of the arguments and operation. We denote this invocation the Eval inference rule.
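Under simplifying assumptions of ours (flat stores represented as maps from variable names to values, a single example, and a numeric result per triple), the sequencing check could be sketched like this; all helper names are ours, not the paper's:

```javascript
// A CHSL-style triple: { pre, prog, post, result }, where pre/post are
// concrete stores (maps). Sequencing requires post of the first argument
// to equal pre of the second.
const eq = (a, b) => JSON.stringify(a) === JSON.stringify(b);

function compose(op, t1, t2) {
  if (!eq(t1.post, t2.pre)) return null;  // not a valid evaluation sequence
  return {
    pre: t1.pre,
    prog: `${t1.prog} ${op} ${t2.prog}`,
    post: t2.post,
    result: t1.result + t2.result,        // combine memoized results (Eval for '+')
  };
}

const orig = [10, 100, 90, -1, 2];
const sorted = [-1, 2, 10, 90, 100];
const tr1 = { pre: { arr: orig },  prog: "arr[0]",        post: { arr: orig },   result: 10 };
const tr2 = { pre: { arr: orig },  prog: "arr.sort()[0]", post: { arr: sorted }, result: -1 };

console.log(compose("+", tr1, tr2).result);  // 9
console.log(compose("+", tr2, tr1));         // null: tr1 cannot follow the mutation
```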

If we partition our bank of programs by the program’s precondition, as in Figure 2, then the enumerator can always create correct sequences of arguments by construction, using a triple’s postcondition to fetch all suitable next arguments according to their precondition.

Discovering more preconditions.

Let us consider the last subprogram of our target program: arr[n - 1]. It will be added to arr.sort()[0], which means that for that composition to be allowed, it must match the state after arr is sorted. Specifically, we consider the first argument of the array dereference, which will be needed to enumerate it:

tr3 = {arr ↦ arr_sort} arr {arr ↦ arr_sort; arr_sort}

Importantly, this triple will never be enumerated if the enumerator only considers triples starting at a precondition describing the initial state, but without it we will never find the target program.

We have to ensure, therefore, that we also enumerate programs that have other preconditions. tr3 is a necessary building block for such programs.

To this end, whenever the enumerator discovers a new assertion (e.g., when evaluating a mutating component creates a never-before-seen assertion in the postcondition, as enumerating {arr ↦ arr_orig} arr.sort() {arr ↦ arr_sort; arr_sort} does), we prepare to enumerate programs that begin in this new assertion: we create tr3, which any program starting at arr ↦ arr_sort will need to incorporate, and add it to the bank, which also adds arr ↦ arr_sort to the bank as a precondition. The next time the enumerator selects arguments to an operator, tr3 will be available. This lets the enumerator create programs that begin at arr ↦ arr_sort.

Programs with different footprints.

Now that we have tr3, the first argument to the array dereference, let us consider the second:

{n ↦ 5} n - 1 {n ↦ 5; 4}

We enumerate this program at the initial value of n, i.e., without the need for its precondition to be discovered. However, when the enumerator tries to compose it with tr3 into a sequence of arguments to [], it runs into a problem: the two triples do not make a valid sequence. The assertion in the postcondition of the first argument, arr ↦ arr_sort, is different from the precondition of the following argument, n ↦ 5.

However, we clearly see that these two assertions deal with separate parts of the heap, and do not interfere with each other. This is where we draw on Separation Logic’s powerful tool: the Frame rule. Since these are separate parts of the heap, we can use the separating conjunction to create a unified heap in the pre- and postcondition of both triples. The more constrained triples are now composable, so the enumerator can apply array dereference and Eval, yielding:

{arr ↦ arr_sort ∗ n ↦ 5} arr[n - 1] {arr ↦ arr_sort ∗ n ↦ 5; 100}

The full derivation is shown in Figure 3. Then similarly, the same extension – now adding n ↦ 5 to tr2 and nothing to the second argument – is applied to allow the enumerator to compose the target program. This is an important strength of our technique: the same triple for n - 1 can compose with any value for arr – we will only enumerate it once!
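The Frame step can be sketched in the same simplified setting as before (flat stores as maps; helper names ours): each triple is extended with the bindings it does not constrain, after which the two postconditions and preconditions line up and the dereference can be evaluated.

```javascript
// Frame: extend a triple's pre- and postcondition with bindings outside its
// footprint. A frame is untouched by the program, so it is added to both sides.
function frame(t, extra) {
  for (const k of Object.keys(extra))
    if (k in t.pre || k in t.post) return null;  // footprints must be disjoint
  return { ...t, pre: { ...t.pre, ...extra }, post: { ...t.post, ...extra } };
}

const sorted = [-1, 2, 10, 90, 100];
const tr3 = { pre: { arr: sorted }, prog: "arr",   post: { arr: sorted }, result: sorted };
const trN = { pre: { n: 5 },        prog: "n - 1", post: { n: 5 },        result: 4 };

const a = frame(tr3, { n: 5 });        // {arr ↦ arr_sort ∗ n ↦ 5} arr {...}
const b = frame(trN, { arr: sorted }); // {arr ↦ arr_sort ∗ n ↦ 5} n - 1 {...}

// a's after-state now matches b's before-state, so [] applies: arr[n - 1] → 100.
console.log(a.post.n, b.pre.n, a.result[b.result]); // 5 5 100
```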

We notice that in this case, this will discover a new assertion not as a postcondition as we did when sort() mutated arr but as a precondition. Once the program is added to the bank, the enumerator will continue to enumerate more programs with this new assertion as their precondition.

Figure 3: Using Frame and Eval to enumerate arr[n - 1]. Recall arr_sort denotes the constant value [-1,2,10,90,100] for brevity.

2.3 Equivalence and mutation

So far, we showed how the enumerator composes programs, but we did not show how the OE-reduction works. Now that we have a new representation of programs, we can re-define how each triple is observed and labeled, so that the label vector suits our needs.

In classical OE, the observed value for each example is the program's evaluation result, but this is now insufficient: while in the precondition arr ↦ [10,100,90,-1,2] the program arr[1] evaluates to 100, if we discard all following programs that evaluate to 100, we will also discard the arr[n - 1] that is evaluated on a sorted array. Likewise, n and n++ with the same precondition are distinct in that they return the same value given the same precondition, but have a different effect on the state.

Generally, we can label a triple with its pre- and postcondition, including the result. Once those are identical, any programs that form a valid triple with them are interchangeable, i.e., can be swapped for each other with no change to a larger evaluation. In Figure 2, the precondition component is expressed by the precondition the triple is stored under in the bank, while the postcondition appears above each program.

Generality of equivalence classes.

The equivalence class labels above define an equivalence relation for OE. However, in this setting, we can do even better: there are cases where we want to discard programs even though one is not equivalent to the other. Let us consider two triples:

tr4 = {n ↦ 5} n - n {n ↦ 5; 0}    tr5 = {emp} 0 {emp; 0}

where emp is the unconstrained heap.

Both triples have the same evaluation result (0), and the same effect (no effect), but different assertions at the pre- and postconditions. But they are not completely different: if we select a frame axiom R = n ↦ 5, we can bridge their difference: {emp ∗ R} 0 {emp ∗ R; 0} is equivalent to tr4. Under this condition, we say that tr5 is more general than tr4: it describes a program that behaves the same over more concrete states. If we let the more general equivalence class subsume less general ones, we further compact our representation of the space while remaining correct. While we cannot tractably apply this additional reduction to the entire space, we can at least discard less general programs once more general ones exist in the bank.
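A generality check along these lines can be sketched in our simplified map-based setting (names and representation ours): tGen subsumes tSpec if the results match and some frame R, the bindings tSpec constrains beyond tGen, extends tGen's pre- and postcondition to exactly tSpec's.

```javascript
// Does tGen subsume tSpec? The candidate frame R must be identical before and
// after, since a frame is untouched by the program.
function subsumes(tGen, tSpec) {
  const sameVal = (x, y) => JSON.stringify(x) === JSON.stringify(y);
  if (!sameVal(tGen.result, tSpec.result)) return false;
  const extendsTo = (small, big) =>
    Object.keys(small).every(k => sameVal(big[k], small[k]));
  if (!extendsTo(tGen.pre, tSpec.pre) || !extendsTo(tGen.post, tSpec.post)) return false;
  const rPre  = Object.fromEntries(Object.entries(tSpec.pre).filter(([k]) => !(k in tGen.pre)));
  const rPost = Object.fromEntries(Object.entries(tSpec.post).filter(([k]) => !(k in tGen.post)));
  return sameVal(rPre, rPost);
}

const tr4 = { pre: { n: 5 }, prog: "n - n", post: { n: 5 }, result: 0 };
const tr5 = { pre: {},       prog: "0",     post: {},       result: 0 };

console.log(subsumes(tr5, tr4)); // true: {emp ∗ n ↦ 5} 0 {emp ∗ n ↦ 5; 0} matches tr4
console.log(subsumes(tr4, tr5)); // false: tr4 is the less general of the two
```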

Temporary values and variables.

This tactic cannot stand on its own: naively employing it means tr6 = {emp} 5 {emp; 5} subsumes tr7 = {n ↦ 5} n {n ↦ 5; 5} in the same way. However, in Figure 2 similar subsumptions do not happen: {emp} -1 {emp; -1} does not subsume {arr ↦ arr_sort} arr[0] {arr ↦ arr_sort; -1} and {arr ↦ arr_orig ∗ n ↦ 5} arr[n - (1+1)] {arr ↦ arr_orig ∗ n ↦ 5; -1}, which are still in the bank. Why?

We notice that if these subsumptions are allowed, the programs will no longer be interchangeable, and some will break entirely. Consider the application of ++: while {n ↦ 5} n++ {n ↦ 6; 5} uses tr7 as its argument, 5++ using tr6 does not compile. Likewise, replacing arr within arr.sort() with an array literal holding arr's value would produce a different mutation of the heap, indicating they should be separate programs. The difference, which must be encoded into our representation, is whether a value is reachable from state variables, and if so, from where.

Our evaluation result therefore comprises two components: the value returned by evaluation, and the location of that value. tr3, then, returns arr_sort@arr, and tr7 returns 5@n. This ensures interchangeability within ++ or sort(): the new program will not only compile, it will affect the same variable.

What, then, should be the location of a temporary variable, i.e., a variable that is not reachable from the stack? We must consider the location's purpose: to separate programs enough to preserve interchangeability when the heap is mutated. Two temporary values, on the other hand, are inherently interchangeable, so we do not want to over-separate them, which would inflate the number of equivalence classes in the bank. We therefore give all temporary values the same location: ⊥. This way, tr4 and tr5 both return 0@⊥, preserving the subsumption, and tr6 returns 5@⊥, which separates it from tr7.

Figure 2 omits this for simplicity, but does still separate {emp} 1 {emp; 1} with result 1@⊥ from {arr ↦ arr_sort} arr[0] {arr ↦ arr_sort; 1} with result 1@arr[0] and from {arr ↦ arr_orig ∗ n ↦ 5} arr[n-(1+1)] {arr ↦ arr_orig ∗ n ↦ 5; 1} with result 1@arr[3].

Defining the Solution.

We are almost finished: like classical OE, we continue iteratively enumerating programs into the bank until a solution is found. Our final step is to reconsider the definition of a solution to a synthesis specification.

When enumerating a program in a non-mutating space with an OE-reduction using the examples' inputs, a solution is a program that is labeled by (i.e., that evaluates to) the provided outputs. In SObEq this is insufficient: the enumerator can find a triple returning the correct value in some precondition. To be correct when evaluated on the concrete example inputs, we require that a solution's precondition describe the concrete initial state ι. In our example, arr ↦ arr_orig ∗ n ↦ 5 trivially describes ι – it is equal to it – but so does n ↦ 5.

Thus, to be a solution, a triple must: 1) return the values of the example outputs, and 2) have a precondition that describes ι.

SObEq is, to our knowledge, the first proof-directed synthesis algorithm that works bottom-up. It is also the first bottom-up enumerative synthesizer for general mutating components with a correctness guarantee: if a solution exists in the space spanned by 𝒞, SObEq will find an equivalent solution.

3 Enumerative Synthesis and Heap Mutations

In this section we define the basic elements with which we construct SObEq: programs and their evaluation, and the synthesis task.

3.1 Concrete Heap Separation Logic

In this section, we introduce Concrete Heap Separation Logic (CHSL), a notation for operational semantics in the style of Separation Logic. Its syntax and two inference rules are shown in Figure 4.

Identifiers        x ∈ 𝒱 (input variables)
Literal values     v ∈ 𝒦
Assertions         a ∈ 𝒜 ::= emp | x ↦ v | 𝒜 ∗ 𝒜
Assertion vectors  a, P, Q ∈ 𝒜 × ⋯ × 𝒜
Locations          l ::= x | x[n] | ⊥,   where n is an integer literal
Results            r ::= v@l
Result vectors     r ∈ ℛ × ⋯ × ℛ
Postconditions     {Q; r}
  {P₁} p₁ {P₂; r₁}    {P₂} p₂ {P₃; r₂}    ⋯    {Pₖ} pₖ {Pₖ₊₁; rₖ}
  c ∈ 𝒞, arity(c) = k        ∀i. (c(r₁ᵢ, …, rₖᵢ), Pₖ₊₁ᵢ) ⇓ (rᵢ, Qᵢ)
  ─────────────────────────────────────────────────────────────── Eval
                     {P₁} c(p₁, …, pₖ) {Q; r}
          {P} p {Q; r}
  ─────────────────────────── Frame
     {P ∗ R} p {Q ∗ R; r}
Figure 4: CHSL syntax and inference rules.

The heap.

In CHSL, we use a heap notation to reason about the concrete values assigned to variables from the input. We can reason locally about each individual variable because SObEq's target language does not allow assignments, and so, as long as nothing is aliased at the input – we can enforce this when generating the synthesis task – each variable is separate from all others and can be separated from them with a separating conjunction, ∗.

CHSL assertions.

CHSL reasons about concrete values assigned to variables. As such, a CHSL assertion comprises heaplets of the form x ↦ v where v is a concrete value.

This also means we can trivially transform a state (partial function) σ into an assertion describing an identical heap, x₁ ↦ σ(x₁) ∗ ⋯ ∗ xₙ ↦ σ(xₙ) for all xᵢ ∈ dom(σ). We use the function interchangeably with its assertion equivalent, i.e., σ as a shorthand for the assertion above.

For a vector of states σ, corresponding to the vector of examples ℰ, CHSL likewise uses a vector of assertions. We therefore also define a vector separating conjunction s.t. P ∗ R = ⟨Pᵢ ∗ Rᵢ⟩ᵢ.

Values with location.

In the presence of mutations, we must also consider that some values are reachable from our heap. As shown in Section 2, this is important for components that modify their arguments (e.g. arr.sort()), especially components that return a self-reference and are therefore composable, as in arr.sort().reverse(), where both sort and reverse mutate arr.
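For instance, in JavaScript (the implementation's target language; the numeric comparator below is an illustrative assumption), both calls mutate the same array and return a self-reference, which is what makes them composable:

```javascript
// sort() and reverse() mutate the receiver *and* return it, so the second
// call applies a further mutation to the same array as the first.
const arr = [10, 100, 90, 1, 2];
const result = arr.sort((a, b) => a - b).reverse();

console.log(arr);            // [100, 90, 10, 2, 1]
console.log(result === arr); // true: both calls returned a self-reference
```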

To support this, a result r comprises two components: a value and that value's location l, which can be a variable, a variable at a concrete index, or ⊥. We use ⊥ to indicate a value not reachable from any variable, i.e., a temporary value. We write the result as r = v@l. As with variable values, the values in results are concrete.

Triples.

A triple in CHSL comprises a precondition P ∈ 𝒜ᵏ (where |ℰ| = k), the program p, and a postcondition. Our postconditions have two components: i) an assertion Q ∈ 𝒜ᵏ, and ii) the evaluation results of p on P. Vectors of assertions and results form a conjunction: {P} p {Q; r} means all triples {P₁} p {Q₁; r₁}, {P₂} p {Q₂; r₂}, …, {Pₖ} p {Qₖ; rₖ} hold.

We call the subset of each assertion Q whose values differ from P the effect of the program on P. For example, in the triple {x ↦ 1 ∗ y ↦ 2} x++ + y {x ↦ 2 ∗ y ↦ 2; 3}, the effect is x ↦ 2.
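Over map-shaped assertions, the effect is just the set of heaplets whose value changed. A minimal sketch (the map representation of assertions is an illustrative assumption, not the paper's implementation):

```javascript
// Effect of a program: the heaplets of postcondition Q whose values
// differ from precondition P (compared structurally via JSON).
const effect = (P, Q) =>
  Object.fromEntries(
    Object.entries(Q).filter(
      ([x, v]) => JSON.stringify(P[x]) !== JSON.stringify(v)));

// {x ↦ 1 ∗ y ↦ 2} x++ + y {x ↦ 2 ∗ y ↦ 2; 3} has effect {x ↦ 2}:
console.log(effect({ x: 1, y: 2 }, { x: 2, y: 2 })); // { x: 2 }
```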

We consider {P} p {Q; r} to be a valid triple if it accurately tracks the evaluation of p on P. In other words, assuming a (deterministic) small-step operational semantics ⇓ : (programs × 𝒜) → (ℛ × 𝒜), the behavior of which is defined by the interpreter of the language, we say {P} p {Q; r} is valid if ∀1 ≤ i ≤ k. (p, Pᵢ) ⇓ (rᵢ, Qᵢ).

Because SObEq will be enumerating triples, it is crucial that it can construct valid triples.

The Eval rule.

When constructing a new triple, the Eval rule denotes the use of the interpreter to ensure the resulting triple is valid. Because an evaluation of an AST node will first evaluate its children in sequence, then use the results to evaluate the node's operation, Eval requires a sequence of triples. As in the classical Hoare Logic Sequence rule, Eval requires matching midconditions. However, since we often want to form a sequence from triples where Qᵢ ≠ Pᵢ₊₁, we use the Frame rule to find a larger composed heap where the midconditions do match.

The Frame rule.

Generally, all triples in the space contain only their footprint in their pre- and postcondition. As the example in Figure 3 shows, in this compact representation midconditions may not match. We recall that the frame rule allows us to reason about a larger heap by equally extending the pre- and postcondition with an additional assertion called the frame axiom.

Fortunately, in CHSL, frame axioms are very easy to find. When preparing a pair of arguments for Eval, then, one of two things is true: either, for every i, Q₁ᵢ ∪ P₂ᵢ is still a partial function (i.e., if the same variable appears in both, it maps to the same value) and R₁ᵢ, R₂ᵢ are found by their difference, or the two triples cannot form a valid sequence.
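This frame-axiom search can be sketched over map-shaped assertions, for a single example index (the representation is an illustrative assumption):

```javascript
// Given postcondition Q1 of the first triple and precondition P2 of the
// second, either their union is still a partial function -- and the frame
// axioms are the two set differences -- or no valid sequence exists.
function frameAxioms(Q1, P2) {
  const eq = (a, b) => JSON.stringify(a) === JSON.stringify(b);
  for (const x of Object.keys(Q1))
    if (x in P2 && !eq(Q1[x], P2[x])) return null; // conflicting values
  const R1 = {}, R2 = {};
  for (const x of Object.keys(P2)) if (!(x in Q1)) R1[x] = P2[x];
  for (const x of Object.keys(Q1)) if (!(x in P2)) R2[x] = Q1[x];
  return { R1, R2 }; // now Q1 ∗ R1 = P2 ∗ R2
}

// {n ↦ 5} and {arr ↦ [1, 2]} compose with R1 = arr ↦ [1, 2], R2 = n ↦ 5:
console.log(frameAxioms({ n: 5 }, { arr: [1, 2] }));
// Conflicting values for n cannot be framed:
console.log(frameAxioms({ n: 5 }, { n: 6 })); // null
```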

Combining Eval and Frame.

When the SObEq enumeration applies a function or operator f to k child triples, it combines the two rules. Since not all valid sequences of possible arguments to f have matching midconditions, we apply the Frame rule to the entire sequence: given k = arity(f) triples trᵢ = {Pᵢ} pᵢ {Qᵢ; rᵢ}, Eval is applicable if there exist k frame axioms R₁, …, Rₖ where Q₁ ∗ R₁ = P₂ ∗ R₂, Q₂ ∗ R₂ = P₃ ∗ R₃, …, Qₖ₋₁ ∗ Rₖ₋₁ = Pₖ ∗ Rₖ. Some of these Rᵢ may be emp. Now that the midconditions of the triples match, Eval can be applied, yielding a triple that evaluates from the new shared precondition P₁ ∗ R₁, and computing the result and postcondition for f(p₁, …, pₖ).

3.2 Problem definition

Since SObEq uses CHSL triples to solve a PBE problem, we must re-state the PBE problem for CHSL. In this section we define what it means for a CHSL triple to represent our target program, define a translation from a PBE task, and explain how additional specifications, constraining the program’s effects on the state, can also be supported.

The Programming by Example (PBE) task is usually defined as follows: given a vector of input-output examples ℰ = ⟨ιᵢ → ωᵢ⟩ᵢ, find a program p over a component set 𝒞 where for every input ιᵢ, evaluating p on ιᵢ, denoted p(ιᵢ), is equal to the provided output ωᵢ. We first want to translate this into a goal of the shape {P} _ {Q; r}.

Examples in CHSL.

We denote the vector of input states I = ⟨ιᵢ | ιᵢ → ωᵢ ∈ ℰ⟩ᵢ. The direct translation of ℰ to CHSL, then, is {P} _ {⟨_⟩ᵢ; ⟨ωᵢ@_⟩ᵢ}, where ∃R. P ∗ R = I. In other words, the precondition needs to describe the initial states (it is a footprint, so it may not describe all input variables), any postcondition will satisfy our specification, and the correct values can be at any location.

Constraining effect.

Users may want to constrain specific mutations as well as the value. We let the user pair with each example any constraints on the effect, e.g., the value of n must be 9, or arr must end in its initial value [10,100,90,1,2]. To do this, we let the user provide an effect constraint q : 𝒱 ⇀ 𝒦, constraining the target value for any variable, where if x ∉ dom(q), x is unconstrained.

The SObEq task.

A PBE user does not need to know about CHSL. Much like the original ℰ, they provide a vector Φ where each φ ∈ Φ is a pair of an example and an effect constraint: φ = (ι → ω, q). If they are unconcerned with effects, this is identical to providing ℰ.

We then turn Φ into a synthesis goal in CHSL:

Definition 1 (SObEq synthesis goal).

Given a specification Φ, our goal 𝒢Φ is:

𝒢_Φ = {P} _ {⟨qᵢ ∗ _⟩ᵢ; ⟨ωᵢ@_⟩ᵢ}, where ∃R. P ∗ R = I

The SObEq task is to find a triple {σ} p {σ′; r} that matches 𝒢_Φ.
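As a sketch, matching a candidate triple against 𝒢_Φ can be checked per example. (The representations and field names below are illustrative assumptions, not the paper's implementation; `loc: 'arr[0]'` stands for the location arr[0].)

```javascript
// A triple matches the goal if, for every example i: its precondition
// footprint agrees with the input state (so some frame R completes it to I),
// its result value equals the expected output, and the final value of every
// constrained variable satisfies the effect constraint q.
function matchesGoal(triple, examples) {
  const eq = (a, b) => JSON.stringify(a) === JSON.stringify(b);
  return examples.every((ex, i) => {
    const P = triple.pre[i], Q = triple.post[i], r = triple.results[i];
    const describesInput = Object.keys(P).every(
      (x) => x in ex.input && eq(P[x], ex.input[x]));
    const rightValue = eq(r.value, ex.output);
    const rightEffect = Object.keys(ex.q).every(
      (x) => eq(x in Q ? Q[x] : ex.input[x], ex.q[x]));
    return describesInput && rightValue && rightEffect;
  });
}

const examples = [{ input: { arr: [3, 1], n: 2 }, output: 1, q: { arr: [1, 3] } }];
// A plausible triple for arr.sort()[0] on this example: sorts arr, returns 1.
const candidate = {
  pre: [{ arr: [3, 1] }],
  post: [{ arr: [1, 3] }],
  results: [{ value: 1, loc: 'arr[0]' }],
};
console.log(matchesGoal(candidate, examples)); // true
```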

3.3 The program space

Now that we have SObEq's specifications, we define the space in which it performs the search. We consider the component set 𝒞 as a union of three sets: 𝒱, lits, and ℱ, where x ∈ 𝒱 are the variables in I, l ∈ lits are the literals in the component set, and f ∈ ℱ are the functions and operators that can be applied to subprograms (arity(f) > 0). The full program space of a synthesizer with a component set 𝒞 is then every program using the components in 𝒞. We denote this program set 𝒫_𝒞. This set is infinite and highly redundant, e.g., it contains both x + 1 and 1 + x + 0, and in our mutating domain, both arr.sort() and arr.sort().sort().

Classical Observational Equivalence.

In a problem domain with no side-effects, observational equivalence unifies programs with the same evaluation results on the example inputs. This is expressed in the equivalence relation ≡ that deems two programs p₁, p₂ (observationally) equivalent if ∀ι → ω ∈ ℰ. p₁(ι) = p₂(ι). The OE-reduced space 𝒫_𝒞^OE is then defined using ≡. This idea was first suggested by Udupa et al. [59] and Albarghouthi et al. [2] and is in wide use today.

A classical bottom-up enumeration (CBE) keeps a bank of programs, and the vector of observed results is used to label each program's equivalence class. Only one program with each vector is added to the bank, meaning any observationally equivalent programs are discarded.
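A minimal sketch of CBE for a pure toy language – integer literals 0 and 1, a variable x, and + – where programs are keyed by their output vector on the example inputs (all names here are illustrative, not from the paper):

```javascript
// Classical bottom-up enumeration with an OE bank: only the first program
// per output-vector label is kept; equivalent programs are discarded.
function cbe(inputs, outputs, maxRounds = 4) {
  const bank = new Map();
  const add = (prog, results) => {
    const label = JSON.stringify(results);
    if (!bank.has(label)) bank.set(label, { prog, results });
  };
  // Seed the bank: literals 0 and 1, and the input variable x.
  add('0', inputs.map(() => 0));
  add('1', inputs.map(() => 1));
  add('x', inputs.map((i) => i));
  const goal = JSON.stringify(outputs);
  for (let round = 0; round < maxRounds; round++) {
    if (bank.has(goal)) break;
    const snapshot = [...bank.values()];
    for (const a of snapshot)
      for (const b of snapshot)
        add(`(${a.prog} + ${b.prog})`,
            a.results.map((v, i) => v + b.results[i]));
  }
  return bank.has(goal) ? bank.get(goal).prog : null;
}

console.log(cbe([1, 4], [3, 6])); // finds some program equivalent to x + 2
```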

A program space with mutations.

In the presence of mutations, the space of possible programs no longer comprises only programs evaluated in the initial states I. As we saw in Section 2, composition in this space requires subprograms that are evaluated in a modified state.

The full program space when ℱ includes mutations is constructed inductively, building larger valid triples from smaller ones via applications of Eval, and the Frame rule when necessary. We denote the possible values for each x ∈ 𝒱 as 𝒦ₓ. We define the full space of programs 𝒫_𝒞:

Definition 2 (Full program space).

We define the full program space 𝒫_𝒞 as ⋃_{i≥0} 𝒫_𝒞(i), where 𝒫_𝒞(i) is defined inductively as follows:

𝒫_𝒞(0) = { {emp} l {emp; v@⊥} | l ∈ lits, v ∈ 𝒦 is the value of l }
        ∪ { {σ} x {σ; ⟨vᵢ@x⟩ᵢ} | x ∈ 𝒱, σ = ⟨x ↦ vᵢ⟩ᵢ, v ∈ 𝒦ₓ × ⋯ × 𝒦ₓ }

𝒫_𝒞(n) = 𝒫_𝒞(n−1) ∪ { {P} f(p₁, …, pₖ) {Q; r} | f ∈ ℱ, arity(f) = k, {σᵢ} pᵢ {σᵢ′; rᵢ} ∈ 𝒫_𝒞(n−1), ∃R₁, …, Rₖ. Eval of f on {σᵢ ∗ Rᵢ} pᵢ {σᵢ′ ∗ Rᵢ; rᵢ} is {P} f(p₁, …, pₖ) {Q; r} }

Finally, given 𝒫_𝒞, we can define the OE-reduced space that SObEq will enumerate.

3.4 SObEq: observational equivalence with side effects

As we showed in Section 2, classical OE's ≡ is no longer useful for the program space in Definition 2. We need our equivalence relation to encompass every aspect of the goal 𝒢_Φ (Definition 1).

The generalized formalization of OE [45] defined two properties for equivalence: interchangeability – two equivalent programs can be used interchangeably within a larger program – and consistency – two equivalent programs will either both satisfy or both not satisfy the specification.

To satisfy consistency we must include everything that participates in Definition 1: the value returned by the expression, and the postcondition, which includes any constrained effect. We also need the state on which the program will be evaluated, to check whether it describes I.

As we saw in Section 2, to satisfy interchangeability we need the location of the value as well: while the literal [2,3,1] and a variable x whose valuation is [2,3,1] return the same value, they are not interchangeable under a larger program like ?.sort(), where they would lead to different postconditions. Moreover, to be able to swap a program in, its postcondition must be the same.

Combined, then, our equivalence relation simply compares the triple excluding the program:

Definition 3 (SObEq equivalence relation).

The equivalence relation ≡_Φ for Φ = ⟨(ιᵢ → ωᵢ, qᵢ)⟩ᵢ is

{σ₁} p₁ {σ₁′; r₁} ≡_Φ {σ₂} p₂ {σ₂′; r₂} ⟺ σ₁ = σ₂ ∧ σ₁′ = σ₂′ ∧ r₁ = r₂

Or in other words, equivalence does not look at the program, only at its specifications.

Subsumption of triples.

Finally, we consider the generality ordering of triples and (implicitly) of their equivalence classes, which will let us find more general programs.

In Section 2, we considered the two triples tr4 = {n ↦ 5} n - n {n ↦ 5; 0@⊥} and tr5 = {emp} 0 {emp; 0@⊥}. While tr4 and tr5 are not observationally equivalent under Definition 3, as emp ≠ n ↦ 5, tr5 is more general than tr4: it produces the same result (0@⊥) and the same (empty) effect from more concrete states. If tr5 can absorb tr4, we can arrive at a more compact representation of the space. To this end, we formally define this generality: the partial order of programs that differ by only a more general pre- and postcondition.

Definition 4 (Generality ordering of triples).

Given a specification Φ, we define

{P₁} p₁ {Q₁; r₁} ⊑_Φ {P₂} p₂ {Q₂; r₂} ⟺ ∃R. {P₁} p₁ {Q₁; r₁} ≡_Φ {P₂ ∗ R} p₂ {Q₂ ∗ R; r₂}
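The generality check is computable over map-shaped assertions: the candidate frame R is the set of heaplets the less general triple has and the more general one lacks, and R must be left unchanged by the program. A sketch (assuming each triple's pre and post share a footprint; `loc: null` stands for ⊥ – all representation choices are assumptions):

```javascript
// subsumes(general, specific) returns true iff specific ⊑ general, i.e.
// some frame axiom R extends general's pre- and postcondition to exactly
// specific's, with identical results.
function subsumes(general, specific) {
  const eq = (a, b) => JSON.stringify(a) === JSON.stringify(b);
  if (!eq(general.result, specific.result)) return false;
  for (const x of Object.keys(general.pre))
    if (!eq(general.pre[x], specific.pre[x])) return false;
  for (const x of Object.keys(general.post))
    if (!eq(general.post[x], specific.post[x])) return false;
  // The leftover heaplets form R; a true frame must be unchanged.
  for (const x of Object.keys(specific.pre))
    if (!(x in general.pre) && !eq(specific.pre[x], specific.post[x]))
      return false;
  return true;
}

// tr5 = {emp} 0 {emp; 0@⊥} subsumes tr4 = {n ↦ 5} n - n {n ↦ 5; 0@⊥}:
const tr4 = { pre: { n: 5 }, post: { n: 5 }, result: { value: 0, loc: null } };
const tr5 = { pre: {}, post: {}, result: { value: 0, loc: null } };
console.log(subsumes(tr5, tr4)); // true
console.log(subsumes(tr4, tr5)); // false: no frame shrinks a precondition
```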

I.e., the more general way to achieve the same result and effect is considered larger.

While ⊑_Φ is clearly not an equivalence relation (the existence of a frame axiom is not symmetrical), so it cannot be used to partition the space into classes, we notice that replacing a triple with one that is larger according to ⊑_Φ does preserve the two properties of an OE-relation, interchangeability and consistency, in one direction. Originally [45], interchangeability states that swapping two equivalent programs within a larger program will yield an equivalent larger program, and consistency states that two equivalent programs either both satisfy a given specification φ or neither does. Under ⊑_Φ, however, a slightly weaker – but still useful – version holds.

Interchangeability under ⊑_Φ means that given f ∈ ℱ with arity(f) = k, k triples trⱼ = {Pⱼ} pⱼ {Qⱼ; rⱼ}, 1 ≤ j ≤ k, and k assertions Rⱼ s.t. Eval can be applied on f and {Pⱼ ∗ Rⱼ} pⱼ {Qⱼ ∗ Rⱼ; rⱼ} to infer {P} f(p₁, …, pₖ) {Q; r}, and a triple trᵢ′ = {Pᵢ′} pᵢ′ {Qᵢ′; rᵢ′} s.t. trᵢ ⊑_Φ trᵢ′ for some 1 ≤ i ≤ k, then:

  1. There exists Rᵢ′ s.t. Eval can still be applied on f when replacing {Pᵢ ∗ Rᵢ} pᵢ {Qᵢ ∗ Rᵢ; rᵢ} with {Pᵢ′ ∗ Rᵢ′} pᵢ′ {Qᵢ′ ∗ Rᵢ′; rᵢ′} to infer {P′} f(p₁, …, pᵢ′, …, pₖ) {Q′; r′}, and

  2. {P} f(p₁, …, pₖ) {Q; r} ⊑_Φ {P′} f(p₁, …, pᵢ′, …, pₖ) {Q′; r′}

Consistency under ⊑_Φ means that for goal 𝒢, if {P₁} p₁ {Q₁; r₁} ⊑_Φ {P₂} p₂ {Q₂; r₂}, then if there exists R₁ s.t. {P₁ ∗ R₁} p₁ {Q₁ ∗ R₁; r₁} matches 𝒢, there exists R₂ s.t. {P₂ ∗ R₂} p₂ {Q₂ ∗ R₂; r₂} matches 𝒢. It is important to notice that the ⟺ of the equivalence version of consistency is now ⟹: if {P₁} p₁ {Q₁; r₁} is a solution, so is {P₂} p₂ {Q₂; r₂}, but since {P₂} p₂ {Q₂; r₂} is more general, P₂ might describe I even if P₁ did not.

Lemma 5.

Interchangeability and consistency are preserved under ⊑_Φ.

The proof of Lemma 5 is found in the extended version of the paper.

With the program space 𝒫_𝒞 and the SObEq equivalence relation and generality ordering, we are ready to search for a program, i.e., to construct the OE-reduced space via enumeration.

4 The SObEq Enumeration

In this section, we describe the construction of the OE-reduced space. We first describe the structure of the space, then the enumeration, and finally how the enumeration can find equivalence classes that, by virtue of their generality, subsume other equivalence classes.

Classical Bottom-Up Enumeration.

A bottom-up enumeration consists of applying each f ∈ ℱ to k-tuples of child programs from the bank of previously seen programs, where arity(f) = k. Each newly enumerated f(p₁, …, pₖ) is evaluated, and if an equivalence class labeled by its observed values is not yet represented in the bank, it is added to the bank. Enumeration is done according to some ordering, e.g., height, number of AST nodes, or probability score [7]. The bank is usually structured to facilitate easy retrieval of relevant child programs.

The precondition bank.

In our setting, we want to bank discovered triples according to their precondition: a bank section a contains all discovered programs whose precondition is a.

There are two reasons for this: first, when composing a k-tuple of arguments to f, finding a sequence that Eval is enabled for can be done constructively, without considering all possible k-tuples. For a sequence of length k-1 with postcondition assertion Qₖ₋₁, we can preemptively rule out assertion vectors a that assign different values to a variable appearing in both Qₖ₋₁ and a, and so will never form a valid sequence. In Algorithm 1 we call this construction CollectChildren.

Second, and even more importantly, this lets us track which assertions have been discovered by the enumeration. If the enumeration produces a triple {a} p {a′; r} where a or a′ has not yet been seen by the enumeration, then a new bank section will be initialized for a (or a′, resp.).

Initializing a new bank.

A new assertion a can be discovered by the enumeration in one of two ways: as a pre- or a postcondition. For example, the derivation in Figure 3 combines {n ↦ 5} n - 1 {n ↦ 5; 4} with {arr ↦ arr_sort} arr {arr ↦ arr_sort; arr_sort} into {arr ↦ arr_sort ∗ n ↦ 5} arr[n-1] {arr ↦ arr_sort ∗ n ↦ 5; 100}. Both the compound pre- and postcondition might be new for the enumeration, in which case they do not have a section in the bank.

If a is discovered as a precondition, we naturally add a section of the bank to hold the newly enumerated program. If a was discovered as a postcondition, then it may be the result of an effect, which means some of its atomic heaplets may not be in the bank yet. In this case, we create a section for a in the bank, but also add its atomic heaplets to the bank if they are new: we construct {aₓ} x {aₓ; rₓ} where aₓ = ⟨x ↦ aᵢ(x)⟩ᵢ and rₓ = ⟨aᵢ(x)@x⟩ᵢ for each x in a, and add it to the bank if aₓ is not yet in the bank. In Algorithm 1, this is collectively denoted as InitBank(a).
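A sketch of InitBank over a single example index, where the bank maps a JSON-serialized assertion to its section (all representation choices are assumptions, not the paper's implementation):

```javascript
// When assertion `a` is discovered, create its bank section, and seed a
// singleton triple {x ↦ a(x)} x {x ↦ a(x); a(x)@x} for each new heaplet.
function initBank(bank, a) {
  const key = JSON.stringify(a);
  if (!bank.has(key)) bank.set(key, []);
  for (const [x, v] of Object.entries(a)) {
    const ax = { [x]: v };
    const axKey = JSON.stringify(ax);
    if (!bank.has(axKey))
      bank.set(axKey,
        [{ pre: ax, prog: x, post: ax, result: { value: v, loc: x } }]);
  }
}

const bank = new Map();
initBank(bank, { arr: [1, 2, 10], n: 5 });
console.log([...bank.keys()]); // the compound assertion plus one per heaplet
```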

At the beginning of the enumeration, the bank is trivially initialized with the empty assertion emp for literals and with ⟨x ↦ ιᵢ(x)⟩ᵢ for each variable x ∈ 𝒱.

Enumerating terms.

The main enumeration is not very different from the CBE enumeration: the enumerator selects f ∈ ℱ with arity(f) = k, and composes as arguments for it all k-tuples of triples {Pᵢ} pᵢ {Qᵢ; rᵢ} already in the bank that Eval can be applied to.

The application of Eval then constructs a new triple {P} f(p₁, …, pₖ) {Q; r} that can then be observed: we use its pre- and postcondition assertions and its results as the label for its equivalence class. This label is searched for in the bank section for precondition P. If the bank for P has an equivalent program, the new triple is discarded; if no equivalent program was previously seen, it is added to the bank for P.

We also have at our disposal ⊑_Φ (Definition 4), which means we can reduce the space even further.

Using more general triples.

We want to use the ⊑_Φ relation to let more general triples subsume less general ones in the course of enumeration, even when they are not strictly equivalent. This can happen in one of two cases: (i) the less general triple is enumerated after the more general triple is already in the bank, or (ii) the more general triple is enumerated later.

Handling (i) is simple enough: when enumerating {σ} p {σ′; r}, even if there is no representative for its equivalence class, the enumerator checks whether some {σ₂} p₂ {σ₂′; r₂} s.t. {σ} p {σ′; r} ⊑_Φ {σ₂} p₂ {σ₂′; r₂} already exists in the bank, and if it does, discards {σ} p {σ′; r}. This is efficiently done by generalizing the assertions and searching the bank.

Handling (ii), however, is a larger feat. In theory, once we enumerate the more general triple {σ} p {σ′; r}, we could replace all less general triples {σ₂} p₂ {σ₂′; r₂} with {σ} p {σ′; r}, propagating this change to any compositions that use p₂. This would be prohibitively costly on its own, even before the implications of Lemma 5: the replacement can weaken the pre- and postconditions of compositions, changing their place in the bank. This cascade might even weaken the precondition of a triple that did not satisfy 𝒢_Φ enough for it to describe I and become a solution. As such, we elect to only incorporate (i) into SObEq.

4.1 The enumeration algorithm

Algorithm 1 The SObEq enumeration.

We can now compose the complete SObEq enumeration, seen in Algorithm 1.

Lines 1–2 of the algorithm initialize the enumeration by calling InitBank, described above, to create the basic programs for emp and for each of the variables’ initial values.

Line 4 selects the next component for application; components with an arity of 0 are handled exclusively by InitBank. Next, on line 5, CollectChildren constructs all k-tuples of children that Eval is enabled for. Since it is part of the construction process, CollectChildren returns each k-tuple of triples already updated with the frame axioms R₁, …, Rₖ, i.e., lifted to a unified composed heap.

Line 6 then applies Eval to construct the new triple. If the new triple is a solution to the task, it is returned on line 7. If not, line 8 checks whether an equivalent or subsuming program already exists in the bank via its equivalence class label: i.e., whether some existing key in the bank can be made equivalent to the current program via a frame axiom. For example, the program {n ↦ 5} n - n {n ↦ 5; 0@⊥} will be enumerated, and will then be tested twice: for R = emp, (n ↦ 5, n ↦ 5, 0@⊥) will be searched for in the bank, and for R = n ↦ 5, (emp, emp, 0@⊥) will be searched and found (for the program 0), so {n ↦ 5} n - n {n ↦ 5; 0@⊥} is discarded. A precondition a in the bank is deemed relevant if there exists a frame axiom R s.t. R both matches a to σ and all values in R are unchanged in Q (i.e., R truly is a frame). Every lookup of a relevant a in the bank uses the corresponding equivalence class label, with a as its precondition and with the frame axiom subtracted from its postcondition assertion. If no triple is found, the new triple is added to the bank. If either the pre- or postcondition assertion does not exist in the bank, InitBank is called on it to generate triples for its variables.
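The double lookup in this example can be sketched as follows (assertion and label representations are illustrative assumptions; `loc: null` stands for ⊥):

```javascript
// Before banking a new triple, search not only its own label but every
// label obtained by dropping a candidate frame R -- a subset of its
// heaplets left unchanged by the program -- from pre- and postcondition.
const omit = (obj, drop) => Object.fromEntries(
  Object.entries(obj).filter(([k]) => !drop.has(k)));

function* subsets(xs) {
  for (let m = 0; m < (1 << xs.length); m++)
    yield xs.filter((_, i) => (m >> i) & 1);
}

function lookupWithFrames(bank, triple) {
  const eq = (a, b) => JSON.stringify(a) === JSON.stringify(b);
  const frameable = Object.keys(triple.pre)
    .filter((x) => eq(triple.pre[x], triple.post[x])); // unchanged heaplets
  for (const mask of subsets(frameable)) { // includes the empty frame
    const drop = new Set(mask);
    const label = JSON.stringify({
      pre: omit(triple.pre, drop),
      post: omit(triple.post, drop),
      result: triple.result,
    });
    if (bank.has(label)) return bank.get(label);
  }
  return null;
}

const bank = new Map();
const zero = { pre: {}, post: {}, result: { value: 0, loc: null }, prog: '0' };
bank.set(JSON.stringify({ pre: zero.pre, post: zero.post, result: zero.result }), zero);

// Enumerating {n ↦ 5} n - n {n ↦ 5; 0@⊥} finds `0` via the frame R = n ↦ 5:
const candidate = { pre: { n: 5 }, post: { n: 5 }, result: { value: 0, loc: null } };
console.log(lookupWithFrames(bank, candidate).prog); // '0'
```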

SObEq as a base enumerator.

Many synthesizers [16, 45, 36, 5] use a base OE enumeration of expressions to synthesize additional language constructs, e.g., conditionals, higher-order functions, and assignments. These techniques are parameterizable in their base enumerator, and so could leverage SObEq to expand their target language.

Correctness of SObEq.

We show that Algorithm 1 is correct: that if a solution p exists in the unreduced space, SObEq’s enumeration will find a solution observationally equivalent to p or more general than p.

Theorem 6 (Correctness of SObEq).

If a valid solution {σ} p {σ′; r} to goal 𝒢_Φ exists in 𝒫_𝒞, then there exists {σ″} p′ {σ‴; r} that matches 𝒢_Φ in the SObEq-reduced space, and it is reachable by SObEq's enumeration.

In the extended version of the paper we prove Theorem 6 in two steps: first we define the (unreduced) space reachable via enumeration and show that any solution will still be in that space; then we apply the reduction of ⊑_Φ to the space, defining the SObEq-reduced program space 𝒫_𝒞^S, and prove that any solution will have an observationally equivalent, or more general under ⊑_Φ, solution in 𝒫_𝒞^S.

5 Implementation

We implemented a SObEq synthesizer for JavaScript programs in 3,700 lines of Scala code. The set of functions and operators ℱ contains 54 components for integers, strings, arrays, and sets, including nine mutating operations: postfix increment for integers, the array functions sort, reverse, splice, push, pop, shift, and the set functions add and delete. JavaScript strings are immutable, so ℱ includes no mutating components for strings. ℱ also includes a sequence operator for any two expressions. We allow a task to provide additional string constants for lits, as is common in benchmarks from the literature [4, 7] and interaction models [17, 16].

Preserving immutability.

Effect constraints in Φ can specify that a variable has not changed in the postcondition of the solution. But effect constraints cannot distinguish between, e.g., x and --(++x), because we only reason about the pre- and postcondition, not the path between them. This is generally desirable, as it marks --(++x) as redundant, but this will only happen when applying --. The user may want x to always be immutable.

This can be enforced by the enumeration. We add to our task specifications the ability to mark a variable as immut. Once the user specifies x as immutable, there is no reason to construct ++x at all: a triple {σ} p {σ′; r} where ⟨σᵢ(x)⟩ᵢ ≠ ⟨σᵢ′(x)⟩ᵢ is discarded immediately, and σ′ is not initialized in the bank, since any program with the precondition σ′ assumes a mutation of x.

Goals without an output.

Under CHSL it is possible to express a goal that has no specified output, only a specified effect. This is desirable when, e.g., specifying that a value is inserted into a list, but what the insertion returns is irrelevant. If a benchmark has no specified outputs, i.e., Φ = ⟨(ιᵢ, qᵢ)⟩ᵢ, we create the synthesis goal 𝒢_Φ = {P} _ {⟨qᵢ ∗ _⟩ᵢ; _}. The lack of an output does not change anything in how programs are enumerated, and the goal is matched in the same way.

6 Experimental Evaluation

We evaluate SObEq on our implementation. All our experiments used a server with an Intel Xeon Gold 6338 at 2 GHz (this processor has 32 cores/64 threads, but our implementation is single-threaded) and 128 GB of RAM, with the JVM maximum heap size set to 110 GB.

Research Questions.

We aim to answer the following research questions:

  RQ1. How does SObEq compare to the state-of-the-art in syntax-guided synthesis of heap mutations?

  RQ2. What is the overhead of SObEq compared to a classical OE enumeration?

  RQ3. How effective is SObEq at pruning the space of programs?

  RQ4. How does SObEq compare to enumerating with concrete states?

6.1 Benchmarks and baselines

We evaluated SObEq on a set of 63 benchmarks, curated from previous synthesizers [52, 7, 16, 22], programming exercises on LeetCode.com, and questions on StackOverflow.com. We discarded benchmarks that use types other than strings, arrays and sets, i.e., that would be inherently unsupported by our implementation. As SObEq supports a straight-line sequence of expressions, we also discarded benchmarks from Shi et al. [52] and Ferdowsifard et al. [16] if their solution used assignments to intermediate variables, loops, or conditionals.

We split the benchmarks into two sets. May Mutate: 45 benchmarks that have a non-mutating solution in 𝒫_𝒞. Must Mutate: 18 benchmarks that require mutation, i.e., benchmarks with a non-empty effect constraint or that cannot be solved without mutation in 𝒫_𝒞. Of these, five are benchmarks from May with a constrained effect, ruling out non-mutating solutions. For RQ2, we also create Pure: a version of May where all variables are marked as immut.

On average, May benchmarks have 2.8 examples (min 1, max 4) and 1.4 variables (min 1, max 3). Must benchmarks have 2.7 examples (min 1, max 5), 2.3 variables (min 1, max 4), and 1.2 variables specified in the effect (min 0, max 2). Five Must benchmarks have no specified outputs. More information about our benchmarks is provided in the extended version of the paper.

Selecting a baseline.

We considered several synthesizers as baselines for SObEq, ruling out synthesizers built for a single example [22], synthesizers that mutate by assigning synthesized pure expressions [16, 17, 59], and tools hinging on Java’s rich type system [15, 61]. FrAngel [52] is the current state of the art in synthesis of general programs with heap mutations: it is specified with input-output examples, it has the same ability as SObEq to declare a variable as immutable, and its stochastic search carries no additional requirements for specification other than the ability to evaluate – just as SObEq does. This makes it the most suitable baseline.

FrAngel as a baseline.

FrAngel targets Java collection implementations, which have many more methods than their JavaScript counterparts. This means FrAngel and SObEq can yield solutions that are entirely incomparable. To ensure an apples-to-apples comparison, we created new Java classes that mimic JavaScript types in Java, recreating the same 𝒞 used by SObEq’s implementation. We then defined our benchmark set over these types.

In addition, to force FrAngel to create a straight-line sequence of expression statements as SObEq does, we also removed loops, conditionals, and assignments from its enumeration. Notice that removing choices from its random sampling only improves FrAngel's odds of finding a solution; before this change, FrAngel's run times were considerably worse.

FrAngel also has the option to set inputs as immutable, which lets us encode immut variables, and an option to set a target value for an input, which we used to encode effect constraints in the benchmarks. This means FrAngel can run on every benchmark in SObEq's suite.

Selecting timeouts.

We consider two timeouts for our runs. First, we set our timeout at 60 minutes, extending the original 30-minute timeout used to evaluate FrAngel. However, since we aim for SObEq to be used in interactive systems, and by other OE-based systems as a base enumerator, we also consider results at an interactive timeout, i.e., a timeout that would let the synthesizer run in a user-facing tool without creating a disruptive interruption [43]. Building on recent interactive synthesizers, we choose 7 seconds as our interactive timeout.

6.2 RQ1: Comparison to FrAngel

We compare SObEq to the state-of-the-art FrAngel, using an equivalent component set as detailed in Section 6.1. FrAngel's algorithm is stochastic, so we ran it five times on each of the 63 benchmarks. We consider a benchmark solved if the synthesizer returns a solution that is correct on all examples. While some benchmarks from the literature contained a gold standard solution, we only used those as part of the analysis of result quality. We compare the quality of SObEq's solutions to FrAngel's on three counts: the size of the solution, whether it is overfitted, and whether it contains spurious operations. These are shown in Figure 5. In addition, we compare the run times of the two tools at both an interactive 7-second and a one-hour timeout.

Figure 5: RQ1: Comparing SObEq’s solutions to FrAngel’s on size (left, center) and quality (right). Solution sizes are per-benchmark: each SObEq solution has up to five corresponding FrAngel values. Above the line means SObEq is smaller. We consider both the final results (left) and solutions before postprocessing/cleaning steps (center). Solution quality (right) indicates the percentage of solutions with overfitted or spurious operations, determined by manual inspection (lower is better).

Size of the solution.

While the size of solutions is not, in itself, a good proxy for quality, it is telling in combination with overfitting and spurious operations. SObEq's postprocessed solutions had an average of 6.2 AST nodes at the interactive timeout and 6.9 at the 1-hour timeout (max 14 and 25, resp.), where Must benchmarks were generally smaller – 5.7 nodes at the interactive timeout and 6.1 in an hour – compared to May benchmarks – 6.4 and 7.1 nodes.

The left and center of Figure 5 show the AST size for each SObEq solution compared to each of FrAngel's solutions for the same benchmark. Overall, FrAngel's solutions are much larger than SObEq's. There are two benchmarks where SObEq's solution is considerably larger than FrAngel's. In Is All Positive? (originally from FrAngel), both SObEq's and FrAngel's solutions were overfitted. In Get first name from name with comma (originally from Probe [7]), FrAngel returned two overfitted solutions similar to SObEq's, both appearing closer to the diagonal line, and three times found a solution closer to the gold standard solution, which is shorter, though never the gold standard solution itself. (The two identical solutions and three identical solutions overlap in Figure 5.) An interesting trend in SObEq's solutions was that the enumeration order favored extracting postfix increment into a sequence: e.g., instead of buffer[idx++]++, idx, SObEq returns the larger but functionally identical buffer[idx]++, idx++, idx.

Overfitted solutions.

We define overfitted solutions as solutions that do not generalize beyond the provided examples. This was determined by manual inspection of both SObEq and FrAngel solutions, with reference to task descriptions and gold standard solutions, where they exist.

Two SObEq solutions were overfitted, and one additional solution was close to a gold standard solution for the task, but since all its inputs were of length 4, it replaced str.length in the gold standard solution with 4. Altogether, 3 of SObEq’s 54 solutions were overfitted (6%). Of 279 FrAngel solutions (up to five per benchmark, which sometimes differ), 47 were overfitted (17%). As the right of Figure 5 shows, both tools were more prone to overfitting on May benchmarks.

We present several examples of FrAngel’s overfitted solutions here.

For example, in the Max and Min benchmark, a simplified version of the example in Section 1, all four solutions (one run timed out) are overfitted and mutually different. Two of them are:

return arr.join(arr.join(Str(""))).replaceAll(Str(""), Str(""))
.concat(arr.join(Str(""))).indexOf(arr.slice(Int(0), Int(1)).join(Str("")));

where the result number (max plus min) is generated by indexOf, and

arr.reverse();
return arr.join(arr.sort().join(Str(""))).replaceAll(Str(""), Str("")).replace(
arr.join(arr.join(Str(""))).replaceAll(Str(""), Str("")), Str("")
).length();

where it is generated by the length of a string.

We also consider FrAngel’s results on the three benchmarks that SObEq overfitted on. For the Is All Positive? benchmark mentioned above for its large solutions, both SObEq and FrAngel overfit to the examples in two different ways. FrAngel overfit all five solutions, returning true only for arrays where the smallest element is 1, and SObEq looked for a ’-’ in an overfitted location in the printed array. For Get first name from name with comma, four of five FrAngel solutions were not overfitted. The fifth is:

return _arg_0.split(Str(",")).sort().slice(Int(1).subtract(_arg_1),_arg_1).join(Str(""));

where the input string is split into words but then sorted before extracting the 1-_arg_1th word. This happens to work because all example inputs have a decreasing lexicographic ordering of the words. In Negative Index Of, where SObEq overfit string lengths to 4, three of FrAngel’s solutions are not overfitted, but two rely on string < comparisons between variables and constants.

Spurious operations.

We define spurious operations as terms in the AST that do not contribute to reaching the target values: e.g., sorting an already-sorted list, mutating a temporary value that is then discarded, etc. Of the 279 FrAngel solutions, 37 had spurious operations (13%). We present a few examples of these here.

One of the solutions FrAngel found for the buffer-add benchmark was:

arr.splice(beg++, 1, toadd).reverse();
return end + 1;

Here, splice removes a sub-array from arr, replaces it with toadd, and returns it. The solution then calls reverse() on this temporary value, which is immediately discarded.
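The spuriousness of the call can be checked by running the snippet directly; a plain-JavaScript transcription (the wrapper function and harness are our assumptions, not the benchmark’s actual interface):

```javascript
// FrAngel's buffer-add solution, transcribed to plain JavaScript.
// splice(beg, 1, toadd) removes one element at index beg, inserts
// toadd in its place, and returns an array of the removed elements;
// reverse() then acts only on that discarded temporary array.
function bufferAdd(arr, beg, toadd, end) {
  arr.splice(beg++, 1, toadd).reverse(); // spurious .reverse()
  return end + 1;
}
```

Deleting the .reverse() call changes neither the returned value nor the final state of arr.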

Likewise, in three of the five solutions to rotate queue:

mod.equals(inp.push(inp.shift()));

where input variable mod is compared to the result of the target program (shifting the first element out of the array and pushing it back in at the end).
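The target behavior itself is the in-place rotation; a plain-JavaScript sketch of the intended program, without the spurious comparison (the wrapper function is ours):

```javascript
// Rotate a queue in place: shift the first element out and push it
// back in at the end, returning the mutated array.
function rotateQueue(inp) {
  inp.push(inp.shift());
  return inp;
}
```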

(a) Must benchmarks (interactive). (b) Must benchmarks (1 hour).
(c) May benchmarks (interactive). (d) May benchmarks (1 hour).
Figure 6: RQ1: The number of benchmarks SObEq and FrAngel solve over time. FrAngel is stochastic: median run for each benchmark shown by line with best and worst runs as range.

For set add-delete, which asks for an in-place mutation of the set seen, one solution is:

seen.delete(s2).compareTo(seen.add(s1).has(null));

where both has and compareTo are useless, computing a result that is not returned.

And, of course, this problem is not independent of overfitting; the only successful run of FrAngel for fill fibonacci window found this solution:

out.reverse().spliceReplace(
Int(1),
out.concat(out).length().subtract(
Str("").concat(out.join(Str(""))).lastIndexOf((new IntArray(Int(1))).join(Str("")))
) ).shift();

where reverse().spliceReplace(1,_) is a consistent, if odd, way to remove the first element and add something after the second element by swapping the elements and then replacing the (now) second one. The second argument to spliceReplace, which is expected to compute the next Fibonacci element, overfits to the examples, and shifting on the discarded return value of spliceReplace is spurious.

SObEq, in contrast, had no spurious operations. Some are avoided thanks to OE: FrAngel can draw programs that perform round-trip operations like reversing multiple times (3 of 5 solutions to the example from Section 1 do this) or sorting a sorted list. Enumerating with OE discards these, as they return the heap to a state it was already at and return the same value.
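One way to see how such round trips are discarded is to key equivalence on the resulting heap together with the return value, so a round trip collapses into the class of the program it started from. The following is only an illustrative sketch of that idea, not SObEq’s actual CHSL-based representation:

```javascript
// Illustrative OE key over (after-heap, value): run the candidate on a
// fresh copy of the example state and serialize both the mutated heap
// and the returned value.
function oeKey(program, exampleHeap) {
  const heap = JSON.parse(JSON.stringify(exampleHeap)); // copy per candidate
  const value = program(heap);
  return JSON.stringify({ heap, value });
}

// Reversing twice restores the heap and returns the same value as the
// program that never reversed, so the two keys collide and the larger
// program is discarded.
const plain = (h) => h.arr.length;
const roundTrip = (h) => { h.arr.reverse(); h.arr.reverse(); return h.arr.length; };
```

A single reverse, by contrast, leaves the heap in a new state and so gets a distinct key.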

Other spurious operations are avoided because of the implicit bias of a bottom-up search toward small programs: the program inp.push(inp.shift()) already satisfies the examples, so a bottom-up search would not try to compose it into a larger program that adds spurious operations. Likewise, the enumeration would try to compose arr.splice(beg++, 1, toadd) with end + 1 before trying it with the larger arr.splice(beg++, 1, toadd).reverse().

Runtimes.

We compare SObEq’s runtimes to FrAngel’s best, worst and median performance on each benchmark in Figure 6. At the interactive timeout, SObEq outperforms FrAngel for the majority of the 7 seconds. At the 7s limit, SObEq solves 45 benchmarks, 7 more than FrAngel’s median run; across runs, FrAngel solves between 34 and 45 benchmarks. We also see the main disadvantage of FrAngel, the variance introduced by its stochastic algorithm: a difference of 30% between its best and worst cases. At the 1-hour timeout, SObEq solves 54 benchmarks, compared to FrAngel’s 53–60 (median 55).

Figure 6 separates benchmarks into sets to ensure performance does not come at the expense of one set of benchmarks over the other, which is important for a general-purpose synthesizer. At the interactive timeout, SObEq performs on Must benchmarks as well as FrAngel’s best run, both solving 12 of 18 benchmarks, one more than FrAngel’s median run of 11. For May benchmarks, SObEq solves 33 of 45, while FrAngel solves between 25 and 33 (median 27). After an hour, SObEq outperforms FrAngel’s median run on Must benchmarks, solving 15 benchmarks to FrAngel’s 14, and is within FrAngel’s range for both benchmark sets.

To answer RQ1, SObEq returns higher-quality solutions than FrAngel at comparable runtimes, without the nondeterministic variance in both runtime and returned solution.

(a) Interactive timeout.
(b) 1-hour timeout.
Figure 7: RQ2: Comparison with CBE: time to termination for each Pure benchmark with SObEq compared to a classical bottom-up enumeration. Points at (timeout,timeout) not shown.

6.3 RQ2: Overhead of SObEq

A more general algorithm inevitably carries an overhead on a (now) special case compared to algorithms specialized for that case. This is often due to the additional work of an algorithm that makes fewer assumptions, but for OE it can also take the form of a more refined equivalence relation, which unifies fewer programs and thus creates a bigger space to explore. We wish to measure this overhead in order to quantify the cost of the changes made to support mutations in SObEq.

If 𝒞 has no mutating components, SObEq would not discover assertions outside the initial preconditions and their compositions. Such a search space is directly comparable to a classical bottom-up enumeration (CBE). This lets us measure the cost of the more fine-grained equivalence classes caused by using results, not just values, in equivalence class labels, and the additional overhead of creating a compound heap for sub-programs using different variables.

To measure the overhead, we restricted our 𝒞 to only non-mutating components, and compared its performance on Pure benchmarks to an enumerator that uses only the initial concrete state and uses the classical value-based equivalence relation for an OE-reduction. This is equivalent to classical OE with a height-based enumeration. By removing the mutating components for this evaluation, we avoid the correctness issues of classical OE while controlling for performance differences caused by having different grammars.
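The CBE baseline can be sketched as a standard height-based bottom-up loop whose bank is keyed only by the output values on the example inputs. The following is a toy version over pure arithmetic components; the component set is illustrative, not the evaluation’s actual 𝒞:

```javascript
// Toy classical bottom-up enumeration with value-based OE: keep one
// representative program per vector of outputs on the example inputs.
function enumerateCBE(inputs, maxHeight) {
  const bank = new Map(); // JSON of output vector -> { prog, outs }
  const add = (prog, outs) => {
    const key = JSON.stringify(outs);
    if (!bank.has(key)) bank.set(key, { prog, outs }); // OE-reduction
  };
  add('x', inputs);              // height 0: the input variable
  add('1', inputs.map(() => 1)); // height 0: a constant
  for (let h = 0; h < maxHeight; h++) {
    const snapshot = [...bank.values()]; // compose programs seen so far
    for (const a of snapshot)
      for (const b of snapshot) {
        add(`(${a.prog} + ${b.prog})`, a.outs.map((v, i) => v + b.outs[i]));
        add(`(${a.prog} * ${b.prog})`, a.outs.map((v, i) => v * b.outs[i]));
      }
  }
  return bank;
}
```

Because all components here are pure, the key needs no heap component; SObEq’s equivalence relation is finer, unifying fewer programs, which is exactly the overhead this experiment measures.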

(a) Interactive timeout.
(b) 1-hour timeout.
Figure 8: RQ3: Run times of disabling different pruning mechanisms within SObEq: any OE, OE outside the initial state, and discarding programs according to .

In this restricted context, CBE outperforms SObEq on most benchmarks (time comparison shown in Figure 7). At the interactive timeout, CBE solved the benchmarks an average of 1.52 times faster (median 1.5). CBE was also able to solve one additional benchmark, Max and Min, within the 7s timeout, whereas SObEq took 10.5s; it is therefore not shown in Figure 7. The biggest differences, where CBE was 2x faster or more (up to a maximum of 2.16 times), were all on benchmarks with runtimes under 130ms for both enumerators.

In the full hour, CBE solved the benchmarks an average of 1.48 times faster (median 1.5), and finished with a two-benchmark lead: there are three benchmarks CBE solved that SObEq did not, and one benchmark that SObEq solved and CBE did not.

For RQ2, we find that the overhead of SObEq’s more fine-grained equivalence classes and creating compound heaps is not negligible compared to CBE.

6.4 RQ3: SObEq’s effectiveness in pruning

To test SObEq’s effectiveness at pruning its space, we compare SObEq to two ablation enumerations: 1. No-OE: an enumeration that discards no programs, i.e., enumerates 𝒞reach. 2. Pure-OE: an enumeration that only performs OE on pure programs, i.e., fully enumerates the space outside preconditions that cover the initial state. Additionally, to measure the effect of subsumption, we add 3. Subsume-Post: an enumeration with the SObEq OE-reduction but no subsumption during enumeration, only as postprocessing of results. The results are shown in Figure 8.

It is, of course, unsurprising that No-OE performs poorly: with no reduction of the space, few programs are feasible, and this version solves only 28 benchmarks within 7s, diverging completely after 10 minutes. Pure-OE is an interesting case: testing whether it should prune a program requires computing its effect, an expensive operation that makes Pure-OE considerably slower. At the 1-hour limit, Subsume-Post does as well as SObEq. However, it lags slightly behind SObEq for most of its run, having solved one fewer benchmark at 7s.

In summary, both the application of OE to the entire space and the application of subsumption help SObEq’s performance, particularly in an interactive timeout.

Figure 9: Handling of programs seen in the space of each benchmark (percentages plotted, no. programs in text). Benchmarks where mutation cannot occur and timeouts plotted separately.

We also looked at what happens to the discovered programs across runs of SObEq, shown in Figure 9, specifically to gauge how many programs are discarded and how many are subsumed by a more general program that already exists in the bank. We separate out runs that time out and benchmarks where all input variables are strings, i.e., no mutation is possible. In the latter, the number of mutating programs pruned is trivially 0, but subsumption still takes place.

On (weighted) average, only 3% of programs seen by SObEq are added to the bank (med 14%, min 0.1%, max 42% of the programs). 66% on average are pure programs pruned by equivalence (med 72%, min 2%, max 99.98%), and another 30% are mutating programs pruned by equivalence (med 1.5%, min 0%, max 90%). Finally, an average 0.2% of the programs are pruned by subsumption (med 0.4%, min 0.00002%, max 8%). While the ratios vary, each of the pruning methods in SObEq is significant for some of the benchmarks.

6.5 RQ4: Enumerating with preconditions vs. concrete states

(a) Interactive timeout.
(b) 1-hour timeout.
(c) Per-benchmark comparison.
Figure 10: RQ4: Comparing an enumeration with preconditions to an enumeration with concrete states. Points at (timeout,timeout) not shown in (c).

Finally, we examine the effect of the local reasoning in SObEq’s enumeration. Composing partial heaps takes effort, but how costly is it, and does it allow more benchmarks to be solved? To test this, we created a version of SObEq where instead of footprints, preconditions are always full concrete states. We compare this to SObEq’s enumeration in Figure 10.

For most benchmarks, both enumerations perform similarly, with concrete states slightly faster. We expected to see this as an effect of the overhead of composition, but it is also caused by pure programs being banked in one main state, which changes the enumeration order favorably for May benchmarks. If we consider only Must benchmarks, SObEq overtakes the concrete state enumeration after 1.5s and stays ahead for the entire 1-hour run.

Three benchmarks are noteworthy in this experiment. The concrete state enumeration times out on our overview example (the benchmark Max and Min with n), while SObEq solves it in under 5s. Two more, partition k largest and get range loopy while, SObEq solves in 90ms and 4.6s, resp., whereas enumerating with concrete states takes 5min and 32min, resp. All three share the need to use multiple variables after one is mutated, which means subprograms already enumerated with the unmutated variable do not form a valid sequence and need to be re-enumerated.

To conclude, SObEq’s local reasoning is faster for benchmarks that need to mutate, and is not significantly slower for other benchmarks.

7 Related work

Syntax- and Component-Guided Synthesis.

Syntax-Guided Synthesis (SyGuS) [3] is a form of the synthesis problem that provides a specification along with a grammar defining the program space. It can also describe a form of synthesis where programs are constructed by syntax rules. The latter is essentially identical to Component-Based Synthesis, where the synthesizer is provided with a list of functions and operations spanning the program space. Both of these formulations are tackled by many synthesis strategies: enumerative [35, 52, 7, 14, 25, 22, 28, 54, 9, 62, 34], SMT-based [30, 6, 50, 1] and SAT-based [56, 58], and by traversing constructed representations of the space such as VSAs [31, 33, 63], version spaces [53] and Petri nets [15, 24].

Proof-directed synthesizers.

A specific form of enumerative synthesis attempts to find a proof for a proof goal, where the proof dictates a program. This idea is not new [57], and some deductive type-driven synthesizers describe their search as a proof search [24, 42]. Other works’ goals are more expressive than types: refinement types [48], relational specifications [32], select/update operators [41], Separation Logic, and even natural language [10]. Importantly, proof searches are overwhelmingly deductive, searching the proof space top-down. To our knowledge, SObEq is the first synthesis proof search that enumerates the proof space bottom-up.

Synthesis with Separation Logic.

Separation Logic proofs are a popular vehicle for proof-directed synthesis. They have been successfully used for parallelizing non-parallel code [8], but are most often used to synthesize heap-manipulating programs. SuSLik [49] performs a deductive search with inferences about the heap to generate programs performing pointer operations. Cypress [29] extends this with the ability to synthesize recursive auxiliaries, ROBoSuSLik [11] extends it with read-only borrows, and RusSOL [20] applies it to Rust programs, leveraging Rust’s type system to simplify the required specification from formal assertions to functional annotations of Rust functions. All of these accept assertions as their specification, hinge on manually-crafted inference rules, and employ solvers. SObEq, in contrast, works on input-output examples and relies only on the generic Eval rule in conjunction with an interpreter.

Observational equivalence.

Observational Equivalence [59, 2] is a popular way to reduce the size of the space when enumerating bottom-up. OE originally works on examples, but RESL [45] generalized it to other specifications on programs as well; this generalization also introduced the notions of interchangeability and consistency that SObEq’s correctness relies on. SyMetric [18] relaxes OE to observational similarity by choosing representatives via clustering instead of equivalence, then repairing the resulting program. While this compacts the space like SObEq’s subsumption, correctness of SyMetric’s solutions hinges on the success of its repair step.

Many of the synthesizers that use OE synthesize only pure programs [46, 7, 45, 2, 60, 40] (we include [45] in this list despite its misuse of JavaScript’s Array.sort); others employ synthesis with OE to produce pure expressions that are then used in assignments or other larger mutating expressions [37, 38, 59, 17, 16], creating a top-level program with memory mutations.

Arborist [36] and Lens [47] relax the purity assumption in interesting ways. Arborist tackles the specific problem of dependent loops (e.g., fold) where the composition of smaller programs discovers new inputs and changes the equivalence classes. Arborist uses finite tree automata to unify programs without discarding them, which allows it to iteratively break up equivalence classes. Arborist finds a solution if its iterative refinement terminates. Lens takes advantage of its target domain (straight-line assembly code) to treat operations as having no arguments, making sequences the only composition. In this representation, OE values are full register valuations, and synthesis of state-mutating programs reduces to the OE enumeration used for pure programs. Both hinge on the ability to represent equivalence classes without discarding any programs, implementing OE without a bank, which is only tractable under very strict assumptions on the set of components and on values.

Synthesis with heap mutations.

The problem of synthesizing mutating snippets has been worked around in many ways. SL proof-directed approaches assume a limited target language of mutations available to the proof. Sketch [56], Transit [59], SnipPy [17], Simpl [55] and Cozy [37, 38] assume expressions are pure, then use them in specific mutating contexts validated by other means (e.g., SMT). SyPet [15] and EdSynth [61] rely on types to generate programs, and only then evaluate them. Likewise, FrAngel [52] draws random programs, then evaluates them.

LooPy’s [16] problem statement bounds the intermediate mutations, and is to enumerate independent paths between these states. Similarly, Lens [47] reduces all registers to 3-bit and constructs the full space of intermediate states. Cobalt [41] also separates pure expressions from mutations using provided API annotations. Toshokan [27] encodes the specifications for mutating methods as two versions of the function, one for changes to the state and the other for the result, akin to the two parts of SObEq’s postcondition.

CodeHint [22] is, in some ways, the closest to our approach: it defines expressions as equivalent if they have the same value and side effects in the current state, or, in our terminology, an equivalence relation on both before- and after-states. How current states are used in enumeration is not defined in the paper, but their implementation uses programs enumerated on the initial state to populate their bank for new after-states, replicating the problem discussed in Section 2.1. Moreover, CodeHint’s focus is its interactivity, and its interaction is designed around a single example, with any additional information only filtering an initial list of programs, not informing the OE enumeration. This is why we ruled it out as a baseline in Section 6.1.

SObEq sheds the assumptions that expressions are pure and that intermediate states are known in advance. Moreover, SObEq has the formal guarantee of Theorem 6, unlike the two most general existing solutions: FrAngel, which randomly draws programs, and CodeHint, which performs OE only on its first example and handles mutated states in an unprincipled manner.

Other side effects, like file or database writes, have been a challenge in program synthesis. RbSyn [25] addresses database writes via a type-guided top-down enumeration from a syntax including effect holes that can be filled with annotated mutating methods; expressions not within effect holes carry a purity assumption, limiting how effects are composed, unlike SObEq, which can use mutating operations anywhere. CodeHint allows users to enable effects on the file system in their enumeration, even though these cannot be undone like memory mutations are.

References

  • [1] Alessandro Abate, Cristina David, Pascal Kesseli, Daniel Kroening, and Elizabeth Polgreen. Counterexample guided inductive synthesis modulo theories. In Hana Chockler and Georg Weissenbacher, editors, Computer Aided Verification, pages 270–288, Cham, 2018. Springer International Publishing. doi:10.1007/978-3-319-96145-3_15.
  • [2] Aws Albarghouthi, Sumit Gulwani, and Zachary Kincaid. Recursive program synthesis. In International conference on computer aided verification, pages 934–950. Springer, 2013. doi:10.1007/978-3-642-39799-8_67.
  • [3] Rajeev Alur, Rastislav Bodik, Garvit Juniwal, Milo MK Martin, Mukund Raghothaman, Sanjit A Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. Syntax-guided synthesis. Dependable Software Systems Engineering, 40:1–25, 2015. doi:10.3233/978-1-61499-495-4-1.
  • [4] Rajeev Alur, Dana Fisman, Rishabh Singh, and Armando Solar-Lezama. Sygus-comp 2017: Results and analysis. arXiv preprint, 2017. arXiv:1711.11438.
  • [5] Rajeev Alur, Arjun Radhakrishna, and Abhishek Udupa. Scaling enumerative program synthesis via divide and conquer. In Tools and Algorithms for the Construction and Analysis of Systems: 23rd International Conference, TACAS 2017, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2017, Uppsala, Sweden, April 22-29, 2017, Proceedings, Part I 23, pages 319–336. Springer, 2017. doi:10.1007/978-3-662-54577-5_18.
  • [6] Haniel Barbosa, Andrew Reynolds, Daniel Larraz, and Cesare Tinelli. Extending enumerative function synthesis via SMT-driven classification. In 2019 Formal Methods in Computer Aided Design (FMCAD), pages 212–220. IEEE, 2019. doi:10.23919/FMCAD.2019.8894267.
  • [7] Shraddha Barke, Hila Peleg, and Nadia Polikarpova. Just-in-time learning for bottom-up enumerative synthesis. Proceedings of the ACM on Programming Languages, 4(OOPSLA):1–29, 2020. doi:10.1145/3428295.
  • [8] Matko Botinčan, Mike Dodds, and Suresh Jagannathan. Proof-directed parallelization synthesis by separation logic. ACM Trans. Program. Lang. Syst., 35(2), July 2013. doi:10.1145/2491522.2491525.
  • [9] José Cambronero, Sumit Gulwani, Vu Le, Daniel Perelman, Arjun Radhakrishna, Clint Simon, and Ashish Tiwari. FlashFill++: Scaling programming by example by cutting to the chase. Proc. ACM Program. Lang., 7(POPL), January 2023. doi:10.1145/3571226.
  • [10] João Costa Seco, Jonathan Aldrich, Luís Carvalho, Bernardo Toninho, and Carla Ferreira. Derivations with holes for concept-based program synthesis. In Proceedings of the 2022 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pages 63–79, 2022. doi:10.1145/3563835.3567658.
  • [11] Andreea Costea, Amy Zhu, Nadia Polikarpova, and Ilya Sergey. Concise read-only specifications for better synthesis of programs with pointers. In Peter Müller, editor, Programming Languages and Systems, pages 141–168, Cham, 2020. Springer International Publishing. doi:10.1007/978-3-030-44914-8_6.
  • [12] Ian Drosos, Titus Barik, Philip J Guo, Robert DeLine, and Sumit Gulwani. Wrex: A unified programming-by-example interaction for synthesizing readable code for data scientists. In Proceedings of the 2020 CHI conference on human factors in computing systems, pages 1–12, 2020. doi:10.1145/3313831.3376442.
  • [13] Yu Feng, Ruben Martins, Osbert Bastani, and Isil Dillig. Program synthesis using conflict-driven learning. ACM SIGPLAN Notices, 53(4):420–435, 2018. doi:10.1145/3192366.3192382.
  • [14] Yu Feng, Ruben Martins, Jacob Van Geffen, Isil Dillig, and Swarat Chaudhuri. Component-based synthesis of table consolidation and transformation tasks from examples. ACM SIGPLAN Notices, 52(6):422–436, 2017.
  • [15] Yu Feng, Ruben Martins, Yuepeng Wang, Isil Dillig, and Thomas Reps. Component-based synthesis for complex APIs. In Proceedings of the 44th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2017, 2017.
  • [16] Kasra Ferdowsifard, Shraddha Barke, Hila Peleg, Sorin Lerner, and Nadia Polikarpova. LooPy: Interactive program synthesis with control structures. Proceedings of the ACM on Programming Languages, 5(OOPSLA), October 2021. doi:10.1145/3485530.
  • [17] Kasra Ferdowsifard, Allen Ordookhanians, Hila Peleg, Sorin Lerner, and Nadia Polikarpova. Small-step live programming by example. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pages 614–626. Association for Computing Machinery, 2020. doi:10.1145/3379337.3415869.
  • [18] Jack Feser, Işıl Dillig, and Armando Solar-Lezama. Inductive program synthesis guided by observational program similarity. Proceedings of the ACM on Programming Languages, 7(OOPSLA2):912–940, 2023. doi:10.1145/3622830.
  • [19] John K Feser, Swarat Chaudhuri, and Isil Dillig. Synthesizing data structure transformations from input-output examples. ACM SIGPLAN Notices, 50(6):229–239, 2015. doi:10.1145/2737924.2737977.
  • [20] Jonáš Fiala, Shachar Itzhaky, Peter Müller, Nadia Polikarpova, and Ilya Sergey. Leveraging Rust types for program synthesis. Proc. ACM Program. Lang., 7(PLDI), June 2023. doi:10.1145/3591278.
  • [21] Jonathan Frankle, Peter-Michael Osera, David Walker, and Steve Zdancewic. Example-directed synthesis: a type-theoretic interpretation. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. ACM, 2016.
  • [22] Joel Galenson, Philip Reames, Rastislav Bodik, Björn Hartmann, and Koushik Sen. CodeHint: Dynamic and interactive synthesis of code snippets. In Proceedings of the 36th International Conference on Software Engineering, pages 653–663, 2014.
  • [23] Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’11, pages 317–330, New York, NY, USA, 2011. Association for Computing Machinery. doi:10.1145/1926385.1926423.
  • [24] Zheng Guo, Michael James, David Justo, Jiaxiao Zhou, Ziteng Wang, Ranjit Jhala, and Nadia Polikarpova. Program synthesis by type-guided abstraction refinement. Proceedings of the ACM on Programming Languages, 4(POPL):1–28, 2019. doi:10.1145/3371080.
  • [25] Sankha Narayan Guria, Jeffrey S Foster, and David Van Horn. RbSyn: type-and effect-guided program synthesis. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pages 344–358, 2021. doi:10.1145/3453483.3454048.
  • [26] Sankha Narayan Guria, Jeffrey S Foster, and David Van Horn. Absynthe: Abstract interpretation-guided synthesis. Proceedings of the ACM on Programming Languages, 7(PLDI):1584–1607, 2023. doi:10.1145/3591285.
  • [27] Kangjing Huang and Xiaokang Qiu. Bootstrapping library-based synthesis. In International Static Analysis Symposium, pages 272–298. Springer, 2022. doi:10.1007/978-3-031-22308-2_13.
  • [28] Kangjing Huang, Xiaokang Qiu, Peiyuan Shen, and Yanjun Wang. Reconciling enumerative and deductive program synthesis. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2020, pages 1159–1174, New York, NY, USA, 2020. Association for Computing Machinery. doi:10.1145/3385412.3386027.
  • [29] Shachar Itzhaky, Hila Peleg, Nadia Polikarpova, Reuben N. S. Rowe, and Ilya Sergey. Cyclic program synthesis. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2021, pages 944–959, New York, NY, USA, 2021. Association for Computing Machinery. doi:10.1145/3453483.3454087.
  • [30] Susmit Jha, Sumit Gulwani, Sanjit A Seshia, and Ashish Tiwari. Oracle-guided component-based program synthesis. In 2010 ACM/IEEE 32nd International Conference on Software Engineering, volume 1, pages 215–224. IEEE, 2010. doi:10.1145/1806799.1806833.
  • [31] Ruyi Ji, Jingjing Liang, Yingfei Xiong, Lu Zhang, and Zhenjiang Hu. Question selection for interactive program synthesis. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 1143–1158, 2020. doi:10.1145/3385412.3386025.
  • [32] Etienne Kneuss, Ivan Kuraj, Viktor Kuncak, and Philippe Suter. Synthesis modulo recursive functions. In Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications, pages 407–426, 2013. doi:10.1145/2509136.2509555.
  • [33] Tessa A Lau, Pedro M Domingos, and Daniel S Weld. Version space algebra and its application to programming by demonstration. In ICML, pages 527–534, 2000.
  • [34] Woosuk Lee. Combining the top-down propagation and bottom-up enumeration for inductive program synthesis. Proceedings of the ACM on Programming Languages, 5(POPL):1–28, 2021. doi:10.1145/3434335.
  • [35] Woosuk Lee, Kihong Heo, Rajeev Alur, and Mayur Naik. Accelerating search-based program synthesis using learned probabilistic models. ACM SIGPLAN Notices, 53(4):436–449, 2018. doi:10.1145/3192366.3192410.
  • [36] Xiang Li, Xiangyu Zhou, Rui Dong, Yihong Zhang, and Xinyu Wang. Efficient bottom-up synthesis for programs with local variables. Proceedings of the ACM on Programming Languages, 8(POPL), 2024. doi:10.1145/3632894.
  • [37] Calvin Loncaric, Michael D Ernst, and Emina Torlak. Generalized data structure synthesis. In Proceedings of the 40th International Conference on Software Engineering, pages 958–968, 2018. doi:10.1145/3180155.3180211.
  • [38] Calvin Loncaric, Emina Torlak, and Michael D. Ernst. Fast synthesis of fast collections. In PLDI 2016: Proceedings of the ACM SIGPLAN 2016 Conference on Programming Language Design and Implementation, pages 355–368, Santa Barbara, CA, USA, June 2016. doi:10.1145/2908080.2908122.
  • [39] David Mandelin, Lin Xu, Rastislav Bodík, and Doug Kimelman. Jungloid mining: Helping to navigate the API jungle. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’05, pages 48–61, New York, NY, USA, 2005. Association for Computing Machinery. doi:10.1145/1065010.1065018.
  • [40] Anders Miltner, Adrian Trejo Nuñez, Ana Brendel, Swarat Chaudhuri, and Isil Dillig. Bottom-up synthesis of recursive functional programs using angelic execution. Proceedings of the ACM on Programming Languages, 6(POPL):1–29, 2022. doi:10.1145/3498682.
  • [41] Ashish Mishra and Suresh Jagannathan. Specification-guided component-based synthesis from effectful libraries. Proceedings of the ACM on Programming Languages, 6(OOPSLA2):616–645, 2022. doi:10.1145/3563310.
  • [42] Peter-Michael Osera and Steve Zdancewic. Type-and-example-directed program synthesis. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 619–630, 2015. doi:10.1145/2737924.2738007.
  • [43] Antti Oulasvirta and Pertti Saariluoma. Surviving task interruptions: Investigating the implications of long-term working memory theory. International Journal of Human-Computer Studies, 64(10):941–961, 2006. doi:10.1016/j.ijhcs.2006.04.006.
  • [44] Peter O’Hearn, John Reynolds, and Hongseok Yang. Local reasoning about programs that alter data structures. In Computer Science Logic: 15th International Workshop, CSL 2001, pages 1–19. Springer, 2001. doi:10.1007/3-540-44802-0_1.
  • [45] Hila Peleg, Roi Gabai, Shachar Itzhaky, and Eran Yahav. Programming with a read-eval-synth loop. Proceedings of the ACM on Programming Languages, 4(OOPSLA), November 2020. doi:10.1145/3428227.
  • [46] Hila Peleg and Nadia Polikarpova. Perfect is the enemy of good: Best-effort program synthesis. In 34th European Conference on Object-Oriented Programming (ECOOP 2020). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020.
  • [47] Phitchaya Mangpo Phothilimthana, Aditya Thakur, Rastislav Bodik, and Dinakar Dhurjati. Scaling up superoptimization. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, pages 297–310, 2016.
  • [48] Nadia Polikarpova, Ivan Kuraj, and Armando Solar-Lezama. Program synthesis from polymorphic refinement types. ACM SIGPLAN Notices, 51(6):522–538, 2016. doi:10.1145/2908080.2908093.
  • [49] Nadia Polikarpova and Ilya Sergey. Structuring the synthesis of heap-manipulating programs. Proceedings of the ACM on Programming Languages, 3(POPL), January 2019. doi:10.1145/3290385.
  • [50] Andrew Reynolds, Haniel Barbosa, Andres Nötzli, Clark Barrett, and Cesare Tinelli. CVC4SY: Smart and fast term enumeration for syntax-guided synthesis. In International Conference on Computer Aided Verification, pages 74–83. Springer, 2019. doi:10.1007/978-3-030-25543-5_5.
  • [51] John C. Reynolds. Separation logic: A logic for shared mutable data structures. In Proceedings 17th Annual IEEE Symposium on Logic in Computer Science, pages 55–74. IEEE, 2002. doi:10.1109/LICS.2002.1029817.
  • [52] Kensen Shi, Jacob Steinhardt, and Percy Liang. FrAngel: Component-based synthesis with control structures. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019. doi:10.1145/3290386.
  • [53] Xujie Si, Woosuk Lee, Richard Zhang, Aws Albarghouthi, Paraschos Koutris, and Mayur Naik. Syntax-guided synthesis of Datalog programs. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 515–527, 2018. doi:10.1145/3236024.3236034.
  • [54] Calvin Smith and Aws Albarghouthi. MapReduce program synthesis. ACM SIGPLAN Notices, 51(6):326–340, 2016. doi:10.1145/2908080.2908102.
  • [55] Sunbeom So and Hakjoo Oh. Synthesizing imperative programs from examples guided by static analysis. In International Static Analysis Symposium, pages 364–381. Springer, 2017. doi:10.1007/978-3-319-66706-5_18.
  • [56] Armando Solar-Lezama, Liviu Tancau, Rastislav Bodik, Sanjit Seshia, and Vijay Saraswat. Combinatorial sketching for finite programs. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 404–415, 2006.
  • [57] Jamie Stark and Andrew Ireland. Towards automatic imperative program synthesis through proof planning. In 14th IEEE International Conference on Automated Software Engineering, pages 44–51. IEEE, 1999. doi:10.1109/ASE.1999.802091.
  • [58] Yoshiki Takashima, Ruben Martins, Limin Jia, and Corina S. Păsăreanu. SyRust: Automatic testing of Rust libraries with semantic-aware program synthesis. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pages 899–913, 2021.
  • [59] Abhishek Udupa, Arun Raghavan, Jyotirmoy V. Deshmukh, Sela Mador-Haim, Milo M.K. Martin, and Rajeev Alur. TRANSIT: Specifying protocols with concolic snippets. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pages 287–296, New York, NY, USA, 2013. Association for Computing Machinery. doi:10.1145/2491956.2462174.
  • [60] Chenglong Wang, Alvin Cheung, and Rastislav Bodik. Synthesizing highly expressive SQL queries from input-output examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 452–466. ACM, 2017.
  • [61] Zijiang Yang, Jinru Hua, Kaiyuan Wang, and Sarfraz Khurshid. EdSynth: Synthesizing API sequences with conditionals and loops. In 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), pages 161–171. IEEE, 2018. doi:10.1109/ICST.2018.00025.
  • [62] Yongho Yoon, Woosuk Lee, and Kwangkeun Yi. Inductive program synthesis via iterative forward-backward abstract interpretation. Proceedings of the ACM on Programming Languages, 7(PLDI):1657–1681, 2023. doi:10.1145/3591288.
  • [63] Yongwei Yuan, Arjun Radhakrishna, and Roopsha Samanta. Trace-guided inductive synthesis of recursive functional programs. Proceedings of the ACM on Programming Languages, 7(PLDI):860–883, 2023. doi:10.1145/3591255.