Nondeterministic Asynchronous Dataflow in Isabelle/HOL
Abstract
We formalize nondeterministic asynchronous dataflow networks in Isabelle/HOL. Dataflow networks consist of operators that can communicate with the network, perform silent computations, and make nondeterministic choices. We represent operators using a shallow embedding as codatatypes. Using this representation, we define standard asynchronous dataflow primitives, including sequential and parallel composition and a feedback operator. These primitives adhere to a number of laws from the literature, which we prove by coinduction using weak bisimilarity as our equality. Although coinductive and nondeterministic, our model is executable via code extraction to Haskell.
Keywords and phrases: dataflow, verification, coinduction, Isabelle/HOL
2012 ACM Subject Classification: Security and privacy → Logic and verification; Computing methodologies → Distributed algorithms; Software and its engineering → Data flow languages
Acknowledgements: We thank Asta H. From for numerous discussions on the representation of operators that were instrumental in arriving at the end product presented in this paper, and Jan A. Bergstra, Cornelis A. Middelburg, and Gheorghe Stefanescu for discussing their axiom for merge with us and for encouraging words on the merits of formal verification. Thomas Hildebrandt pointed us to Milner’s classic on weak bisimilarity. The ITP reviewers provided helpful textual suggestions.
Funding: This work is supported by an Independent Research Fund Denmark Sapere Aude grant (3120-00057B, titled DISCOVER) and a Novo Nordisk Foundation start package grant (NNF20OC0063462).
1 Introduction
Data stream processing frameworks like Flink [12], Kafka [21], Spark Streaming [34], and Timely Dataflow [26] are widely used in industry to compute with and analyze streams of data at a large scale. With the long-term objective of formally verifying the correctness of programs expressed in such frameworks, we focus in this paper on their common algebraic dataflow foundation. The dataflow programming paradigm represents a streaming computation as a graph of operators, which are themselves data stream transformers.
We focus on asynchronous dataflow in which operators work autonomously without blocking the overall computation following the mantra “if I cannot make progress maybe someone else can.” Asynchronous computation is naturally nondeterministic: the order in which operators are invoked is not predetermined and might affect the overall computation’s behavior.
Bergstra et al. [3] provide an algebra of asynchronous dataflow operators and building blocks (including sequential and parallel composition and a feedback loop construct), axiomatize the properties of these operators, and give two instances satisfying the axioms. We develop a new instance in Isabelle/HOL using a shallow embedding as a coinductive datatype (short: codatatype). Our operators are essentially nondeterministic input-output machines with silent actions (Section 2). The coinductive representation allows us to keep the state implicit when composing operators. We use corecursion to define composition, feedback, and the various basic asynchronous operators of Bergstra et al. (Section 3). We use countable sets to represent nondeterminism, which simplifies our definition of composition and feedback operators.
We validate that our operators work as intended by proving all but one of Bergstra et al.’s axioms by coinduction (Section 4). In doing so, we use weak bisimilarity as the notion of operator equality. Some of the properties we prove deviate mildly from Bergstra et al.’s axioms due to a simplification we made in some operators’ definitions and the fact that Isabelle does not have empty types. (Both we and Bergstra et al. use types to index the inputs and outputs of an operator; Bergstra et al.’s empty types signify that an operator has no inputs or no outputs.) We show how to close this gap and obtain precisely Bergstra et al.’s axioms by working with the subtype of well-behaved operators (Section 5).
With the help of Isabelle’s code generator, we extract executable Haskell code from our formalization (Section 6). To this end, we represent countable sets as a quotient of lazy lists. This allows us to run operators, which is useful for testing. A particular challenge is making the union of a countable set of countable sets executable as the corresponding merge function on lazy lists. We conclude with discussions of related work (Section 7) and particular formalization aspects (Section 8). Our formalization is publicly available [31].
2 Operators as a Codatatype
Isabelle/HOL’s codatatypes [6] formalize potentially infinite objects such as infinite streams ( type) and lazy lists ( type). We introduce operators as:
This command introduces a new type constructor, (written postfix), with type parameters indexing the input ports, indexing the output ports, and representing the type of data items (or events) an operator processes. An operator will perform one of four actions corresponding to the four codatatype constructors: (1) read an event from a given input port and continue based on the read event, (2) write an event to a given output port and continue as specified, (3) execute an internal (silent) action and continue as specified, and (4) nondeterministically choose a continuation from a countable set (type ) of options. Operators are possibly infinite trees with nodes labeled by one of these four actions with branching over the read event and branching over the countable set of choices. For an operator to be finite, the operator must eventually reach the only leaf along all branches. Note that we use the subscript c to distinguish sets and operations on them from their countable set counterparts: and are separate types in Isabelle.
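To fix intuitions, the following Haskell sketch mirrors this codatatype. It is only an illustration under our own assumptions: the constructor and type names are hypothetical, and the countable set of choices is approximated by a lazy list. Haskell's laziness plays the role of coinduction, so values of this type may be infinitely deep trees.

-- A minimal sketch of the operator codatatype (hypothetical names).
data Op i o d
  = Read i (d -> Op i o d)   -- (1) read an event on input port i, continue with it
  | Write o d (Op i o d)     -- (2) write event d on output port o, then continue
  | Silent (Op i o d)        -- (3) execute an internal (silent) action
  | Choice [Op i o d]        -- (4) nondeterministic choice among continuations
  | End                      -- the only leaf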
The command also introduces the functions and that collect the sets of used ports as well as a map function on the ports . Note that an operator does not necessarily have to use all available ports. For example, an operator of type does not necessarily have an infinite number of input and output ports; rather, it uses some subset of naturally-numbered ports, given by and .
Isabelle requires all functions to be total. For recursive functions on datatypes, totality means termination. In contrast, corecursive functions, which produce elements of a codatatype, are guaranteed to be total if they are productive – that is, they always eventually produce a codatatype constructor. The command facilitates the definition of corecursive functions in which a codatatype constructor guards all corecursive calls and all call contexts consist of friendly functions [5], thereby maintaining productivity and totality.
For example, we define two uncommunicative operators using in Figure 1. The operator corecurses forever with the constructor. The operator corecurses forever with the (singleton) constructor as shown in the lemma . The actual definition is slightly more convoluted. The codatatype nests the corecursive occurrence of , in the countable set type , which means that any corecursive call guarded by the constructor must be applied using the map function on , i.e., the image function (written infix). To this end, we pull the corecursive call out of the singleton set. We also use similar transformations in some subsequent operators. To improve readability, we only present the user-friendly derived equations, such as , rather than the original command.
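On the hypothetical Haskell Op type sketched above, the two uncommunicative operators can be written directly; in Haskell no pulling of the corecursive call out of the singleton set is needed, which is why the sketch is shorter than the actual Isabelle definition.

-- Sketches of the two uncommunicative operators of Figure 1 (names are ours).
silentLoop :: Op i o d
silentLoop = Silent silentLoop       -- corecurses forever with the silent constructor

choiceLoop :: Op i o d
choiceLoop = Choice [choiceLoop]     -- corecurses forever with a singleton choice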
2.1 Operator Equivalences
The initial examples , , and represent operators that appear to be identical to an external observer examining their (lack of) communication. However, these operators are syntactically different, which prevents us from proving their equality using ’s coinduction principle. Syntactic equality proves too rigid to support more nuanced semantic reasoning.
This limitation leads us to Milner’s classic approach of using bisimilarity [25], which is an equivalence relation based on labeled transition systems (LTS). In an LTS, transitions are associated with labels that represent the performed actions. For our operators, these actions can be a read, a write, or a silent action (). Two operators are bisimilar if their corresponding transition systems can mutually simulate each other’s transitions. In this work, we formalize both strong bisimilarity, which requires matching transitions one-by-one in simulations, and weak bisimilarity, which allows one to abstract away finite sequences of silent actions in simulations.
In Figure 2, we define the datatype for labels and our LTS as the inductive predicate , where the first three introduction rules associate the , , and operator constructors with the appropriate labels. The constructor does not have a label on its own; it depends on the existence of a label from a made by an operator in its set of choices.
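As an illustration of these introduction rules, the following hedged Haskell sketch (on the hypothetical Op type from above, with label names of our own choosing) enumerates the labeled successors of an operator. Since the read rule branches over all possible data values, the sketch takes a list of candidate inputs as an extra parameter.

-- Labels and a one-step successor function (hypothetical names). A Choice node
-- has no label of its own: its steps are the steps of its alternatives.
data Label i o d = LRead i d | LWrite o d | LTau

steps :: [d] -> Op i o d -> [(Label i o d, Op i o d)]
steps ds (Read p k)    = [ (LRead p x, k x) | x <- ds ]
steps _  (Write p x k) = [ (LWrite p x, k) ]
steps _  (Silent k)    = [ (LTau, k) ]
steps ds (Choice ks)   = concatMap (steps ds) ks
steps _  End           = []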
Strong simulation follows from the definition (Figure 2) that takes a relation and two operators and , which must meet the following condition: whenever steps to with a label , there must exist a corresponding step from to some with the same label , such that relates and . Relation is a strong simulation if . Moreover, is a strong bisimulation when both and its converse are strong simulations. Strong bisimilarity is the largest strong bisimulation.
In Isabelle/HOL, a coinductive predicate is defined as the greatest fixed point of a corresponding predicate transformer on the predicate lattice, following the Knaster–Tarski theorem [27]. Coinductive predicates are convenient for defining strong bisimilarity (denoted with , Figure 2) because they provide a useful coinduction principle [1]. We reformulate and mildly strengthen this coinduction principle for , to improve readability and usability, by incorporating basic up-to techniques for strong bisimilarity (yielding bisimilarity up to reflexivity, up to symmetry, and up to bisimilarity) via the auxiliary inductive predicate . A coinduction proof that a coinductive predicate holds for given arguments thus involves exhibiting a witness relation that relates these arguments. Isabelle’s proof method [6] generates a canonical instantiation for when applied to a specific goal.
This principle is first used to demonstrate that indeed constitutes an equivalence relation, i.e., it is reflexive, symmetric, and transitive. The next proof by coinduction shows that , which is a consequence of the fact that neither operator can make a . Three more intricate examples are given in Figure 3. These operators output the natural number at port until they may eventually stop doing so. They illustrate our general technique for proving strong (and weak) bisimilarity: we derive elimination and introduction rules for specific operators, which allow us to prove the cases resulting from the application of the proof method, e.g., to prove . Note that neither nor is strongly bisimilar to , because can perform actions while the other two cannot. This leads us towards the notion of weak bisimilarity, which can abstract silent steps away.
To define weak bisimilarity, we first define what constitutes a weak step, denoted by in Figure 4. We use Isabelle’s notation: infix is the relation composition, the superscript denotes the reflexive-transitive closure, and the superscript the reflexive closure. A weak step consists of a finite, possibly empty sequence of steps, followed by a single extended step () and another sequence of steps. The extended step introduces reflexivity for actions, which enables operators that lack steps to simulate operators that include steps.
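In standard notation this reads as follows (a reconstruction consistent with the description above, not the verbatim Isabelle definition of Figure 4), where ; is relation composition, * the reflexive-transitive closure, and the first case realizes the reflexivity of the extended step for silent labels:

\[
  \Rightarrow_{l} \;=\;
  \begin{cases}
    (\xrightarrow{\tau})^{*} & \text{if } l = \tau,\\[3pt]
    (\xrightarrow{\tau})^{*} \;;\; \xrightarrow{\,l\,} \;;\; (\xrightarrow{\tau})^{*} & \text{otherwise.}
  \end{cases}
\]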
The definition of is similar to , but it uses a weak step in the implication’s conclusion. Specifically, for to be simulated by , every step taken by with label must be matched using a corresponding weak step by with the same label . The weak bisimulation coinductive predicate (denoted with ) is analogous to .
We show that is an equivalence relation by coinduction using wbisim_coinduct following Milner [25, p. 109]. We also show that strong bisimilarity implies weak bisimilarity: . Lastly, we have as an example of equivalence of operators in which one step of is simulated by a “reflexive step” of .
2.2 Trace Equivalence
The concept of an external observer analyzing communication channels is formalized through the notion of trace. More specifically, we adopt Jonsson’s view [19]: a trace is a potentially non-terminating, linearly ordered sequence of communication events produced by an operator. Hence, trace equivalence relates two operators if and only if they have the same set of traces.
In Figure 5, we define visible actions using the datatype , along with a function that transforms into . We then introduce the concept of an operator being finished, meaning that the operator can no longer produce any visible actions. This is formalized using the coinductive predicate (the prefix means we are in a weak setting that does not observe steps). Next, we define the traces of an operator using the coinductive predicate , which consists of two cases: the base case, which associates the empty lazy list () with the operator being finished, and the coinductive case, which prepends (using ) the visible action from a weak step to a trace of the operator after this weak step. Lastly, trace equivalence is the equality of the set of traces. As expected, we show that weak bisimilarity implies trace equivalence .
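To make the trace notion concrete, here is a hedged Haskell sketch on the hypothetical Op type from Section 2: it enumerates prefixes of traces up to a given number of steps, exploring nondeterministic choices exhaustively and sampling reads from a supplied list of candidate inputs. All names are ours, and the actual formalization works coinductively with lazy lists rather than with a depth bound.

-- Visible actions and bounded trace prefixes (hypothetical names). Silent
-- steps are skipped, mirroring the weak setting described above.
data VisAct i o d = Inp i d | Out o d deriving Show

tracePrefixes :: Int -> [d] -> Op i o d -> [[VisAct i o d]]
tracePrefixes 0 _  _  = [[]]
tracePrefixes n ds op = case op of
  Read p k    -> [ Inp p x : t | x <- ds, t <- tracePrefixes (n - 1) ds (k x) ]
  Write p x k -> [ Out p x : t | t <- tracePrefixes (n - 1) ds k ]
  Silent k    -> tracePrefixes (n - 1) ds k
  Choice []   -> [[]]                                   -- stuck: only the empty trace
  Choice ks   -> concatMap (tracePrefixes (n - 1) ds) ks -- a fair merge is needed for infinite choice sets
  End         -> [[]]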
2.3 Numeral Types for Ports
A typical choice for input/output port types is to use Isabelle/HOL’s existing numeral types , , , …which model sets of the respective finite cardinality. For example, an operator with two inputs and three outputs can be represented with the type .
Because all Isabelle/HOL types must be nonempty, the numeral type is peculiar: it is isomorphic to the type of natural numbers. Its “cardinality” is still , because in Isabelle’s standard library the cardinality of infinite sets is defined to be . This raises a challenge when working with operators that should have no inputs or outputs, such as operator of type or . Although these operators should not be capable of reading or writing data, they may use the infinitely many ports available in .
To address this limitation, we introduce a type class called , that defines a set of elements named . (A similar type class singling out a particular default value already existed in Isabelle, but working with a set results in better closure properties for sum types.) We instantiate the type class for by setting its to (the universal set containing all elements of the specified type). For other numeral types , , …, their instance defines . We also make the sum type an instance of by taking the union of the from both sides. Hence, the type , just as , has no non-default elements. The set of all non-default elements is . We introduce another type class where , as a subclass of . Naturally, we also provide instances of for and . Finally, when defining operators, we assume that both input and output ports belong to the and type classes and follow the convention to only use ports in (the countable set counterpart of ).
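The following Haskell sketch conveys the idea behind this type class; all names and instances are our own illustration. Each port type comes with a predicate identifying its default (unusable) ports: a nat-like type makes every port default, finite port types have none, and sum types combine the default sets of both sides.

-- Hedged sketch of the "default ports" type class idea (hypothetical names).
class DefaultPorts a where
  isDefault :: a -> Bool      -- membership in the set of default ports

-- A nat-like port type: every port is default, so no port is usable.
instance DefaultPorts Integer where
  isDefault _ = True

-- A finite two-element port type: no default ports, both ports are usable.
data Two = P0 | P1 deriving (Eq, Show, Enum, Bounded)
instance DefaultPorts Two where
  isDefault _ = False

-- Sum types take the union of the default sets of both sides.
instance (DefaultPorts a, DefaultPorts b) => DefaultPorts (Either a b) where
  isDefault = either isDefault isDefault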
3 Asynchronous Dataflow Operators
This section explains the set of operators of the network algebra for asynchronous dataflow. Following Bergstra et al. [3], they are divided into two categories. First, basic network algebra consists of three operators for structuring the network – sequential composition, parallel composition, and feedback – along with the identity and transposition constants. The second category includes additional asynchronous dataflow constants such as split, sink, merge, dummy source, asynchronous copy, and asynchronous equality test. Except for the sink, all operators are stateful, maintaining internal buffers to store communication data. These buffers function as intermediate first-in-first-out communication channels where data can reside indefinitely.
3.1 Buffer Infrastructure
Most of our definitions will include a buffer function (where is the type of the relevant ports). This buffer function represents the operator’s current state, e.g., what data has already been read (and on which ports) but not yet written. In the rest of the paper, we will make use of the following convenience functions on buffers: gives the first element, on the given port, in the buffer; dequeues this first element; enqueues an element to the buffer; (or infix ) concatenates the data contained in two buffers. Buffers act as queues: we add elements to the list’s end and remove from its start.
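A hedged Haskell sketch of this buffer infrastructure follows; the function names are ours, chosen only to echo the convenience functions just described. A buffer maps each port to a FIFO queue of data items.

import qualified Data.Map as Map

-- Buffers: per-port FIFO queues (hypothetical names).
type Buf p d = Map.Map p [d]

bhd :: Ord p => p -> Buf p d -> Maybe d              -- first element on a port
bhd p b = case Map.findWithDefault [] p b of
            (x : _) -> Just x
            []      -> Nothing

btl :: Ord p => p -> Buf p d -> Buf p d              -- dequeue the first element
btl p = Map.adjust (drop 1) p

benq :: Ord p => d -> p -> Buf p d -> Buf p d        -- enqueue at the queue's end
benq x p = Map.insertWith (flip (++)) p [x]

bapp :: Ord p => Buf p d -> Buf p d -> Buf p d       -- concatenate two buffers portwise
bapp = Map.unionWith (++)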
3.2 Network Algebra Operators
We present building blocks for defining arbitrary asynchronous dataflow operators. We recall the convention established in the previous section: only ports in can be used. The first two primitives are the identity and transposition operators, defined respectively as and . Their code equations are given in Figure 6.
The operator is a stream delayer: it can read data from any of its (non-default) input ports and enqueue it in its corresponding internal buffer; the operator can also dequeue data from any non-empty buffer and write data to the corresponding output port. The operator has type , i.e., it operates on a sum type for the input ports and transposes the ports in the output. For instance, the data read on some input port , where , is enqueued on the internal buffer of the transposition operator and may eventually be written on output port . We abbreviate the initial states of these operators (in which internal buffers are empty) by and respectively.
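Using the Op type from Section 2 and the buffer sketch above, the stream-delayer behavior of the identity operator can be sketched as follows. This is a simplification under our own assumptions: the list of usable ports is passed explicitly, and the names are ours.

-- Hedged sketch of the identity (stream delayer) operator: at any point it may
-- read on a usable port (enqueueing the value) or emit the head of a non-empty
-- buffer on the same port.
identityOp :: Ord p => [p] -> Buf p d -> Op p p d
identityOp ports buf = Choice (readSteps ++ writeSteps)
  where
    readSteps  = [ Read p (\x -> identityOp ports (benq x p buf)) | p <- ports ]
    writeSteps = [ Write p x (identityOp ports (btl p buf))
                 | p <- ports, Just x <- [bhd p buf] ]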
To combine operators, we define a general composition operator in Figure 7. To compose two operators, we must specify the wiring between them, as well as the buffer associated with this wiring. Given operators and , the wire is a partial function defining how ’s output ports are connected to ’s input ports.
The definition of uses the function and abbreviation. The function descends through constructors until it reaches a constructor. It thus computes all steps an operator can make (e.g., and ). It returns a countable set of operators: , , and are returned as singletons and for we obtain the union of applied to each operator in ; formally . The function filters the operators in to only keep those whose port either is not connected by the wire (i.e., they communicate with the external world) or the buffer on the port connected by the wire is not empty. This makes reading from the empty buffer an impossible action. The standard function returns the range of a partial function.
The operator proceeds by stepping one of the operators independently. In particular, either of the operators may be neglected and never take steps. This design is intentional, as it reflects the behavior of independent nodes in a decentralized network, where no global orchestrator ensures a fair or synchronized execution schedule. Moreover, the left operator is restricted to reading inputs from the external environment, while its outputs can be directed either to the wire’s buffers or to the external world, depending on the wiring. Conversely, the second operator can read either from the buffer or from the external environment, again based on the wiring, but its outputs are limited to the external world. Interactions with the wire’s buffers are recorded by as a silent action.
We consider two special cases for the wiring function. When , the two operators do not interact with each other and we thus obtain parallel composition (infix ): the sum types of input and output ports reflect this. When , the output ports of are connected with the input ports of so that all the data that may read must have been written by . Thus, we obtain sequential composition (infix ). Here, is applied to remove the sum types, since in this case we know that the sequential composition’s input ports are the input ports of the first operator, and its output ports are the output ports of the second operator. Moreover, we prove and .
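To illustrate the degenerate case with an empty wire, here is a hedged Haskell sketch of parallel composition on the Op type from Section 2 (no wiring, no buffers; all names are ours): either component may take the next step, and ports are tagged to record which side they belong to.

-- Simplified parallel composition (the Isabelle definition is an instance of
-- the general wired composition; this sketch ignores wires and buffers).
par :: Op i1 o1 d -> Op i2 o2 d -> Op (Either i1 i2) (Either o1 o2) d
par End q = mapPorts Right Right q          -- only the right component remains
par p End = mapPorts Left Left p            -- only the left component remains
par p q   = Choice [stepL p, stepR q]       -- step either side independently
  where
    stepL (Read i k)    = Read (Left i) (\x -> par (k x) q)
    stepL (Write o x k) = Write (Left o) x (par k q)
    stepL (Silent k)    = Silent (par k q)
    stepL (Choice ks)   = Choice [par k q | k <- ks]
    stepL End           = mapPorts Right Right q
    stepR (Read i k)    = Read (Right i) (\x -> par p (k x))
    stepR (Write o x k) = Write (Right o) x (par p k)
    stepR (Silent k)    = Silent (par p k)
    stepR (Choice ks)   = Choice [par p k | k <- ks]
    stepR End           = mapPorts Left Left p

-- Relabel the input and output ports of an operator (the map function on ports).
mapPorts :: (i -> i') -> (o -> o') -> Op i o d -> Op i' o' d
mapPorts f g (Read i k)    = Read (f i) (mapPorts f g . k)
mapPorts f g (Write o x k) = Write (g o) x (mapPorts f g k)
mapPorts f g (Silent k)    = Silent (mapPorts f g k)
mapPorts f g (Choice ks)   = Choice (map (mapPorts f g) ks)
mapPorts _ _ End           = End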
Similarly to , we define a general loop operator in Figure 8, which utilizes the function and is also parameterized by a wiring. The wiring determines whether data goes to/comes from the outside environment or the loop’s own buffer (which is both written to and read from). We give a standard wiring configuration using the sum type, where the left-side ports communicate with the external environment, and the right-side ports handle the looping-back mechanism. We call with this wiring instance the feedback operator (). Notably, ports in the set are excluded from looping-back, enabling the definition of feedback operators without any looping-back ports (e.g., when the sum’s right-side is the type). We show that has / lemmas similar to the ones for .
The composition and loop operators combine other operators. Thus it is useful to have congruence rules for them. Below, we show ’s and ’s congruence rules for weak bisimilarity. Similar rules also hold for strong bisimilarity as well as for the derived operators.
Now, we describe the remaining operators shown in Figure 9. The dummy source operator ( ! ) is defined as the sequential composition of with the . Since does not produce any output, the identity operator, in turn, also generates no output. In contrast, the sink operator () is capable of reading (and ignoring) data from all its usable input ports.
To informally describe the final operators, we consider a fixed usable port of type . The split operator () reads data from and nondeterministically buffers it either on the left () or on the right (). Eventually, it may send that input data out through the corresponding port ( or ). The merge operator () acts as the dual of by nondeterministically reading data from either the left () or the right () port and buffering it. may eventually send this input data out through port . The asynchronous copy operator () behaves similarly to (), but it buffers the data on both sides () and (). Finally, the equality test operator () resembles (). However, it operates with optional data and only outputs the data when the heads of both buffers are equal; otherwise, it outputs .
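For concreteness, here are hedged Haskell sketches of the merge and split operators on the Op type from Section 2, with single usable ports and plain lists as buffers; the names are ours and the actual formalization uses the buffer infrastructure of Section 3.1.

-- Merge: read from either input port into a FIFO buffer; when the buffer is
-- non-empty, its head may be emitted on the single output port.
mergeOp :: [d] -> Op (Either () ()) () d
mergeOp buf = Choice (readSteps ++ writeSteps)
  where
    readSteps  = [ Read (Left  ()) (\x -> mergeOp (buf ++ [x]))
                 , Read (Right ()) (\x -> mergeOp (buf ++ [x])) ]
    writeSteps = [ Write () x (mergeOp rest) | (x : rest) <- [buf] ]

-- Split: a read item is buffered nondeterministically on the left or on the
-- right and is eventually emitted on the corresponding output port.
splitOp :: ([d], [d]) -> Op () (Either () ()) d
splitOp (l, r) = Choice (readSteps ++ writeSteps)
  where
    readSteps  = [ Read () (\x -> splitOp (l ++ [x], r))
                 , Read () (\x -> splitOp (l, r ++ [x])) ]
    writeSteps = [ Write (Left  ()) x (splitOp (xs, r)) | (x : xs) <- [l] ]
              ++ [ Write (Right ()) x (splitOp (l, ys)) | (x : ys) <- [r] ]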
4 Asynchronous Dataflow Properties
We show that the axioms from Tables 1, 2, and 3 of Bergstra et al. [3], with the exception of axiom in their Table 3, are valid in our shallow embedding. We discuss insights from proving the axioms and explain formulation differences. To make referencing transparent, we use Bergstra et al.’s axiom numbering, but we merge their Tables 2 and 3 (which use the same axiom numbers) into a single table.
First, we establish a notation for sequentially composing an operator with identity: denotes and is . This pre-/post-composition is added in our formalization to certain lemmas to ensure that the operator exhibits the required asynchronous behavior. This allows for a more precise account of where these additions are needed. For instance, the process algebra model from Bergstra et al. [3] surrounds most operators by identities on both sides. Changes to axiom statements of this nature are highlighted in gray. In Section 5, we demonstrate how these differences can be eliminated through the introduction of a new type.
The first set of axioms is shown in Table 1. First, we identify some notable axioms. Axiom is the associativity of parallel composition and axiom is the associativity of sequential composition. Notice that for we need to map ports with the function so the sum types are properly reassociated. We also define its inverse function . Bergstra et al. mostly ignore such port mismatches; in a proof assistant more rigor is needed. Both parts of show that an operator sequentially composed with identity can absorb another identity. This follows directly from the auxiliary lemma. We call such identity-removing lemmas identity absorption lemmas.
The feedback operator axioms require additional assumptions to guarantee that arbitrary operators do not use default ports when looping-back (on the right). These assumptions use the inverse image function to capture the right inputs and outputs. Recall that the wiring of forces ports on the right to not loop back. Without these assumptions, the invariant that the ports on the right are the looping-back ones would be violated. Again, these modifications are highlighted and will be abstracted away in Section 5.
Table 2 is about the asynchronous equality test, merge, copy, split, the dummy source, and sink operators. Axioms , , , and can be seen as a recursive characterization of equality test, merge, copy, and split. The merge and split operators satisfy fewer axioms than equality test and copy. For example, first duplicates inputs and then performs an equality test on them in the subsequent operator; the composition is bisimilar to the identity operator. For the and , the axiom is not valid because nondeterministic split choices do not necessarily align with the nondeterministic merge order. A related reason prevented us from proving for . In fact, we proved that and are not weakly bisimilar as they make partial nondeterministic choices in different orders. We explain this in more detail at the end of the section. We do believe that these operators are trace equivalent, but do not have a formal proof of this statement.
To prove the axioms, we follow the approach outlined in Section 2.1. This involves deriving introduction and elimination rules, such as the ones in Figure 3, for each operator and using these rules to construct the required simulations generated by the coinduction. The coinduction proofs require generalizing the operator’s buffers, as starting from the empty buffers of the initial-state notations would eventually take us outside the bisimulation during the coinduction. Consequently, we must consider arbitrary buffers and establish an invariant between them. In most cases, this is straightforward. For example, in the generalization of axiom in Figure 10, the buffers from both sides are related using the function. In this figure, the left-side ports are of the numeral type , and the right-side ports are of type . Also, we reuse the symbols of the introduced operator notations to unambiguously identify the operators.
A common pattern in the proofs involves forward reasoning to move data through the buffers. For instance, consider one of the cases in the proof, where on the right-hand side of the weak bisimilarity produces an output from the buffer. Then the left-hand side must also produce the same output. However, it may be the case that the buffers and are empty, and the head of must traverse the diagram to generate the output. This results in a sequence of steps that move ’s head to and then to . The Isar proof language [32] offers fact chaining using , which is well-suited for this kind of transitive reasoning.
The axiom was a challenging one to prove as it required a non-trivial buffer invariant. We consider the diagram in Figure 11, and we assume that , and similarly for , , and . Furthermore, we assume and . To find the invariant, we analyze the left-to-right simulation first. Assume that the left-hand side does an equality test step with . This step can only be weakly simulated on the right-hand side if either or both and can make some sequences of tests to catch up with . However, there is no way to have a simulation if either or has tested more than . The sequence of channels and must have the same length as they are consumed together by , and symmetrically and . Thus, for every port , there exist natural numbers , such that . As and advance independently, we maintain as invariant. The already tested elements in and in can be ahead of the tested elements on the right-hand side of the equivalence ( and ). This means that is precisely the same as followed by the sequence of elements still to be pairwise tested in and .
The established invariant must also hold for the right-to-left simulation. We assume an equality test on the right-hand side in testing if equals . We must avoid testing too little on the left-hand side of the equivalence to keep the invariant. For example, if and , this new equality test with and will reduce the size of and by one; therefore, one equality test on the left-hand side must happen as well.
The axiom for the merge operator is . Intuitively, equating the two sides means that it does not matter in which order the data from the three input ports are merged: all interleavings should be allowed by either operator. However, using weak bisimilarity, this axiom turns out to be invalid. Note we include here the post-composition with identities after the last merge operator, so we can later lift the result to the type defined in Section 5. Assuming that the operators are weakly bisimilar, then for every weak step that one operator makes, there must exist a weak step that the other operator makes such that they remain equivalent. We now sketch how we derive a contradiction from this. Consider that (where we annotate operators with indices to distinguish different instances of the same operator) reads a value on port 1, the top input port in Figure 12, and is moved beyond but before (into the buffer of sequential composition). Then we consider all possible weak steps can make by reading value on port 1. On a high level, there are two separate cases. First, assume is read and then may be moved along the buffers but only before actually performs the merge. Then this operator can read any value on port 2 (the middle input port in Figure 12) and output it before . However, this cannot be replicated by the left-hand side because has already moved beyond , so would be enqueued after and cannot be output before . The second case is that is read and then moved along the buffers beyond . Then the left-hand side can read any value on port 3 (the bottom input port in Figure 12) and output it before . However, this cannot be replicated by the right-hand side because has already moved beyond , so cannot be output before .
The main challenge we experienced in disproving weak bisimilarity is dealing with weak steps in the assumptions. Weak steps are significantly more difficult to reason about than single steps, because we do not have simple elimination rules for them, as any number of silent steps may occur before and after the visible action. We can use induction on these sequences of silent steps, but it is generally difficult to use the induction hypotheses because they typically require making buffers arbitrary, which was not practical for disproving . Our workaround was to explicitly obtain all possible states of the operators after a weak step, which was manageable here because we know exactly what the operators are and we start from empty buffers.
5 Well-Behaved Operators
We outline the process for eliminating the differences in our lemmas compared to Bergstra et al. [3]: the gray highlights in Table 1 and Table 2. To achieve this, we define a subtype of well-behaved operators using Isabelle’s command. The new type requires its elements to be weakly bisimilar to some operator surrounded by and .
Next, we lift every definition to this new type. This includes all operators, weak bisimilarity, and . The lifting is done by the command [18], which requires a proof that the lifted term respects the subtype’s property. When an operator has identity absorption lemmas, it can be lifted without changes; for example, can be lifted directly as it satisfies . Similarly, , , and are lifted directly too; they satisfy , , and .
Other operators such as are lifted with no changes as well because they also have identity absorption lemmas for both sides: . We lift the definitions of ! and using the type class constraint to ensure that they have no usable ports for their inputs and outputs, respectively. This guarantees their identity absorption lemmas. In contrast, the operators and satisfy only weaker one-sided absorption lemmas: , . Hence, to make the subtype constraint hold, we lift and instead of and .
The lifting of requires to hold. This is not true in general, but we can impose sufficient constraints on and . They must maintain the exclusive use of ports in the set (i.e., ). The same property must also hold for the inverse of and because in the bisimilarity proof their inverses are necessary to construct the respective simulation. In addition, this lemma further requires that and and their inverses are injective on . Fortunately, all usages of in all tables meet these conditions. Lastly, the highlighted assumptions from Table 1 are also eliminated for the lifted operators because they are weakly bisimilar to for some , the ports of are in , and weakly bisimilar operators have the same ports.
6 Code Generation
Our operators are infinitary in two respects: they may branch over a countably infinite set of choices and as codatatypes, they may be trees of infinite depth. We use code generation to Haskell [17] to execute operators lazily and produce prefixes of their traces up to a given depth.
A particular challenge for code generation is our usage of the abstract type of countable sets, which is originally defined as a subtype of arbitrary sets. The default code generation setup for sets supports finite and cofinite sets, but not the countable sets we are interested in. Instead, we provide a new setup for countable sets that recasts them as quotients [20] of the lazy list type modulo reordering of elements. For this purpose, we define the abstraction function which is used by the code generator as the only pseudo-constructor of countable sets: a standard way to provide code generation for quotients [16]. Then, operations on countable sets can be made executable by providing a code equation that pattern-matches on this pseudo-constructor and performs the corresponding operation on the underlying lazy lists. For example, countable set union becomes lazy list interleaving:
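A hedged Haskell rendering of that code equation's content follows; the actual Isabelle code equation pattern-matches on the pseudo-constructor, so we only show the underlying lazy-list operation.

-- Union of two countable sets represented by (possibly infinite) lazy lists:
-- interleaving guarantees that every element of either list occurs at a
-- finite position in the result.
interleave :: [a] -> [a] -> [a]
interleave []       ys = ys
interleave (x : xs) ys = x : interleave ys xs

For instance, interleave [0,2..] [1,3..] enumerates all natural numbers, whereas appending the two infinite lists would never reach the second one.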
Things become more challenging for functions on countable sets of more complex types, e.g., countable sets themselves. Specifically, the above recipe fails for the function . It is not a problem to define a corresponding function on lazy lists:
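The following Haskell analogue conveys the intended behavior of that lazy-list function (the names are ours): a fair, diagonal-style merge of a lazy list of lazy lists, so that every element ends up at a finite position of the result. The sketch may diverge when an infinite tail of inner lists is all empty, a corner case the corecursive Isabelle definition must handle.

-- Merge a (possibly infinite) lazy list of (possibly infinite) lazy lists.
-- Interleaving the head list with the merge of the rest yields a diagonal
-- enumeration: element k of inner list n appears at a finite position
-- whenever the merge is productive.
diagMerge :: [[a]] -> [a]
diagMerge = foldr interleave2 []
  where
    interleave2 []       ys = ys
    interleave2 (x : xs) ys = x : interleave2 ys xs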
The Isabelle definition crucially relies on corecursion up-to friends [5], in particular that is a friendly function (which is proved automatically for its definition). However, the following equation uses the map function on lazy lists on the left and is thus invalid code:
A dual problem occurs for functions on subtypes with compound return types. Kunčar [22, §6.4] presents and automates a solution for subtypes, which inspires our solution for quotients. We introduce a type that is isomorphic to as a quotient of (with the abstraction function serving as the pseudo-constructor for code generation). Since this new type does not have a compound domain, the standard lifting of to the function yields a valid code equation. Thus we have reduced the problem to finding an executable function . This function can be defined by lifting the identity on lazy lists, and it can be made executable by lifting the lazy list constructors to and using the following code equations:
Overall, we obtain , which is executable. We remark that we cannot use Lochbihler and Stoop’s framework [24] for executing generated Isabelle code lazily in strict programming languages, because quotient types are not supported by that framework. Thus, we resort to Haskell as the code generation target.
7 Related Work
This work provides mechanized proofs of most asynchronous dataflow algebra axioms by Bergstra et al. [3]. They present the two original algebra instances: the relation-based stream transformer model and the process algebra model using the Algebra of Communicating Processes (ACP) [2]. Our operator definitions closely follow their ACP counterparts. One difference is that most of their operators are defined by explicitly pre- and post-composing with identity operators ( instead of ), whereas we work with the core operators and only use the additional identities where they are necessary. For the stream transformer model, Broy and Ştefănescu [9] provide pen-and-paper proofs for a slightly different algebra.
The recent Isabelle/HOL formalization of time-aware stream processing [30] bears some similarities to our work. It defines stream processing operators using a codatatype with a single constructor that combines both reading and writing: a finite list of outputs is produced as a reaction to reading an input. In contrast, we separate both operations (so that the operator can decide what it wants to do next) and additionally support nondeterministic choices and silent steps. Their approach covers time metadata associated with processed events (which is necessary for aggregating computations) but does not support loops. We plan to extend our network algebra operators with capabilities to observe time metadata.
Chappe et al. [14] formalize choice trees in Rocq – a semantic domain for modeling nondeterministic, recursive, and impure programs, based on the interaction trees developed by Xia et al. [33]. (Later, interaction trees have also been formalized in Isabelle [15].) Our operators can be seen as a domain-specific variant of choice trees tailored to asynchronous dataflow. In particular, our types make input and output ports explicit, which is used in the composition and feedback operators. In contrast, choice trees are monadically composed along their only output. Due to Isabelle’s simple types, our operators are restricted to use a fixed type parameter for the data items. In a dependently-typed system, the type of data could be indexed by the corresponding port, much like the interaction type depends on the action type in choice/interaction trees.
Beyond the world of mechanization, the streaming language Flo [23] has been recently proposed as a formal representation of streaming computations originating from systems like Flink [12] and DBSP [11]. Flo makes two assumptions on these systems: streaming progress and eager execution. The first assumption means that a program produces as much output as it can given the observed inputs; the second ensures that this output is deterministic. One practical challenge in Flo is the need to demonstrate that operators satisfy both assumptions.
8 Discussion
We have presented our Isabelle formalization of a semantic domain for asynchronous dataflow along with operators that satisfy all but one of Bergstra et al.’s axioms [3]. Our formalization comprises 28 000 lines of definitions and proofs. The bulk of this work is the coinduction proofs of the 51 axioms, each spanning between 12 and almost 4 000 lines.
In terms of proof assistant technology, we make use of Isabelle’s support for codatatypes, in particular using nested corecursion through the type of countable sets. While we could define all our desired operators, some definitions needed adjustments from their most natural (presented) formulations to be accepted by . Our formalization may give insight to developers on how to make this advanced command even more convenient to use.
Although not visible in the resulting formalization, we made great use of the Sketch and Explore tool when proving the axioms. Typically, we would invoke and obtain a number of subgoals to work on. (In the case of there are more than fifty subgoals.) The Sketch and Explore tool presents these subgoals as an Isar proof outline, so that, once the outline is finished, the invocation can become a terminal proof method, which is deemed better proof style.
Our formalization provides reusable artifacts beyond dataflow. In particular, we have outlined an approach for emulating empty types through the use of the type class. We have also sketched how to generate executable code for functions on quotient types with compound domain types. Although we were mostly interested in countable sets, we presented a general technique dualizing the existing solution for a similar problem with subtypes.
An anecdote from our formalization journey is illustrative. After completing all axiom proofs, we noticed that we had actually proved the version of A5, which according to Bergstra et al. is not expected to hold in all models. This made us suspicious, and we inspected the axioms extra carefully. The proof assistant did not help here: Isabelle had happily accepted our proofs. It turned out that we had made a mistake in the statements: instead of reassociating sum types we were rotating them, which trivialized several axioms. We rushed to fix the mistake and proved all axioms except for and . After some (useful) struggle with Isabelle, we realized that our definition of deviated from Bergstra et al.’s: instead of replacing non-equal pairs by we were dropping such pairs entirely. Fixing meant that we had to revisit all our proofs involving it, including . Here, Isabelle turned out to be tremendously helpful: the proof structure did not change, and local fixes (in places Isabelle pointed to) were sufficient to bring us back on track. We finally also managed to prove for (but not for ).
We are working on proving for using trace equivalence. As future work, we will formalize the history model (i.e., give operators a semantics in terms of input/output relations on lazy lists) and demonstrate that it suffers from the Brock–Ackermann anomaly [8, 19]. We have already formalized the trace semantics, which constitutes Jonsson’s workaround for this issue [19]. However, a characterization of traces for composition and feedback in terms of their arguments’ traces is missing. As a long term objective, we want to make time a first-class citizen in our dataflows, so that we can formalize programs expressed in data stream processing frameworks such as Timely Dataflow [26]. Our work provides a solid foundation for asynchronous dataflow into which we will incorporate promising initial progress in that direction [30, 10].
References
- [1] Jesper Bengtson. Formalising process calculi. PhD thesis, Uppsala University, 2010.
- [2] Jan A. Bergstra and Jan Willem Klop. Process algebra for synchronous communication. Inf. Control., 60(1-3):109–137, 1984. doi:10.1016/S0019-9958(84)80025-X.
- [3] Jan A. Bergstra, Cornelis A. Middelburg, and Gheorghe Stefanescu. Network algebra for asynchronous dataflow. Int. J. Comput. Math., 65(1-2):57–88, 1997. doi:10.1080/00207169708804599.
- [4] Jan A. Bergstra, Cornelis A. Middelburg, and Gheorghe Stefanescu. Network algebra for synchronous dataflow. CoRR, abs/1303.0382, 2013. arXiv:1303.0382.
- [5] Jasmin Christian Blanchette, Aymeric Bouzy, Andreas Lochbihler, Andrei Popescu, and Dmitriy Traytel. Friends with benefits - implementing corecursion in foundational proof assistants. In Hongseok Yang, editor, ESOP 2017, volume 10201 of LNCS, pages 111–140. Springer, 2017. doi:10.1007/978-3-662-54434-1_5.
- [6] Jasmin Christian Blanchette, Johannes Hölzl, Andreas Lochbihler, Lorenz Panny, Andrei Popescu, and Dmitriy Traytel. Truly modular (co)datatypes for Isabelle/HOL. In Gerwin Klein and Ruben Gamboa, editors, ITP 2014, volume 8558 of LNCS, pages 93–110. Springer, 2014. doi:10.1007/978-3-319-08970-6_7.
- [7] Timothy Bourke, Lélio Brun, Pierre-Évariste Dagand, Xavier Leroy, Marc Pouzet, and Lionel Rieg. A formally verified compiler for Lustre. In Albert Cohen and Martin T. Vechev, editors, PLDI 2017, pages 586–601. ACM, 2017. doi:10.1145/3062341.3062358.
- [8] J. Dean Brock and William B. Ackerman. Scenarios: A model of non-determinate computation. In Formalization of Programming Concepts: International Colloquium Peniscola, Spain, April 19–25, 1981 Proceedings, pages 252–259. Springer, 1981. doi:10.1007/3-540-10699-5_102.
- [9] Manfred Broy and Gheorghe Stefanescu. The algebra of stream processing functions. Theor. Comput. Sci., 258(1-2):99–129, 2001. doi:10.1016/S0304-3975(99)00322-9.
- [10] Matthias Brun, Sára Decova, Andrea Lattuada, and Dmitriy Traytel. Verified progress tracking for Timely Dataflow. In Liron Cohen and Cezary Kaliszyk, editors, ITP 2021, volume 193 of LIPIcs, pages 10:1–10:20. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ITP.2021.10.
- [11] Mihai Budiu, Tej Chajed, Frank McSherry, Leonid Ryzhyk, and Val Tannen. DBSP: automatic incremental view maintenance for rich query languages. Proc. VLDB Endow., 16(7):1601–1614, 2023. doi:10.14778/3587136.3587137.
- [12] Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. Apache Flink™: Stream and batch processing in a single engine. IEEE Data Eng. Bull., 38(4):28–38, 2015. URL: http://sites.computer.org/debull/A15dec/p28.pdf.
- [13] Paul Caspi, Daniel Pilaud, Nicolas Halbwachs, and John Plaice. Lustre: A declarative language for programming synchronous systems. In POPL 1987, pages 178–188. ACM Press, 1987. doi:10.1145/41625.41641.
- [14] Nicolas Chappe, Paul He, Ludovic Henrio, Yannick Zakowski, and Steve Zdancewic. Choice trees: Representing nondeterministic, recursive, and impure programs in Coq. Proc. ACM Program. Lang., 7(POPL):1770–1800, 2023. doi:10.1145/3571254.
- [15] Simon Foster, Chung-Kil Hur, and Jim Woodcock. Formally verified simulations of state-rich processes using interaction trees in Isabelle/HOL. In Serge Haddad and Daniele Varacca, editors, CONCUR 2021, volume 203 of LIPIcs, pages 20:1–20:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.CONCUR.2021.20.
- [16] Florian Haftmann, Alexander Krauss, Ondrej Kuncar, and Tobias Nipkow. Data refinement in Isabelle/HOL. In Sandrine Blazy, Christine Paulin-Mohring, and David Pichardie, editors, ITP 2013, volume 7998, pages 100–115. Springer, 2013. doi:10.1007/978-3-642-39634-2_10.
- [17] Florian Haftmann and Tobias Nipkow. Code generation via higher-order rewrite systems. In Matthias Blume, Naoki Kobayashi, and Germán Vidal, editors, FLOPS 2010, volume 6009 of LNCS, pages 103–117. Springer, 2010. doi:10.1007/978-3-642-12251-4_9.
- [18] Brian Huffman and Ondrej Kuncar. Lifting and transfer: A modular design for quotients in Isabelle/HOL. In Georges Gonthier and Michael Norrish, editors, CPP 2013, volume 8307 of LNCS, pages 131–146. Springer, 2013. doi:10.1007/978-3-319-03545-1_9.
- [19] Bengt Jonsson. A fully abstract trace model for dataflow and asynchronous networks. Distributed Computing, 7:197–212, 1994. doi:10.1007/BF02280834.
- [20] Cezary Kaliszyk and Christian Urban. Quotients revisited for Isabelle/HOL. In William C. Chu, W. Eric Wong, Mathew J. Palakal, and Chih-Cheng Hung, editors, SAC 2011, pages 1639–1644. ACM, 2011. doi:10.1145/1982185.1982529.
- [21] Martin Kleppmann and Jay Kreps. Kafka, Samza and the Unix philosophy of distributed data. IEEE Data Eng. Bull., 38(4):4–14, 2015. URL: http://sites.computer.org/debull/A15dec/p4.pdf.
- [22] Ondřej Kunčar. Types, abstraction and parametric polymorphism in higher-order logic. PhD thesis, Fakultät für Informatik, Technische Universität München, 2016.
- [23] Shadaj Laddad, Alvin Cheung, Joseph M. Hellerstein, and Mae Milano. Flo: A semantic foundation for progressive stream processing. Proc. ACM Program. Lang., 9(POPL):241–270, 2025. doi:10.1145/3704845.
- [24] Andreas Lochbihler and Pascal Stoop. Lazy algebraic types in Isabelle/HOL, 2018.
- [25] Robin Milner. Communication and concurrency. PHI Series in computer science. Prentice Hall, 1989.
- [26] Derek Gordon Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. Naiad: a timely dataflow system. In Michael Kaminsky and Mike Dahlin, editors, SOSP 2013, pages 439–455. ACM, 2013. doi:10.1145/2517349.2522738.
- [27] Lawrence C. Paulson. A Fixedpoint Approach to (Co)Inductive and (Co)Datatype Definitions, pages 187–211. MIT Press, Cambridge, MA, USA, 2000.
- [28] Damien Pous. Weak bisimulation up to elaboration. In Christel Baier and Holger Hermanns, editors, CONCUR 2006, volume 4137 of LNCS, pages 390–405. Springer, 2006. doi:10.1007/11817949_26.
- [29] Damien Pous. New up-to techniques for weak bisimulation. Theor. Comput. Sci., 380(1-2):164–180, 2007. doi:10.1016/J.TCS.2007.02.060.
- [30] Rafael Castro Gonçalves Silva and Dmitriy Traytel. Time-aware stream processing in Isabelle/HOL. In Tobias Nipkow, Lawrence C. Paulson, and Makarius Wenzel, editors, Isabelle workshop 2024, pages 1–19, 2024. URL: https://files.sketis.net/Isabelle_Workshop_2024/Isabelle_2024_paper_8.pdf (accessed: July 1, 2025).
- [31] Rafael Castro Gonçalves Silva, Laouen Fernet, and Dmitriy Traytel. Nondeterministic asynchronous dataflow in Isabelle/HOL artifact, June 2025. doi:10.5281/zenodo.15737868.
- [32] Makarius Wenzel et al. The Isabelle/Isar reference manual, 2004.
- [33] Li-yao Xia, Yannick Zakowski, Paul He, Chung-Kil Hur, Gregory Malecha, Benjamin C. Pierce, and Steve Zdancewic. Interaction trees: representing recursive and impure programs in Coq. Proc. ACM Program. Lang., 4(POPL):51:1–51:32, 2020. doi:10.1145/3371119.
- [34] Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. Discretized streams: fault-tolerant streaming computation at scale. In Michael Kaminsky and Mike Dahlin, editors, SOSP 2013, pages 423–438. ACM, 2013. doi:10.1145/2517349.2522737.
