RMR-Efficient Detectable Objects for Persistent Memory and Their Applications
Abstract
We describe a novel construction of arbitrary read-modify-write (RMW) primitives in a persistent shared memory model with process failures. Our construction uses blocking synchronization, in the form of recoverable mutual exclusion (RME), and is optimal in terms of the widely studied remote memory reference (RMR) complexity measure. The implemented objects tolerate either system-wide or independent process crashes, depending on the RME lock used, and also provide detectability for resolving the outcome of operations interrupted by failures. We prove that our construction is RMR-optimal using a reduction back to the RME problem. Our proof technique introduces a novel algorithmic style that enables solving challenging synchronization problems using a common execution path for both the system-wide and independent failure models, which previously required separate analyses, and relies only on a suitable implementation of the detectable base objects in each model to achieve RMR efficiency. Experiments demonstrate that our construction outperforms prior wait-free and lock-free algorithms on a multiprocessor with Intel Optane persistent memory.
Keywords and phrases:
persistent memory, synchronization, recoverability, fault tolerance, detectability, scalability, RMR complexity, theory, mutual exclusionFunding:
Wojciech Golab: Supported by an Ontario Early Researcher Award, a Google Faculty Research Award, and the Natural Sciences and Engineering Research Council (NSERC) of Canada.Copyright and License:
2012 ACM Subject Classification:
Theory of computation Concurrent algorithmsAcknowledgements:
We are grateful to the anonymous reviewers for their valuable feedback.Editors:
Silvia Bonomi, Letterio Galletta, Etienne Rivière, and Valerio SchiavoniSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Over the last decade, the emergence of persistent main memory has reinvigorated research on shared memory algorithms. The ability to retain data across power failures and system crashes opens the door to faster recovery of in-memory data structures than is possible using the traditional approach of rebuilding such structures using state saved in slower secondary storage, and has precipitated a thorough reexamination of established theoretical abstractions. The persistent memory revolution has so far witnessed several reformulations of classic correctness properties and synchronization problems (e.g., [7, 31, 3, 24, 20]) into their recoverable counterparts, as well as the birth of more novel concepts such as Friedman, Herlihy, Marathe and Petrank’s notion of detectability [18, 19] – the capability to forensically determine the outcome of an operation that was interrupted by a failure.
The blossoming landscape of problems and solutions hints at a bright future for algorithmic research on persistent main memory, though the full perspective remains difficult to comprehend given limited connections between emerging concepts. This research takes important steps towards filling this gap by exploring the relationship between detectability, which we formalize using sequential specifications [6, 37, 39], and the recoverable mutual exclusion (RME) problem [24, 25], which is a modern take on the long-standing question of how to implement fault-tolerant mutex locks. In a nutshell, RME generalizes Dijkstra’s mutual exclusion (ME) problem [16] by allowing processes to crash at arbitrary times and recover by starting over. While it makes sense intuitively that objects of arbitrary types can be constructed using RME locks, existing literature focuses on recovering the internal state of the lock and overlooks the fundamental problem of recovering actions performed in the critical section. At best, compositions of RME locks have been considered where such actions are largely idempotent, and can be recovered by repeating the critical section. The use of detectable objects to solve RME is similarly unexplored, and we view this as a missed opportunity since the treatment of such problems from first principles is notoriously challenging (e.g., see [22, 34]). Advances along new research directions that simplify reasoning about failures and concurrency are urgently needed as the theory community takes aim at more complex synchronization problems in the realm of persistent memory.
Implementing detectable objects correctly and using them efficiently is itself technically challenging. Fundamentally, such object implementations face the problem that modern memory hierarchies combine persistent media, such as Intel Optane persistent memory, with volatile CPU registers that are used by primitive memory operations to return responses to the application program. In other words, it is not possible for a program to simply “look up” the outcome of its last memory operation before a failure, or even determine exactly which operation was the last one. Despite this, detectable objects must track the progress of operations across failures so that any process can resolve the status of its most recent operation, even if the corresponding state transition has already been overwritten by an operation applied by another process. Further perils lie ahead when detectable objects are used to solve more complex synchronization problems because such objects achieve only a partial separation of concerns between fault tolerance and concurrency control. To begin with, a recoverable algorithm must decide which of its possibly many detectable base objects need to be resolved, and in which order. Additional recovery state that is kept outside of the base objects (e.g., via checkpoint variables) may also be needed to guide recovery towards a consistent state. On the other hand, once the detectable objects are properly implemented and techniques for their correct use are developed, a new algorithmic style emerges where application-specific imperatives, such as enforcing linearizability [29] or mutual exclusion [16], can be insulated from mundane low-level recovery actions such as determining the outcome of a read-modify-write (RMW) hardware instruction after a failure.
Our main contributions in this paper are two-fold. First (Section 4), we present a universal construction of detectable RMW primitives. Our RME-based construction is the first to implement arbitrary RMW types while using a sublinear number of remote memory references or RMRs – expensive memory operations that traditionally define the time complexity of ME and RME algorithms – while ensuring strict linearizability [1]. We show experimentally that it performs well in comparison to prior lock-free and wait-free techniques. Second (Section 5), we leverage the detectable RMW primitive to devise a novel RMR-optimal RME algorithm, and deduce from the properties of this algorithm a lower bound on the worst-case RMR complexity of any generic construction of detectable RMW primitives. The unique algorithmic style emerging from this reduction opens the door to solving complex problems using a common execution path for both the system-wide and independent failure models, which previously required separate analyses, and relies only on a suitable implementation of the detectable base objects to achieve RMR efficiency in each failure model.
2 Model
We combine Herlihy and Wing’s classic model of linearizable data structures [27] with features borrowed from recent research on crash recovery for persistent memory [7, 31, 25, 23, 3, 37]. A fixed set of asynchronous processes, denoted , communicate by applying operations on shared base objects. Processes may fail by crashing, either independently or simultaneously, and may recover from crashes by restarting their execution from the beginning under the same process ID. Each base object has a sequential specification that describes its possible states and state transitions. We will assume that base object operations are atomic with the understanding that they can be either supported directly in hardware, or emulated in software via shared object implementations that specify an access procedure for each emulated operation. Base objects are persistent, meaning that their states are unaffected by crash failures. In addition, processes may use private volatile variables whose values are reinitialized during recovery from a failure. The algorithms presented in this paper can be adapted to more complex models that consider a volatile cache through judicious insertion of persistence instructions, for example using the transformations described in [31, 13].
Executions of algorithms are modeled using histories, which are sequences of system states and steps. A system state is the combination of values assigned to all program variables, shared and private, including program counters. There are three types of steps: a base object step represents the execution of one shared memory operation by a process along with bounded local computation (e.g., arithmetic and accessing private volatile variables); an invocation or response step represents the invocation of an operation on an implemented object or its response, respectively; and a crash step represents the crash failure of one or more processes. Crashes can be independent (affecting one process) or system-wide (affecting all processes simultaneously). We consider two types of algorithms in this paper: shared object implementations (Section 4) modeled as a collection of access procedures for different implemented operations, and recoverable mutual exclusion (RME) [24] (Section 5) modeled as a special infinite loop. Histories of RME-like algorithms comprise only base object and crash steps. Histories of implementations additionally contain invocation and response steps.
Histories of an algorithm must satisfy safety and liveness properties specific to that algorithm. For RME, where processes repeatedly execute a non-critical section (NCS), recovery section (RS), entry section (ES), critical section (CS), and exit section (XS), the fundamental safety property is mutual exclusion: at most one process can be in the CS at a given time. A bounded prefix of the ES called the doorway controls the order of entry into the CS in first come first serve (FCFS) RME locks. Liveness means that executions of the RS, ES, and XS eventually terminate, and is predicated on two assumptions: (1) the CS is bounded, and (2) the history is fair, meaning that if a process leaves the NCS, it continues taking steps until it completes the XS and returns to the NCS without crashing. The latter implies that if a process crashes during a passage, which is a sequence of consecutive executions of the RS, ES, CS, and XS, then it eventually begins another passage. A maximal sequence of passages by one process where only the last one ends in a complete execution of the XS is called a super-passage. For shared object implementations, we choose strict linearizability [1] as the main correctness property: implemented operations appear to take effect instantaneously between their invocation and response steps, and any operation interrupted by a crash failure must take effect prior to the failure or not at all. Liveness means that every implemented operation eventually returns a response, and is once again predicated on the history being fair, meaning that if a process invokes an operation then it must continue taking steps until the operation produces a response or the process crashes.
Some shared object implementations are detectable [18, 19], meaning that a process can resolve precisely the outcome of an operation that was interrupted by a crash. We formalize detectability using sequential specifications, following the trend established in [6, 37, 39, 32], whereby detection is accomplished by invoking a special resolution operation (called resolve) on the implemented object. Specifically, we adopt the unified sequential specification (UDSS) introduced by Moridi et al. [39], whereby resolve returns a triple representing the most recent detectable operation invoked by the calling process that took effect, or else if no such operation exists. Here is the response of , and is an auxiliary argument passed to to disambiguate successive invocations of that may return the same response (or no response at all).
We quantify complexity in terms of space, measured by counting the number of persistent base objects, and time, measured by counting remote memory references (RMRs) – expensive shared memory operations that necessitate communication on the interconnect that joins processors with memory. In the cache-coherent (CC) multiprocessor architecture, RMRs are closely tied to cache misses and invalidations. For analyzing upper bounds, we assume that every shared memory operation is an RMR unless a process reads an object that it has previously read, and which has not been overwritten since the most recent such read. In the distributed shared memory (DSM) model, each variable is statically defined as local to exactly one process and remote to all others, and an RMR is defined as any access to a remote variable. In special cases, algorithms or portions thereof may be wait-free [28], meaning that they terminate after a bounded number of steps in all executions.
3 Overview of Results
In this research, we derive both upper and lower bounds on the worst-case RMR complexity per operation of detectable RMW objects (more precisely on objects that simultaneously support FAS and CAS) by connecting this topic formally to RME. In terms of upper bounds, our detectable objects have the same RMR complexity asymptotically as their underlying RME locks, and require only additional space. As we discuss in Section 4.2, experiments conducted using Intel Optane persistent memory show that our objects scale better than alternatives obtained using lock-free and wait-free techniques [6, 36] for Fetch-And-Store operations. Our lower bound on RMR complexity pertains to the independent failure model, and shows that implementing detectable objects is as hard as solving the RME problem. We obtain this result by constructing a reduction from the latter problem to the former, and applying existing lower bounds on the RMR complexity of RME [9, 10]. The lower bound is tight, which makes our detectable object construction RMR-optimal for independent failures. The same holds for system-wide failures, where we achieve RMR complexity. As summarized in Tables 1 and 2, we present the first sub-linear (i.e., ) RMR bounds for detectable unconditional RMW operations such as Fetch-and-Store.
4 From RME to Detectable Objects
This section presents a generic RMR-efficient construction of detectable read-modify-write (RMW) primitives using an RME lock as the main building block. This result complements prior work (see Section 6) on detectable constructions of comparison primitives that achieve RMR-efficiency indirectly by guaranteeing wait-freedom. As we show later on in Section 5, detectable FAS and CAS primitives instantiated using our construction can be used to solve the RME problem efficiently and relatively straightforwardly, which establishes equivalence in terms of worst-case RMR complexity (per passage and per operation, respectively) between RME and any generic detectable RMW construction.
4.1 Detailed description and analysis
The sequential specification of the implemented object in our construction is a subset of Moridi et al.’s UDSS [39] applied to a generic Fetch-And- (FA) operation that returns the previous abstract state of the object and applies a state transition according to the user-specified function . In formal terms, we can regard as a parameter of the implemented operation, which enables the application of different state transitions at different points in the same execution. For example, Fetch-And-Increment (FAI) can be emulated as , and a Fetch-And-Store (FAS) operation that swaps in value can be emulated as . Compare-And-Swap (CAS) and blind Write operations are also easily emulated. The UDSS transformation maps FA to detectable-FA, which accepts an additional argument, and a resolve operation that returns a triple of the form as explained in Section 2. In a nutshell, the generic construction protects executions of detectable-FA using a recoverable mutex lock . Dealing with crash failures in this construction is nontrivial for two reasons. First, as we explain shortly, the critical section established by comprises multiple memory operations and requires cleanup after a crash. Second, a process that crashes during a passage through is required (by the specification of the RME problem) to restart and complete a failure-free passage. We address both concerns by adding recovery actions in the resolve operation, and by ensuring that this operation is executed after a failure via the following access restriction:
Assumption 1.
If a process fails inside detectable-FA then it must eventually complete a call to resolve, and moreover it must do so before invoking any other operation on the same object.
The construction records the abstract state of the implemented object in a shared variable with state transitions occurring at line 18. is initialized to , the initial abstract value of the implemented object. Additional state needed for detectability is maintained using a collection of arrays. Process uses , and at lines 6–11 to record the signature of its next detectable operation (i.e., function ), the corresponding tag argument, and the value fetched from , respectively. In case the operation is interrupted by a crash and needs to be rolled back for strict linearizability, first creates a secondary copy of these state variables in , , and , respectively, at lines 2–4. The decision to roll back is informed by the value of the checkpoint variable , inspired by [25], which uses to track its progress through the critical section at lines 1, 5, 17, and 20. An additional array is used to record the last value read from at line 12, which simplifies recovery. Finally, we use a global variable and an array to implement a helping mechanism that synchronizes the critical section in detectable-FA with resolve. The variable records the ID of the process executing the critical section at line 15, and is used at lines 13–14 to pass the previous value of to the last process that completed the CS up to and including line 15.
The operation resolve starts with a passage through at lines 23–25, which is required only to repair the internal state of . The remainder of resolve is structured as a case analysis based primarily on the value of . Its goal is to ensure that , and record the correct details of ’s last detectable operation that took effect. No special action is required if or . The case (line 28) indicates that a detectable operation failed before it took effect but after its details were already recorded in , and . These state changes are rolled back at lines 29–31 to maintain strict linearizability. The case (line 33) indicates that a detectable operation failed close to the point where the abstract state is updated at line 18, and is the most challenging to resolve. Process must deduce correctly whether its last operation took effect, even if other processes subsequently applied their own detectable-FA operations. The key algorithmic idea is for to obtain a copy of the abstract state as of the time of its failure, and compare it against the value fetched from and saved in at line 12. The copy is obtained by reading at line 36 if (line 35), which indicates that another process already completed the helping mechanism at lines 13–14, otherwise by reading directly line 34. If the copy and saved value match (line 37) then either ’s last detectable-FA operation did not update at line 18, or it did update but caused a trivial state transition via a fixed point of the function . We consider that ’s last detectable-FA operation did not take effect in both cases, and roll back ’s state changes at lines 38–40 similarly to the case . Otherwise, the copy and saved value are unequal, which implies that did update . In that case ’s last detectable-FA operation did take effect and the correct values are already stored in , and . All cases in the analysis finally reset to 0, if needed, and then produce the response at line 42.
The main correctness properties of the construction are stated in Theorem 1 and Theorem 2. Detailed proofs are included in Appendix A.
Theorem 1.
The generic construction is strictly linearizable.
Theorem 2.
Every operation applied to the generic construction terminates eventually in any infinite fair execution history. Furthermore, each operation incurs RMRs where denotes the worst-case RMR complexity of the recoverable mutex .
The detectable construction can be optimized in several ways. To begin with, a wait-free read operation can be implemented by simply reading directly, without acquiring the RME lock. Furthermore, lines 23–25 of resolve can be replaced with a wait-free withdrawal operation, if supports it (e.g., [14]), which cleans up the state of the RME lock without acquiring the CS and makes the entire resolution procedure wait-free. (Wait-free resolution is important for avoiding deadlocks during recovery from crash failures.) Finally, we can dispense with the RME lock altogether and reduce the RMR (and step) complexity to if only one process at a time executes the detectable-FA operation, which is relevant in scenarios where the construction emulates a single-writer multi-reader register with wait-free read. Additional ideas related to application of detectable base objects in FCFS-fair RME locks are discussed in Appendix B.
Note that, depending on which RME lock is used to protect the CS of detectable-FA, our implementation may satisfy additional desirable properties. We summarize these in Table 3 for some of the existing RME locks. Specifically, for the independent failure model, if Dhoked and Mittal’s RME lock [15] is used, our implementation becomes adaptive to failures, and has only RMR complexity as long as only a small number of failures occur.
4.2 Experimental Evaluation
We implemented our detectable construction in C++ using the Intel Persistent Memory Development Kit (PMDK) [44], and evaluated it on a 4-socket Intel Xeon Gold 6230 multiprocessor with Optane persistent memory. We compare against two alternatives: a lock-free emulation of FAS obtained using Ben-David, Blelloch, Friedman and Wei’s wait-free detectable Compare-And-Swap (CAS) [6]; and a direct implementation of FAS using Lehman, Attiya and Hendler’s partly wait-free detectable swap [36].111 Lehman, Attiya and Hendler’s construction uses a blocking recovery algorithm for independent failures, and can be considered fully wait-free in the system-wide failure model with centralized recovery. Our construction internally uses Dhoked, Golab, and Mittal’s withdrawable RME lock [14], which incurs RMRs in the system-wide failure model. We consider two variations of the construction, one corresponding to the code in Figure 1 and the other leveraging critical section re-entry (CSR), which makes it possible to eliminate the helping mechanism in detectable-FA. The latter optimization sacrifices the possibility of wait-free resolution as it requires a critical section inside resolve.
Both our construction and the alternative algorithms were augmented with persistence barriers (i.e., cache line write-backs and store fences) to control the durability and visibility of shared memory operations. The barriers were implemented by calling pmem_persist in the PMDK. The transformation of Izraelevitz, Mendes, and Scott [31] mandates placing such barriers after each memory operation, which is aggressive but appropriate for the lock-free and wait-free algorithms based on [6, 36]. However, this technique does not preserve the RMR complexity of busy-wait loops on our hardware platform, where persistence barriers always invalidate cache lines.222 Newer Intel Xeon processors offer a cache line writeback (CLWB) instruction that may either retain the cache line or invalidate it. Our processor model always invalidates the cache line [43]. For this reason, we manually augmented the two implementations of our construction with barriers. Our manual technique relaxes Izraelevitz, Mendes, and Scott’s recipe by using only one barrier per busy-wait loop. We also avoid barriers in the special case where a process reads a single-writer variable for which it is the exclusive writer. Additionally, we consider a more advanced persistence technique, implemented in the FliT library [47], that can effectively avoid redundant persistence operations while reading variables.
Each thread in the experiment repeatedly applies detectable FAS operations on a common shared object. Total throughput (operations/s) is measured at different levels of parallelism to evaluate scalability up to 160 threads: 4 processors with 20 hyperthreaded cores. Each point in our plots represents the average over a sample of three failure-free runs, each lasting 5s. Error bars (often too small to be seen) indicate one sample standard deviation. Runs with up to 20 threads collocate all the threads on a single processor for efficient synchronization, and larger experiments exhibit classic NUMA (Non-Uniform Memory Access) effects due to expensive cross-socket communication. Hyperthreading is used in runs with 81-160 threads.
Figure 2 presents the scalability of implementations with aggressive persistence based on variations of [31] (i.e., without FliT). The results demonstrate that using a non-CSR RME lock in our construction with a helping mechanism is mostly superior to using a CSR RME lock without helping. Both alternatives substantially outperform the recoverable swap [36], which has bounded but linear RMR complexity in failure-free executions, and which we instantiated with . The lock-free implementation based on [6] exhibits the worst scalability, which is expected since FAS operations are contention-prone.
Next, we introduce centralized recovery (i.e., recovery in the main function of the benchmark program), which allows two optimizations. First, our construction can leverage an efficient failure detector-based RME lock [23] that mostly eliminates persistence barriers. Second, we can apply the hash table-based persistence mechanism provided by the FliT library [47] to minimize redundant persistence instructions for read-heavy shared variables. Figure 3 compares our construction with the optimized non-CSR lock against a FliT-optimized recoverable swap (FliT persistence applied to the variable and array in [36]), and against the lock-free implementation based on [6].333 The FliT library could not be applied to [6] due to missing support for double word atomic operations. Consequently, we use the same implementation of CAS-based FAS in Figure 3 as in Figure 2. The results, presented in Figure 2, show significant speed-ups for the optimized algorithms as compared to their counterparts in Figure 2. Our construction still leads in terms of throughput, especially at 25 threads where FliT’s internal atomic counters suffer noticeably from NUMA effects.
5 From Detectable Objects to RME
This section presents a lower bound on the RMR complexity of detectable objects. We obtain the bound by devising a novel RME algorithm that uses detectable objects for synchronization, and then applying known lower bounds for the RMR complexity of the RME problem [9, 10]. Although the algorithm does not break new ground in terms of RMR and space complexity for the RME problem, its design embodies a novel algorithmic style that may be of independent interest: a common execution path is used for both the system-wide and independent failure models, which previously required separate designs and analyses. The generic algorithm is then instantiated for a particular failure model by substituting a suitable implementation of the detectable base objects, which determines the overall RMR complexity.
Our algorithm, presented in Figure 4 is an adaptation of Mellor-Crummey and Scott’s (MCS) venerated queue lock [38], which represents attempts to enter the CS using a singly linked list of QNode structures. In this section we consider specifically the context-free version of MCS, in which nodes are allocated at line 58 of the entry section. line 61–71 follow the original MCS algorithm faithfully with the exception of the read-modify-write operations at line 62 and line 68, where we introduce detectability. The tail pointer variable implements a subset444The full UDSS would also preserve non-detectable FAS and CAS operations, which we do not need. of Moridi et al.’s UDSS [39] applied to the combination of FAS and CAS operations: it supports detectable FAS and CAS operations that accept an auxiliary argument called a tag [39], in our case a QNode reference, and the resolution operation described earlier in Section 2. The latter is invoked at line 45 of the recovery section only if a process crashed between line 59 and line 72, where a reference to the current node is saved for recovery, and returns a triple as explained in Section 2. This triple encodes the name of the last operation as , the value of the auxiliary argument as , and the response of the operation as . The recovery section performs a case analysis on the values of these components, as we explain shortly, to decide the correct way to clean up the last passage.
A crash failure of process occurring with (i.e., before line 59 or after line 72) is benign, and in this case the body of Recover is bypassed at line 44. More difficult cases occur if crashes after threading its node behind the tail at line 62, and before it has passed control of the RME lock to the successor (i.e., completed line 71) or unthreaded its node from the queue (i.e., completed a successful swap at line 68). If ’s most recent operation on the tail pointer with respect to the current node (read at line 43) was detectable-FAS then must complete its interrupted execution of the entry section. If already linked with the predecessor, then it repeats the busy-wait loop in Enter and then completes Exit via line 50 and line 53. On the other hand, if there was a predecessor but the link was not established then repeats Enter from line 64 and then completes Exit via line 52 and line 53. Finally, if there was no predecessor then simply completes Exit via line 53. Next, there are two cases pertaining to a crash in Exit. If successfully unthreaded its node from the queue then it proceeds to line 72 via line 55 and releases its node. Otherwise has a predecessor and repeats lines 70–72 of Exit via line 57 to link with and pass control of the RME lock. Note that, in all of the above cases, continues to execute the entry, critical and exit sections of the current passage, and no attempt is made to re-execute the CS from the recovery section. To guarantee critical section re-entry (CSR), the algorithm can be modified easily by having bypass Exit at line 53 and then also bypass the body of Enter using an additional private variable to direct the flow of control.
The space complexity of the algorithm can be bounded by recycling the queue nodes allocated at line 58. We defer the details to Appendix C due to lack of space. Our technique reduces space to while preserving the dynamic joining property: the algorithm allows processes to join an execution “on the fly” and does not assume prior knowledge of .
The correctness properties of the algorithm are stated below in Theorem 3 under the implicit assumption that detectable operations on the tail pointer are atomic. The doorway of the algorithm, for FCFS, is a bounded prefix of the entry section and comprises lines 58–62. If is simulated using a software implementation then the detectable-FAS operation must be wait-free, at least until its linearization point, and we address this technicality more formally in Appendix B.
Theorem 3.
The RME algorithm presented in Figure 4 satisfies the mutual exclusion, starvation freedom, and first-come-first-served (FCFS) properties in both the independent and system-wide crash failure models. Furthermore, it incurs RMRs per passage in both the CC and DSM models where denotes the worst-case RMR complexity of operations on the tail pointer variable .
The upper bound on RMR complexity stated in Theorem 3 can be used to deduce a lower bound on the RMR complexity of implementing detectable base objects in the independent crash failure model, as stated in Theorem 4.
Theorem 4.
A strictly linearizable implementation of an object that supports detectable FAS and CAS operations from base objects that store bits each and support arbitrary single-word read-modify-write operations has a worst case RMR complexity per operation of in both the CC and DSM models with independent crash failures.
Proof.
The algorithm presented in Figure 4 is, informally speaking, a reduction from the RME problem to the problem of implementing detectable base objects. Assuming that the memory management technique from Appendix C is applied, the algorithm requires queue nodes and stores values of size bits for node references. Supposing that the detectable base object is replaced with a strictly linearizable implementation555 Strictly linearizable implementations can be used in place of atomic base objects while preserving all correctness properties relevant to this paper, aside from RMR complexity and space complexity. that internally also uses base objects of size bits, the algorithm solves the starvation-free RME problem and is subject to the lower bound of Chan, Giakkoupis and Woelfel [9] for deadlock-free RME under independent failures. The latter implies that the RME algorithm incurs per passage in the worst case, in both the CC and DSM models. Since portions of the algorithm outside of accesses to incur only RMRs per passage, this implies that in some execution a process incurs in total while accessing in one passage. This implies that the implementation of has a worst case RMR complexity of per operation since there are at most four such operations per passage: one in Enter, one in Exit, and two in Recover.
The approach illustrated in Figure 4 can be contrasted against the algorithms of Golab and Hendler [22] as well as Jayanti, Jayanti and Joshi [34], which use much more elaborate recovery protocols to carefully mend the queue of nodes after a failure. A single detectable base object, namely the tail pointer, simplifies this task to the point where the correct recovery actions can be decided by a fairly simple case analysis, without first identifying and then analyzing a complex collection of disconnected queue fragments. Although the detectable base objects are themselves difficult to implement, which we prove formally in Theorem 4 for independent failures, a generic construction of such objects could be a powerful and reusable building block for the next generation of RME algorithms, including recoverable reader-writer and group mutual exclusion locks.
6 Related Work
The prospect of recovering in-memory data structures directly from persistent main memory after a crash failure opened the door to rethinking some of the fundamental assumptions ingrained in decades of literature on shared memory algorithms. We focus in this section on theoretical aspects of persistent memory, acknowledging that the subject caught the attention of practitioners (e.g., in [46, 11, 41, 30, 42, 12]) before it captivated the theory community.
One of the earliest theoretical issues addressed was the formulation of a suitable correctness condition for concurrent objects in a crash-prone environment. Herlihy and Wing’s celebrated linearizability property [29] has a fundamental drawback in that regard: it assumes that a process finishes one operation before it invokes another. This implies that a crashed process cannot in general restart execution using its old identifier; a new identifier must sometimes be chosen, which goes against the nearly universal modeling assumption that a system comprises a bounded set of processes with fixed identifiers. Several alternatives have been proposed, the most widely-cited of which is the durable linearizability (DL) property of Izraelevitz, Mendes, and Scott [31]. DL simply disallows immediate reuse of process IDs, which has the side-effect of allowing an operation that is interrupted by a crash to take effect after the crash. Aguilera and Frølund’s strict linearizability (SL) property [1] instead requires interrupted operations to take effect either before the crash or not at all. It allows for, but does not require, the reuse of process IDs across failures. Thus, SL is the strongest and most portable of the relevant correctness properties, though not always attainable, as shown in [1] for certain register constructions. Weaker conditions proposed in [26, 7] are susceptible to undesirable anomalies (non-local behavior, program order inversion). Attiya, Ben-Baruch, and Hendler defined nesting-safe recoverable linearizability (NRL) [3] for recovery of nested compositions of objects. Their framework assumes that the system restarts a crashed process from the most deeply nested interrupted operation to complete it, which circumvents the anomalies present in [26, 7].
Traditional linearizability-like notions of correctness were challenged by Friedman, Herlihy, Marathe and Petrank [18, 19], who introduced the notion of detectability – the ability to resolve the outcome of an interrupted operation during recovery from a failure. NRL satisfies this goal indirectly by ensuring that every interrupted operation eventually takes effect, and variations on this theme have been adopted in several works [5, 2, 45, 36, 40]. Alternatively, detectability can be embedded into an object’s sequential specification by introducing a special resolution operation, and then combined with an existing correctness condition. This approach was first realized by Ben-David, Blelloch, Friedman, and Wei [6], and later generalized by Li and Golab [37] as the detectable sequential specification (DSS), without relying on the strong system assumptions of NRL. Moridi, Wang, Cui, and Golab [39] examined producer-consumer synchronization as a concrete use case for detectable objects, and also proposed to merge the distinct prepare and execute phases of a detectable operation in the DSS by introducing the conceptually simpler unified DSS (UDSS). Jayanti, Jayanti, and Jayanti’s [32] method-based approach to detectability is a blend of [3] and [7].
Recoverable (i.e., crash-tolerant) objects can be constructed by augmenting conventional techniques through the addition of specialized recovery procedures and, in some models, explicit persistence instructions that flush updates from the volatile cache to the persistent memory [7, 6, 17, 2, 32, 3, 45, 36, 40, 37, 24, 19, 8]. Of these, only purely lock-based or wait-free techniques [7, 24, 3, 6, 45, 34] ensure bounded RMR complexity, and only a handful yield RMR-efficient (particularly sub-linear RMR complexity in the number of processes) implementations. The latter generally fall into three categories: solutions based on Golab and Ramaraju’s recoverable mutual exclusion (RME) [24, 25], for which RMR-optimal solutions were later presented in [33, 13, 22, 34], wait-free implementations of comparison primitives such as Compare-And-Swap (CAS) and Load-Linked/Store-Conditional (LL/SC) [3, 6, 32], and universal wait-free constructions [7]. RME does not prescribe a concrete technique for recovering actions performed in the critical section, and hence only partially addresses the implementation problem solved in this paper. On the other hand, detectable wait-free implementations of comparison primitives compete directly with our RME-based generic construction, but they have limited applicability in the context of proving RMR complexity bounds due to their weak symmetry breaking power. For example, the RMR-optimal algorithms in [33, 13, 22, 34] rely on unconditional synchronization primitives, such as Fetch-And-Store (FAS), to maintain queue structures. The wait-free CAS in [6] is strictly linearizable [1], but the constructions in [7, 3, 32] are not.
Tables 1 and 2 summarize the performance guarantees of the existing implementations of detectable object operations for system-wide failures (Table 1) and independent failures (Table 2). For strict linearizability, as mentioned earlier, the implementation provides a resolve procedure to detect if the operation interrupted by a failure has taken effect and perform any needed recovery. For recoverable linearizability, the implementation provides a recover procedure to enable an operation interrupted by a failure to be completed.
Tight bounds on worst-case RMR complexity in the context of persistent shared memory are known for the RME problem in both the CC and DSM models. For independent failures, Golab and Ramaraju’s algorithm [24, 25] solves RME in RMRs per passage using read/write registers only, and this matches Attiya, Hendler, and Woelfel’s lower bound [4], which applies also to comparison primitives [21]. Using unconditional primitives such as FAS, in combination with CAS and read/write registers, the RME problem is solvable in RMRs per passage using Jayanti, Jayanti, and Joshi’s algorithm [34], which uses the tree structure introduced by Golab and Hendler [22]. This matches the lower bound of Chan and Woelfel [10] for algorithms that use base objects of size bits. Katzan and Morrison [35] showed that more efficient solutions are possible using wider base objects, and the trade-off achieved was proved to be tight by Chan, Giakkoupis, and Woelfel [9]. In the system-wide failure model, RMR bounds for algorithms that use reads, writes, and comparison primitives are the same as in the independent failure model. On the other hand, FAS in combination with CAS and read/write registers can be used to construct efficient queue locks with RMRs complexity using Jayanti, Jayanti, and Joshi’s algorithm [33] or Dhoked, Golab, and Mittal’s [14]. Our transformation of ME algorithms to RME using detectable base objects proves for the first time that an RMR-optimal generic construction of detectable RMW objects has the same worst-case RMR complexity per operation as the RME problem per passage in the system-wide and independent failure models.
Our detectable object construction is an important step towards a general technique to transform a non-recoverable concurrent program into one that leverages persistent memory for recoverability. Several such transformation already exist. Ben-David, Blelloch, Friedman, and Wei’s technique is based on segmenting the program into capsules that can be recovered using detectable operations [6]. Their technique is presented in the context of lock-free synchronization using Compare-And-Swap, but can be extended to support other detectable RMW operations.666Note that, since our implementation of the detectable-FA operation is blocking, the concurrent program generated by Ben-David, Blelloch, Friedman, and Wei’s transformation when extended to use our generic detectable-FA operation will be safe but no longer lock-free. Its main limitation is the assumption of a persistent call stack and page table, which requires a non-standard programming environment and operating system. Attiya, Ben-Baruch, Fatourou, Hendler, and Kosmas describe a general construction of recoverable data structures using persistent memory [2]. Their approach is specific to lock-free algorithms that use only read, write and CAS instructions.
7 Conclusion
Our work draws the first comprehensive connection between detectability and recoverable mutual exclusion. We showed how to use RME to construct detectable read-modify-write objects, and then used such objects to solve the RME problem using an algorithm that provides a common execution path for both system-wide and independent process failure models. Our constructions establish the first RMR lower bound on detectable object implementations, and pave the way toward future RMR-efficient solutions for complex synchronization problems such as recoverable versions of reader-writer exclusion and group mutual exclusion.
References
- [1] Marcos K. Aguilera and S. Frølund. Strict linearizability and the power of aborting. Technical Report HPL-2003-241, Hewlett-Packard Labs, 2003.
- [2] Hagit Attiya, Ohad Ben-Baruch, Panagiota Fatourou, Danny Hendler, and Eleftherios Kosmas. Tracking in order to recover - detectable recovery of lock-free data structures. In Proc. of the 32nd ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 503–505, 2020. doi:10.1145/3350755.3400257.
- [3] Hagit Attiya, Ohad Ben-Baruch, and Danny Hendler. Nesting-safe recoverable linearizability: Modular constructions for non-volatile memory. In Proc. of the 37th ACM Symposium on Principles of Distributed Computing (PODC), pages 7–16, 2018. doi:10.1145/3212734.3212753.
- [4] Hagit Attiya, Danny Hendler, and Philipp Woelfel. Tight RMR lower bounds for mutual exclusion and other problems. In Proc. of the 40th ACM Symposium on Theory of Computing (STOC), pages 217–226, 2008. doi:10.1145/1374376.1374410.
- [5] Ohad Ben-Baruch, Danny Hendler, and Matan Rusanovsky. Upper and lower bounds on the space complexity of detectable objects. In Proc. of the 39th ACM Symposium on Principles of Distributed Computing (PODC), pages 11–20, 2020. doi:10.1145/3382734.3405725.
- [6] Naama Ben-David, Guy E. Blelloch, Michal Friedman, and Yuanhao Wei. Delay-free concurrency on faulty persistent memory. In Proc. of the 31st ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 253–264, 2019. doi:10.1145/3323165.3323187.
- [7] Ryan Berryhill, Wojciech Golab, and Mahesh Tripunitara. Robust shared objects for non-volatile main memory. In Proc. of the 19th International Conference on Principles of Distributed Systems (OPODIS), pages 20:1–20:17, 2016.
- [8] Trevor Brown and Hillel Avni. Phytm: Persistent hybrid transactional memory. Proc. VLDB Endow., 10(4):409–420, 2016. doi:10.14778/3025111.3025122.
- [9] David Yu Cheng Chan, George Giakkoupis, and Philipp Woelfel. Word-size RMR tradeoffs for recoverable mutual exclusion. In Proc. of the 42th ACM Symposium on Principles of Distributed Computing (PODC), pages 79–89, 2023. doi:10.1145/3583668.3594597.
- [10] David Yu Cheng Chan and Philipp Woelfel. Tight lower bound for the RMR complexity of recoverable mutual exclusion. In Proc. of the 40th ACM Symposium on Principles of Distributed Computing (PODC), pages 533–543, 2021. doi:10.1145/3465084.3467938.
- [11] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. Nv-heaps: making persistent objects fast and safe with next-generation, non-volatile memories. In Proc. of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 105–118, 2011. doi:10.1145/1950365.1950380.
- [12] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin C. Lee, Doug Burger, and Derrick Coetzee. Better I/O through byte-addressable, persistent memory. In Proc. of the 22nd ACM Symposium on Operating Systems Principles (SOSP), pages 133–146, 2009. doi:10.1145/1629575.1629589.
- [13] Sahil Dhoked, Wojciech Golab, and Neeraj Mittal. Modular recoverable mutual exclusion under system-wide failures. In Proc. of the 37th International Symposium on Distributed Computing (DISC), pages 17:1–17:24, 2023. doi:10.4230/LIPICS.DISC.2023.17.
- [14] Sahil Dhoked, Wojciech Golab, and Neeraj Mittal. Modular recoverable mutual exclusion under system-wide failures. In Proc. of the 37th International Symposium on Distributed Computing (DISC), pages 17:1–17:24, 2023. doi:10.4230/LIPICS.DISC.2023.17.
- [15] Sahil Dhoked and Neeraj Mittal. An adaptive approach to recoverable mutual exclusion. In Proc. of the 39th ACM Symposium on Principles of Distributed Computing (PODC), pages 1–10, New York, NY, USA, 2020. doi:10.1145/3382734.3405739.
- [16] Edsger W. Dijkstra. Solution of a problem in concurrent programming control. Communications of the ACM, 8(9):569, 1965. doi:10.1145/365559.365617.
- [17] Michal Friedman, Naama Ben-David, Yuanhao Wei, Guy E. Blelloch, and Erez Petrank. Nvtraverse: in NVRAM data structures, the destination is more important than the journey. In Proc. of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI), pages 377–392, 2020. doi:10.1145/3385412.3386031.
- [18] Michal Friedman, Maurice Herlihy, Virendra J. Marathe, and Erez Petrank. Brief announcement: A persistent lock-free queue for non-volatile memory. In Proc. of the 31st International Symposium on Distributed Computing (DISC), volume 91, pages 50:1–50:4, 2017. doi:10.4230/LIPICS.DISC.2017.50.
- [19] Michal Friedman, Maurice Herlihy, Virendra J. Marathe, and Erez Petrank. A persistent lock-free queue for non-volatile memory. In Proc. of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 28–40, 2018. doi:10.1145/3178487.3178490.
- [20] Wojciech Golab. The recoverable consensus hierarchy. In Proc. of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 281–291, 2020. doi:10.1145/3350755.3400212.
- [21] Wojciech Golab, Vassos Hadzilacos, Danny Hendler, and Philipp Woelfel. RMR-efficient implementations of comparison primitives using read and write operations. Distributed Computing, 25(2):109–162, 2012. doi:10.1007/S00446-011-0150-8.
- [22] Wojciech Golab and Danny Hendler. Recoverable mutual exclusion in sub-logarithmic time. In Proc. of the 36th ACM Symposium on Principles of Distributed Computing (PODC), pages 211–220, 2017. doi:10.1145/3087801.3087819.
- [23] Wojciech Golab and Danny Hendler. Recoverable mutual exclusion under system-wide failures. In Proc. of the 37th ACM Symposium on Principles of Distributed Computing (PODC), pages 17–26, 2018. doi:10.1145/3212734.3212755.
- [24] Wojciech Golab and Aditya Ramaraju. Recoverable mutual exclusion. In Proc. of the 35th ACM Symposium on Principles of Distributed Computing (PODC), pages 65–74, 2016.
- [25] Wojciech Golab and Aditya Ramaraju. Recoverable mutual exclusion. Distributed Computing, 32(6):535–564, 2019. doi:10.1007/S00446-019-00364-0.
- [26] Rachid Guerraoui and Ron R. Levy. Robust emulations of shared memory in a crash-recovery model. In Proc. of the 24th International Conference on Distributed Computing Systems (ICDCS), pages 400–407, 2004. doi:10.1109/ICDCS.2004.1281605.
- [27] M. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, July 1990. doi:10.1145/78969.78972.
- [28] Maurice Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124–149, 1991. doi:10.1145/114005.102808.
- [29] Maurice Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, 1990. doi:10.1145/78969.78972.
- [30] Joseph Izraelevitz, Terence Kelly, and Aasheesh Kolli. Failure-atomic persistent memory updates via JUSTDO logging. In Proc. of the 21s International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 427–442, 2016. doi:10.1145/2872362.2872410.
- [31] Joseph Izraelevitz, Hammurabi Mendes, and Michael L. Scott. Linearizability of persistent memory objects under a full-system-crash failure model. In Proc. of the 30th International Symposium on Distributed Computing (DISC), pages 313–327, 2016. doi:10.1007/978-3-662-53426-7_23.
- [32] Prasad Jayanti, Siddhartha Jayanti, and Sucharita Jayanti. Durable algorithms for writable LL/SC and CAS with dynamic joining. In Proc. of the 37th International Symposium on Distributed Computing (DISC), pages 25:1–25:20, 2023. doi:10.4230/LIPICS.DISC.2023.25.
- [33] Prasad Jayanti, Siddhartha Jayanti, and Anup Joshi. Constant RMR system-wide failure resilient durable locks with dynamic joining. In Proc. of the 35th ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 227–237, 2023. doi:10.1145/3558481.3591100.
- [34] Prasad Jayanti, Siddhartha V. Jayanti, and Anup Joshi. A recoverable mutex algorithm with sub-logarithmic RMR on both CC and DSM. In Proc. of the 38th ACM Symposium on Principles of Distributed Computing (PODC), pages 177–186, 2019. doi:10.1145/3293611.3331634.
- [35] Daniel Katzan and Adam Morrison. Recoverable, abortable, and adaptive mutual exclusion with sublogarithmic RMR complexity. In Proc. of the 24th International Conference on Principles of Distributed Systems (OPODIS), pages 15:1–15:16, 2021.
- [36] Tomer Lev Lehman, Hagit Attiya, and Danny Hendler. Recoverable and detectable self-implementations of swap. In Proc. of the 27th International Conference on Principles of Distributed Systems (OPODIS), pages 24:1–24:22, 2023. doi:10.4230/LIPICS.OPODIS.2023.24.
- [37] Nan Li and Wojciech Golab. Detectable sequential specifications for recoverable shared objects. In Proc. of the 35th International Symposium on Distributed Computing, volume 209 of DISC, pages 29:1–29:19, 2021. doi:10.4230/LIPICS.DISC.2021.29.
- [38] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21–65, 1991. doi:10.1145/103727.103729.
- [39] Mohammad Moridi, Erica Wang, Amelia Cui, and Wojciech M. Golab. A closer look at detectable objects for persistent memory. In Proc. of the Workshop on Advanced tools, programming languages, and PLatforms for Implementing and Evaluating algorithms for Distributed systems (ApPLIED), pages 56–64, 2022. doi:10.1145/3524053.3542749.
- [40] Liad Nahum, Hagit Attiya, Ohad Ben-Baruch, and Danny Hendler. Recoverable and detectable fetch&add. In Proc. of the 25th International Conference on Principles of Distributed Systems (OPODIS), pages 29:1–29:17, 2021. doi:10.4230/LIPICS.OPODIS.2021.29.
- [41] Dushyanth Narayanan and Orion Hodson. Whole-system persistence. In Proc. of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 401–410, 2012. doi:10.1145/2150976.2151018.
- [42] Steven Pelley, Peter M. Chen, and Thomas F. Wenisch. Memory persistency: Semantics for byte-addressable nonvolatile memory technologies. IEEE Micro, 35(3):125–131, 2015. doi:10.1109/MM.2015.46.
- [43] Andy Rudoff. cascade lake doesn’t support clwb?, 2021. [last accessed 11/04/2024]. URL: https://groups.google.com/g/pmem/c/DRdYIc70RHc/m/rtoP681rAAAJ.
- [44] Andy Rudoff and the Intel PMDK Team. Persistent memory development kit, 2020. [last accessed 2/11/2021]. URL: https://pmem.io/pmdk/.
- [45] Matan Rusanovsky, Hagit Attiya, Ohad Ben-Baruch, Tom Gerby, Danny Hendler, and Pedro Ramalhete. Flat-combining-based persistent data structures for non-volatile memory. In Proc. of the 23rd International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS), pages 505–509, 2021. doi:10.1007/978-3-030-91081-5_38.
- [46] Haris Volos, Andres Jaan Tack, and Michael M. Swift. Mnemosyne: lightweight persistent memory. In Proc. of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 91–104, 2011. doi:10.1145/1950365.1950379.
- [47] Yuanhao Wei, Naama Ben-David, Michal Friedman, Guy E. Blelloch, and Erez Petrank. Flit: a library for simple and efficient persistent algorithms. In Proc. of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP, pages 309–321, 2022. doi:10.1145/3503221.3508436.
Appendix A Proof of Correctness from Section 4
A.1 Proof of Theorem 1
Proof.
The linearization points defined below in Table 4 establish a strict linearization for any history of the generic construction. Any pending operation that is included in this linearization must be completed with a matching response, and the only applicable case is a pending detectable-FA operation that reaches line 18. The response of such an operation is the value written earlier to at line 11, which would be returned at line 21 if the operation had run to completion.
| Operation | Special case | Linearization point |
|---|---|---|
| detectable-FA | non-trivial state transition applied to at line 18 (i.e., overwritten with new value) | write to at line 18 |
| detectable-FA | trivial state transition applied to at line 18 (i.e., overwritten with same value) and line 20 completed in the same operation | write to at line 18 |
| detectable-FA | remaining cases | n/a (does not take effect) |
| resolve | operation runs to completion | arbitrary point in the operation’s execution |
| resolve | remaining cases | n/a (does not take effect) |
It remains to show that the chosen strict linearization is a legal history, meaning that each operation returns a correct response. Recall that correctness in this context is interpreted with respect to the implemented object type, which is a subset of Moridi et al.’s UDSS [39] applied to a generic FA operation that returns the previous abstract state of the object and applies a state transition according to the user-specified function .
We proceed by first proving a state invariant in Lemma 5 below:
Lemma 5.
Consider a finite history of the generic construction, let be the corresponding strict linearization determined by the linearization points defined in Table 4, and let denote the abstract state of the implemented object at the end of . Recall that is the state of the underlying non-detectable type (in our case FA), identifies the signature (in our case the function ) of the last detectable operation by in , is the tag argument of this operation, and is the operation’s response. Recall also that if does not have any detectable operation in . Then the state is related to the variables of the generic construction at the end of as follows:
-
if then , , and
-
if then , , and
Moreover, (i.e., the abstract value of the implemented object matches the value stored in the variable ).
Correctness of the detectable-FA operation follows from the last clause of the invariant () and the use of the recoverable mutex to protect updates to . The mutex ensures that the value of does not change between when reads at line 11 and when executes the linearization point of the detectable-FA operation at line 18 before returning the value read at line 21, which implies that the return value equals the value of immediately before the linearization point. Next, consider the resolve operation. A completed call to resolve by enforces via line 27, line 32, or line 41, and so Lemma 5 ensures that , , and when reaches line 42. Since ’s resolve operation returns at line 42, the lemma implies that the return value is equal to , as required by the UDSS.
Before proceeding with the proof of Lemma 5, we establish a secondary lemma that captures the effect of the helping mechanism.
Lemma 6.
When a process reaches line 37 of resolve, the value of is the value of the variable as of when last crashed during an execution of detectable-FA.
Proof.
First, note that can only reach line 37 when due to ’s earlier execution of line 33 and branch to line 34. This implies that invoked detectable-FA at least once since can only be established by at line 17, and that crashed during its last invocation of detectable-FA as otherwise it would have reached line 20 and reset back to 0 before returning. Furthermore, this most recent crash occurred after line 17 since Assumption 1 requires to complete a call to resolve after each crash during detectable-FA, and since every complete execution of resolve by resets to 0 at line 27, line 32, or line 41. Since ’s last crash in detectable-FA occurred after line 17, it follows that already completed line 15 and assigned , where the latter condition holds until the point of failure due to the critical section enforced by .
Next, consider how arrived at line 37 of resolve. If at line 35 then is the value read by from at line 34, and in this case no process has overwritten between ’s most recent crash in detectable-FA and ’s execution of line 34. To see this, note that any process that overwrites at line 18 after ’s crash must first acquire the mutex at line 9 after ’s crash, and then execute the helping mechanism at lines 13–14 inside the critical section. The first process that executes the helping mechanism after ’s crash does so with , which ensures that is overwritten with a value different from at line 14. Thus, the lemma holds when at line 35. Otherwise observes at line 35, and is the value read by from at line 36. In this case some process already executed the helping mechanism with before read at line 37, and so at that point already holds the value read by from at line 14. Since executes the helping mechanism after ’s crash and before any process overwrites , as explained earlier, the lemma holds in this case as well.
Finally, we present the proof of Lemma 5 below.
Proof of Lemma 5.
We proceed by induction on the number of steps, , in the history . We will use subscript notation in reference to the value of a program variable or abstract state component at the end of a history, for example denoting the value of at the end of by , or denoting the value of at the end of a linearization of by .
In the base case , and . Then , , and by initialization. Similarly, and by definition of the initial state of the UDSS. The lemma holds since , , , and .
Next, choose an arbitrary and suppose for induction that the lemma holds for any history of length . Fix such a history and let be an extension of by another step by some process . Let and denote the strict linearizations of and , respectively. We proceed by a case analysis on . Steps where accesses , or where merely reads a program variable, are excluded from the analysis since the lemma follows directly from the induction hypothesis in those cases. The same principle applies to crash steps.
Case A.
assigns at line 1 of detectable-FA. It follows from Assumption 1 that can only invoke detectable-FA if this is ’s first operation invocation, or if its last operation was a complete execution of detectable-FA or resolve. Then holds either by initialization or by the action of ’s last operation, and so the induction hypothesis states that , , , and . The lemma requires , , , and since . The lemma follows from the induction hypothesis since maintains , and does not affect any variable aside from .
Case B.
Case C.
assigns at line 5 of detectable-FA. Then holds by ’s earlier execution of line 1. The induction hypothesis states that , , , and . The lemma requires , , , and since . Step maintains and does not affect any variable other than , and so the lemma holds by ’s prior execution of lines 2–4, which copy , and to , and , respectively.
Case D.
updates , or at lines 6–11 of detectable-FA. Then holds by ’s earlier execution of line 5. The induction hypothesis states that , , , and . The lemma requires , , , and since . The lemma follows from the induction hypothesis since maintains , preserves , and does not affect any variable aside from , , and .
Case E.
assigns at line 17 of detectable-FA. Then holds by ’s earlier execution of line 5. The induction hypothesis states that , , , and . The lemma requires , , , and since and since has not yet applied a state transition to at line 18. The lemma follows from the induction hypothesis since maintains and does not affect any variable other than .
Case F.
updates at line 14 or updates at line 15 of detectable-FA. Then holds by ’s earlier execution of line 17. The induction hypothesis states that , , , and since has not yet applied a state transition to at line 18. Similarly, the lemma requires , , , and . The lemma follows from the induction hypothesis since maintains and does not affect any variable other than or .
Case G.
applies a nontrivial state transition to at line 18 of detectable-FA. In this case extends by one detectable-FA operation, , and holds by ’s earlier execution of line 17. The induction hypothesis states that , , , and . The lemma requires , , , and where , , and refer to ’s current detectable-FA operation. It follows from ’s earlier execution of line 6 that , as required. It follows from ’s earlier execution of line 7 that , as required. It follows from ’s earlier execution of line 11 and from the fact that updates to are protected by that , where since is the linearization point of ’s current detectable-FA operation. Thus, holds, as required. Finally, follows by the action of step , which updates to the new abstract value computed via the function , and where the input to is since holds the recoverable mutex while reading and writing at line 18. Thus, the lemma holds.
Case H.
applies a trivial state transition to at line 18 of detectable-FA. In this case , , and holds by ’s earlier execution of line 17. The induction hypothesis states that , , , and . The lemma requires , , , and . The lemma follows from the induction hypothesis since maintains and does not affect any variable aside from , where .
Case I.
Subcase I1.
applied a nontrivial state transition to earlier at line 18. In this case ’s current detectable-FA operation already reached its linearization point prior to step . The induction hypothesis states that , , , and . The lemma follows from the induction hypothesis since maintains and does not affect any variable aside from .
Subcase I2.
applied a trivial state transition to earlier at line 18. In this case ’s current detectable-FA operation takes effect at step , and so extends by one detectable-FA operation. The induction hypothesis states that . The lemma requires , , , and , where , , and refer to ’s current detectable-FA operation. It follows from ’s earlier execution of line 6 that , as required. It follows from ’s earlier execution of line 7 that , as required. It follows from ’s earlier execution of line 11 and from the fact that updates to are protected by that , where since is the linearization point of ’s current detectable-FA. Thus, holds, as required. Finally, follows despite since applied a trivial state transition in its current detectable-FA operation, follows from the induction hypothesis, and follows since does not update . Thus, holds as required, and the lemma holds.
Case J.
Case K.
Case L.
assigns at line 32 of resolve. Then by ’s earlier execution of line 28 and branch to line 29. The induction hypothesis states that , , , and . The lemma requires , , , and . The lemma follows from the induction hypothesis, since maintains and step does not modify , and also from ’s earlier execution of lines 29–31, where restores , , and from , , and .
Case M.
updates , , or at lines 38–40. Then by ’s earlier execution of line 33 and branch to line 34. Furthermore, the value of ’s private variable equals the value of as of when failed during its last execution of detectable-FA by Lemma 6. Then ’s execution of lines 38–40 implies that held at line 37, where is the value fetched from at line 12 during its last detectable-FA operation before assigning at line 17. This implies that never reached line 18, or it did reach this line but applied a trivial state transition to , given our earlier conclusion regarding the value of and the fact that protects lines 12–18. In either case, ’s last detectable-FA operation did not take effect, and so the induction hypothesis states that , , , and . The lemma requires , , , and . The lemma follows from the induction hypothesis since maintains and does not affect any variable aside from , , and .
Subcase N.
assigns at line 41 of resolve. Then by ’s earlier execution of line 33 and branch to line 34. The lemma requires , , , and . If ’s last detectable-FA operation took effect by applying a nontrivial state transition to then the induction hypothesis states that , , , and . In this case the lemma follows from the induction hypothesis since maintains and does not affect any variable other than . Otherwise ’s last detectable-FA operation either applied a trivial state transition to , or did not reach line 18 at all, and hence did not take effect. In this case the induction hypothesis states that , , , and . Now consider ’s earlier comparison of against at line 37, and note that is the value fetched from at line 12 during its last detectable-FA operation before assigning at line 17. Furthermore, the value of ’s private variable equals the value of as of when failed during its last execution of detectable-FA by Lemma 6. Since never reached line 18, or it did reach this line but applied a trivial state transition to , and since protects lines 12–18, it follows that held at line 37. Thus, branched to line 38 and executed lines 38–40 prior to reaching line 41. Given the induction hypothesis, lines 38–40 ensure that , , and by copying , and to , and , respectively. Finally, holds since maintains and does not write . Thus, the lemma holds. This concludes the proof of Theorem 1.
A.2 Proof of Theorem 2
Proof.
Aside from accesses to the recoverable mutex at lines 8–9, line 19, lines 23–25, the operations detectable-FA and resolve are wait-free and have step complexity. Termination and RMR complexity therefore both depend on . This mutex is accessed correctly in the generic construction, meaning that its recovery, entry, and exit sections are called in the correct order, and that Recover is always called after a crash outside the non-critical section due to Assumption 1. Thus, each passage through terminates eventually in any infinite fair history by the starvation freedom of , and incurs RMRs in any history. Since each detectable-FA and resolve operation makes at most one passage through , it follows that the worst-case RMR complexity per operation in the generic construction is .
Appendix B Adapting the Detectable Construction for FCFS
While our generic construction of detectable objects does not preserve wait-freedom due to the use of an RME lock in the detectable-FA operation, there are several concrete ways in which it can guarantee partial wait freedom. Several techniques were discussed already in Section 4, and so we focus here on optimizations related to supporting first-come-first-serve (FCFS) fairness in RME and RME-like algorithms based on our detectable objects.
If the recoverable mutex supports first-come-first-serve (FCFS) (Definition 7) then it can be shown that the detectable-FA operation always appears to take effect inside the doorway of . Such operations are thus wait-free up to their linearization point provided that also provides bounded recovery. Note that steps taken up to the end of the doorway of at line 9 do not by themselves decide whether or not a particular detectable-FA operation will take effect since the latter depends on a process subsequently reaching line 18 or line 20 before crashing (see Table 4 in Appendix A). Thus, the exact order of linearization points is not necessarily determined without some busy-waiting, for example where completes the doorway of before starts the doorway, then waits for to complete the CS of , crashes before starting the CS, finally updates at line 18, and only then it becomes irrevocable that ’s detectable-FA operation takes effect while ’s does not. However, this limitation is not restrictive if the implemented detectable-FA operation is by itself used to enforce FCFS fairness in some recoverable mutex that uses detectable base objects, meaning that the recovery section and doorway of at lines 8–9 become the doorway of . In this specific use case, the relative order of doorway executions only matters under Definition 7 for pairs of processes that actually enter the CS of in their corresponding passages, which occurs only after completing the entry section of failure-free, including the entire detectable-FA operation under consideration; this ensures that the linearization points are well-defined.
Definition 7 (FCFS).
An RME algorithm satisfies first come first serve (FCFS) if for any two passages by distinct processes and , if completes the doorway before starts the doorway and both processes enter the CS in the corresponding passages then enters the CS before .
Theorem 8.
Suppose that supports the FCFS property. Then detectable-FA operations appear to take effect in the doorway of , meaning that for every history of the generic construction, there is a strict linearization such that the total order of operations in is consistent with the partial order of doorway executions corresponding to line 9. Specifically, if holds then is false, where is the total order over operations in , and is the partial order over operations in such that if (and only if) and are both detectable-FA operations and the doorway of is completed in before it is started in .
Proof.
It follows from Theorem 1 that has a strict linearization consistent with the linearization points defined in Table 4 (see Appendix A). Take any such , and let denote the corresponding total order of operations, including both detectable-FA and resolve. Now suppose for contradiction that and both hold for some pair of operations, and so the doorway in was completed before the doorway in was started. It follows from and the FCFS property of that if the CS is entered in both and then it is entered in before . Since each operation in linearizes at a point inside the CS according to Table 4, the actual order of CS executions in implies that , which contradicts our earlier supposition.
A slightly weaker result can be shown if only supports the -FCFS property [25] for some , stated below in Definition 9, which refers to the concepts of failure-concurrency and interfering super-passages. A super-passage by a process is 0-failure-concurrent (0-FC) if crashes in this super-passage. Two super-passages interfere if neither super-passage ends (with a complete failure-free passage) before the other starts. A super-passage is -failure-concurrent (-FC) for some if it interferes with some -FC super-passage (possibly of the same process). A passage is -FC for some if it belongs to a -FC super-passage.
Definition 9 (-FCFS).
An RME algorithm satisfies -FCFS for some if for any two passages by distinct processes and , if completes the doorway before starts the doorway and both processes enter the CS in the corresponding passages then enters the CS before unless at least one of these passages is -FC.
To connect the -FCFS property of to the behavior of our generic construction, we generalize the concepts of super-passages and -FC to object implementations. A super-passage of a process with respect to the generic construction is a maximal contiguous sequence of operations by where at most the last operation in the sequence is complete. For example, a single complete execution of detectable-FA is a super-passage, and similarly for an execution of detectable-FA that is interrupted by a crash followed by one or more executions of resolve where only the last resolve operation is complete. We say that a super-passage of the generic construction by is 0-FC if it contains at least one operation interrupted by ’s crash, which in this context implies that the first operation in the super-passage is the interrupted one. Two super-passages interfere if neither super-passage ends (with a completed operation) before the other starts. A super-passage of the generic construction is -FC for some if it interferes with some -FC super-passage. An operation applied to the generic construction is -FC for some if it belongs to a -FC super-passage.
Theorem 10.
Suppose that supports the -FCFS property for some . Then detectable-FA operations appear to take effect in the doorway of in all super-passages of the generic construction that are not -FC, meaning that for every history of the generic construction, there is a strict linearization such that the total order of operations in is consistent with the partial order of doorway executions corresponding to line 9 in operations that are not -FC. Specifically, if holds then is false for any pair of operations and that are not -FC, where is the total order over operations in , and is the partial order over operations in such that if (and only if) and are both non--FC detectable-FA operations and the doorway of is completed in before the doorway of is started in .
Proof.
We proceed roughly as in the proof of Theorem 8. It follows from Theorem 1 that has a strict linearization consistent with the linearization points defined in Table 4 (see Appendix A). Take any such , and let denote the corresponding total order of operations, including both detectable-FA and resolve. Now suppose for contradiction that and both hold for some pair of operations, and so the doorway in was completed before the doorway in was started, where neither nor is -FC. It follows from and the -FCFS property of that if the CS is entered in both and then it is entered in before . Since each operation in linearizes at a point inside the CS according to Table 4, the actual order of CS executions in implies that , which contradicts our earlier supposition.
Appendix C Bounding Space Complexity in the Detectable Object Construction
The space complexity of the algorithm can be bounded by reclaiming and reusing the queue nodes allocated at line 58. In the absence of failures, each process can maintain a single node and reuse it safely across consecutive passages, but crash failures introduce the risk of an unsafe memory operation where a predecessor process signals its successor redundantly by executing line 71 of Exit from line 57 of Recover, after the successor has already repurposed its queue node for a new passage. This particular use-after-free scenario can be made safe by storing a pointer to the predecessor’s queue node in the field instead of a boolean, similarly to Algorithm 3 in [13]. The domain of values for must also include a designated initial value , as well as a special value indicating that the queue node was already locked and then unlocked. The algorithm is revised slightly in several places: at line 51 of Recover the condition is replaced with ; at line 60 of Enter is initialized to instead of false; at line 64 of Enter is set to prev instead of true; at line 66 of Enter the condition is replaced with ; and at line 71 of Exit the assignment is replaced with . Another potentially unsafe memory access occurs when a successor links with a predecessor at line 65 of Enter from line 52 of Recover, however this is already addressed by the condition at line 51 (now modified to ). The latter ensures that line 52 is reached only if lines 64–65 have not yet been executed, and accounts for the possibility that the predecessor has already moved on to its next passage, in which case may hold but is false since the predecessor has already completed the revised line 71 at least once and updated to .
