Long-lived counters with polylogarithmic amortized step complexity

A shared-memory counter is a widely-used and well-studied concurrent object. It supports two operations: an Inc operation that increases its value by 1 and a Read operation that returns its current value. In Jayanti et al. (SIAM J Comput, 30(2), 2000), Jayanti, Tan and Toueg proved a linear lower bound on the worst-case step complexity of obstruction-free implementations, from read-write registers, of a large class of shared objects that includes counters. The lower bound leaves open the question of finding counter implementations with sub-linear amortized step complexity. In this work, we address this gap. We show that n-process, wait-free and linearizable counters can be implemented from read-write registers with O(log² n) amortized step complexity. This is the first counter algorithm from read-write registers that provides sub-linear amortized step complexity in executions of arbitrary length. Since a logarithmic lower bound on the amortized step complexity of obstruction-free counter implementations exists, our upper bound is within a logarithmic factor of optimal. The worst-case step complexity of the construction remains linear, which is optimal. This is obtained thanks to a new max register construction with O(log n) amortized step complexity in executions of arbitrary length in which the value stored in the register does not grow too quickly.
We then leverage an existing counter algorithm by Aspnes, Attiya and Censor-Hillel [1], into which we "plug" our max register implementation, and show that it remains linearizable while achieving O(log² n) amortized step complexity.


Introduction
A shared-memory counter [24] is a well-studied [3,6,9,12,23] and widely-used concurrent object. A counter stores a non-negative integer and supports two operations: an Inc operation that increases its value by 1 and a Read operation that returns its current value. (A preliminary version of this work appeared in DISC 2019. Author affiliations: Ben-Gurion University of the Negev, Beersheba, Israel; LIS, Aix-Marseille University, Marseille, France.)
A wait-free counter can be constructed easily by using a single-writer atomic snapshot [2,6,10] object. Such an object allows each process to update its own component (by invoking an Update operation) and to obtain an atomic view of all components (by invoking a Scan operation). To increment the counter, a process p simply increments its component. To read the counter's value, p invokes Scan and returns the sum of all components in the view it obtains. Since wait-free atomic snapshots can be implemented from read-write registers with step complexity linear in the number of processes n [5,19], so can counters.
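The snapshot-based construction just described can be condensed into a short sequential sketch. The Python below is a single-threaded illustrative model: the Snapshot class merely stands in for a linearizable single-writer atomic snapshot object in the sense of [5,19], so none of the concurrency issues are visible here, and the class and method names are ours, not from the literature.

```python
class Snapshot:
    """Stand-in for a single-writer atomic snapshot: update(i, v) writes
    component i; scan() returns an (atomic) view of all components."""
    def __init__(self, n):
        self.components = [0] * n

    def update(self, i, v):
        self.components[i] = v

    def scan(self):
        return list(self.components)


class SnapshotCounter:
    """Counter from a snapshot: process i increments its own component;
    a read returns the sum of all components obtained in one scan."""
    def __init__(self, n):
        self.snap = Snapshot(n)

    def inc(self, i):
        self.snap.update(i, self.snap.scan()[i] + 1)

    def read(self):
        return sum(self.snap.scan())
```

With a wait-free atomic snapshot underneath, both operations inherit its O(n) step complexity.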
A well-known result [21] by Jayanti, Tan and Toueg showed this is tight. They proved a linear lower bound on the worst-case step complexity of obstruction-free implementations of a large class of shared objects, including counters, from operations in a set that includes (among other operations) read and write. In [1], Aspnes, Attiya and Censor-Hillel observed that the lower bound holds only when numerous operations are applied to the object and does not rule out the existence of algorithms whose step complexity is sub-linear when the number of operations is bounded. Leveraging this observation, they presented constructions of several data structures for which operations' step complexity is polylogarithmic in n as long as the object's value is polynomial in n, where n is the number of processes. More precisely, they presented a wait-free counter for which the step complexity is O(min(log n · log v, n)) for Inc operations and O(min(log v, n)) for Read operations, where v is the object's current value. However, the worst-case and amortized step complexities of the counter algorithm of [1] deteriorate as the number of Inc operations increases. For executions in which the number of Inc operations is exponential in n, both the worst-case and the amortized step complexities become the same as those of the snapshot-based algorithm, that is, linear in n.

Our contribution The lower bound of [21] leaves open the question of whether there exists a counter algorithm with sub-linear amortized step complexity. In this paper, we answer this question in the affirmative, by showing that linearizable and wait-free counters for n processes can be implemented from read-write registers with polylogarithmic amortized step complexity. Intuitively, an implementation is wait-free if every process can complete its operation in finitely many steps, regardless of the behavior of the other processes.
This is the first wait-free counter from read-write registers that provides sub-linear amortized step complexity in executions of arbitrary length. We reuse the counter algorithm presented in [1]. Their counter algorithm uses max registers, an object type they introduced and implemented. A max register r supports a WriteMax(r ,v) operation that writes a non-negative integer v to r and a ReadMax(r ) operation that returns the maximum value previously written to r .
We present a novel wait-free deterministic implementation of an unbounded max register and "plug it" into the counter algorithm of [1]. We show that the resulting counter remains linearizable, is wait-free and has O(log² n) amortized step complexity. The worst-case step complexity is O(n), which is optimal [21]. Aspnes et al. also presented an unbounded max register; however, the step complexities of both ReadMax and WriteMax operations in their algorithm are O(min(log v, n)), where v is the object's current value. Thus, executions of arbitrary length can have linear amortized step complexity. Aspnes and Censor-Hillel [4] presented an unbounded max register implementation for which every operation terminates in a constant number of steps with high probability, under the assumption that the max register's value does not grow too quickly. Our implementation of an unbounded max register makes a similar assumption. The max register algorithm of [4] is randomized, whereas ours is deterministic. The space complexity of our implementation is unbounded.
Using information-theoretic arguments, Jayanti established a logarithmic lower bound on the worst-case operation step complexity of obstruction-free implementations of a set of one-time objects that includes a fetch&increment object, from operations such as load-linked/store-conditional, move and swap [20]. Attiya and Hendler [7] presented lower bounds on the time and space complexities of obstruction-free implementations of several objects from k-word compare-and-swap operations. Also using an information-theoretic argument, they proved [7, Theorem 9] a logarithmic lower bound on the amortized step complexity of implementing a one-time fetch&increment object in an obstruction-free manner. Their proof can be modified in a straightforward manner to establish the same result for counters, implying that the amortized step complexity of our algorithm is at most within a logarithmic factor of optimal.

Related work Counting networks [8], presented by Aspnes, Herlihy and Shavit, allow processes to assign themselves successive values in a given range. They are similar to Batcher's sorting networks [11], except that instead of comparators, they are constructed by interconnecting simple objects called balancers. A balancer has several input and output wires and balances tokens received on its input wires over its output wires. Balancers are typically implemented from registers supporting read-modify-write operations. To obtain a number, a process shepherds a token from an input wire of the network to an output wire. The step complexity thus depends on the number of traversed balancers, which can be as low as O(log n) [13] for n processes. Counting networks are, however, in general not linearizable. Herlihy, Shavit and Waarts have shown a lower bound of Ω(n) [17] on the depth of n-process counting networks that are linearizable.
A fetch&increment object stores a non-negative integer and supports a single operation that increments the value stored in the object and returns the previous value. It cannot be implemented deterministically in a wait-free manner from read-write registers, since its consensus number is 2 [15]. Optimal implementations for n processes from load-linked/store-conditional objects are presented in [14] by Ellen and Woelfel, in which each fetch&increment operation has O(log n) step complexity. A fast implementation of a counter from compare-and-swap is presented by Khanchandani and Wattenhofer in [22]. The step complexity is O(log n) for Inc and constant for Read. Our counter implementation requires only read-write registers but has O(log² n) amortized step complexity per operation. The algorithms presented in [14,22] have bounded space complexity: the number of base objects they use is bounded by a polynomial function of n. Unlike theirs, our algorithm has unbounded space complexity.
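To make the object concrete, here is a sequential specification of fetch&increment together with a lock-free compare-and-swap retry loop, sketched in Python. The `_cas` helper emulates a hardware compare-and-swap with a lock; this is purely illustrative and is not the wait-free algorithm of [14] nor the counter of [22].

```python
import threading

class FetchAndIncrement:
    """fetch_and_increment() adds 1 to the stored value and returns
    the previous value, atomically."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()   # emulates an atomic compare-and-swap

    def _cas(self, expected, new):
        with self._lock:
            if self.value == expected:
                self.value = new
                return True
            return False

    def fetch_and_increment(self):
        while True:                     # classic lock-free CAS retry loop
            old = self.value
            if self._cas(old, old + 1):
                return old
```

The loop is lock-free but not wait-free: a process may retry forever if others keep succeeding, which is exactly the kind of guarantee the algorithms of [14,22] improve upon.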
Randomization is another approach to beat the lower bound of [21]. A randomized approximate counter from read-write registers is presented by Aspnes and Censor [3], with step complexity O(((1/δ) log n)^O(1)) for Inc and O(n^(4/5+ε) · ((1/δ) log n)^O(1)) for Read, where ε > 0 is a small constant, n is the number of processes and δ is the approximation ratio the counter achieves with high probability.
As mentioned previously, a deterministic implementation of a counter for n processes from read-write registers is presented in [1]. The step complexities are O(min(log n · log v, n)) and O(min(log v, n)) for Inc and Read operations, respectively. Here, v is the current value of the counter. The algorithm of [1] uses max registers as building blocks. A linearizable implementation of a max register from read-write registers with O(min(log v, n)) step complexity for reading or writing a value v is also given in [1]. In Sect. 3, we present a novel implementation of a max register from read-write registers and show that it achieves O(log n) amortized step complexity for ReadMax and WriteMax operations in executions in which the value stored in the register does not grow too quickly.
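To illustrate how max registers can serve as building blocks of a counter, the following single-threaded Python sketch arranges the n processes as leaves of a binary tree whose nodes hold stand-ins for max registers. It is written in the spirit of the construction of [1], but the names and details here are illustrative, not the algorithm of [1] itself.

```python
class TreeCounter:
    """Each process owns a leaf; every tree node stores, in a max
    register, the largest sum of its subtree written so far. inc()
    re-propagates sums along the leaf-to-root path; read() is a
    single ReadMax at the root."""
    def __init__(self, n):
        size = 1
        while size < n:                  # complete binary tree over n leaves
            size *= 2
        self.size = size
        self.val = [0] * (2 * size)      # val[j] stands in for node j's max register

    def inc(self, i):
        j = self.size + i                # leaf of process i
        self.val[j] += 1                 # increment own sub-count
        j //= 2
        while j >= 1:                    # WriteMax the fresh sum at each ancestor
            s = self.val[2 * j] + self.val[2 * j + 1]
            self.val[j] = max(self.val[j], s)
            j //= 2

    def read(self):
        return self.val[1]               # ReadMax at the root
```

With the bounded max registers of [1], each of the O(log n) WriteMax calls along the path costs O(min(log v, n)) steps, which is where the O(min(log n · log v, n)) Inc complexity quoted above comes from.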
Our max register construction shares some similarities with the one of [1]. Both constructions may be seen as binary trees in which internal nodes are switches (a bit stored in a read-write register) and leaves are bounded max registers. Specifically, leaves are trivial 0-bounded max registers in the case of [1] and m-bounded in our case, with m a function of n. Unlike the implementation of [1], our algorithm handles an unbounded number of operations in a wait-free manner without resorting to a snapshot-based implementation. The linearizability proof in [1] is recursive. Such a proof cannot be applied to our algorithm, because, unlike [1], our construction is non-recursive. Also, differently from [1], in our algorithm a process writing a value to the max register accesses only a constant number of nodes. Another difference is that our algorithm employs a helping mechanism so that the max register can be read in a wait-free manner with linear worst-case step complexity.
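The binary-tree view described above can be made concrete with a short sequential Python sketch of a tree-based bounded max register in the style of [1]: internal nodes are switch bits, a write descends according to the high-order bit of the value, and a read follows the switches toward the largest value written so far. This is a sequential illustration of the idea only, not a faithful rendering of the concurrent algorithm or its linearizability argument.

```python
class BoundedMaxReg:
    """Max register for values in [0, 2**h), built from a tree of
    switch bits. Writing a value >= half goes right and raises the
    switch; once a switch is set, the left subtree is obsolete."""
    def __init__(self, h):
        self.h = h
        self.switch = 0
        if h > 0:
            self.left = BoundedMaxReg(h - 1)
            self.right = BoundedMaxReg(h - 1)

    def write_max(self, v):
        if self.h == 0:
            return                        # trivial 0-bounded leaf
        half = 1 << (self.h - 1)
        if v < half:
            if self.switch == 0:          # skip: left half already obsolete
                self.left.write_max(v)
        else:
            self.right.write_max(v - half)
            self.switch = 1

    def read_max(self):
        if self.h == 0:
            return 0
        if self.switch == 0:
            return self.left.read_max()
        return (1 << (self.h - 1)) + self.right.read_max()
```

An operation on value v touches one node per level of its root-to-leaf path, which is the intuition behind the O(min(log v, n)) step complexity of [1].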
Unlike [1], correctness and logarithmic amortized step complexity are only guaranteed in executions in which there is a bound on the increments of the value of the max register. We establish in Sect. 4 that the way values change in the max registers used in the counter construction of [1] satisfies this restriction. Hence, by plugging our max register implementation into the construction of [1], we obtain a counter supporting an unbounded number of operations and whose amortized step complexity is polylogarithmic in the number of processes.
Moran et al. [24] defined the notion of a concurrent counter that may assume values from {0, …, m − 1}, for some positive integer m, for which increment operations are modulo m. They define and investigate two notions of counters. A static counter guarantees the correctness of increment operations but allows a read operation that is concurrent to an increment operation to return an arbitrary value. The second, stronger, notion is a dynamic counter, for which read operations must be linearizable regardless of whether or not they are concurrent to increment operations. The processes are anonymous in their model. They investigate the space complexity of implementing both static and dynamic counters from binary registers that support either reads and writes only, or stronger read-modify-write operations. Among other results, a wait-free static counter algorithm that uses log m bits and a wait-free dynamic counter algorithm that uses m bits are presented. For both these algorithms, the number of processes that may invoke the increment operation is unbounded. They also present a lower bound on the space complexity of dynamic counters. The counters investigated by our work (as well as by the works previously described) are dynamic and may assume unbounded values. Unlike [24], our model also assumes that processes have unique identifiers.

Organisation The rest of this article is organized as follows. We present the system model we assume and additional required definitions in Sect. 2. In Sect. 3, we present our key technical contribution: an unbounded max register algorithm that guarantees linearizability and logarithmic amortized step complexity when its value is not increased "too quickly". In Sect. 4, we prove that by "plugging" our unbounded max register into the counter algorithm of [1] (instead of using the max register algorithm of [1]), we obtain a linearizable counter with polylogarithmic amortized step complexity. The article is concluded with a discussion in Sect. 5.

Model and preliminaries
Shared memory We consider a standard asynchronous shared memory system in which a set P of n processes communicate by accessing shared registers. A register r stores a value from some set and supports two operations: Read(r), which returns the value of the register, and Write(r, v), which writes the value v to r.
An implementation of a concurrent object specifies the object's state representation and the algorithms processes follow when they perform operations supported by the object. An execution is a sequence of steps performed by the processes as they follow their algorithms; in each step, a process invokes an operation, returns an operation response, or applies at most a single Read or Write operation to a register (possibly in addition to some local computation). An execution is well-behaved if no process invokes an operation on an object before having received a response from its previous invocation on the same object. In what follows, we consider only well-behaved executions.
The execution interval of an operation starts with the operation's invocation and ends when a response is returned. An operation op precedes another operation op′ if the execution interval of op ends before the execution interval of op′ starts. op and op′ are concurrent if their execution intervals intersect. An operation is complete in an execution if it returned a response in the execution. An execution e is linearizable [18] if, for all completed operations in e and some of the uncompleted operations in e, there is a point in the execution interval of each such operation, called its linearization point, such that the responses returned are the same as the responses returned if all these operations were executed sequentially in the order of their linearization points. An implementation is linearizable if all its executions are linearizable; it is wait-free [15] if, in every execution, each process completes its operation within a finite number of steps; it is lock-free if, in every execution, at least one process completes its operation after performing a finite number of steps. An implementation is obstruction-free [16] if, after any finite execution, any process can complete its operation by taking a bounded number of steps when no other process takes steps.

Complexity measure The amortized step complexity is defined as the worst-case (taken over all possible finite executions) average number of steps performed by operations. It measures the performance of an implementation as a whole rather than the performance of individual operations. Indeed, in an execution of a lock-free implementation, some operations may never terminate and the worst-case operation step complexity may thus be unbounded. Amortized step complexity is formally defined as follows. We denote by nsteps(op, e) the number of steps performed by an operation op in e and by OP(e) the set of operations that are invoked in e. The amortized step complexity of an implementation A is then:
AmtSteps(A) = max over all finite executions e of A of ( Σ_{op ∈ OP(e)} nsteps(op, e) ) / |OP(e)|.
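As a concrete reading of this definition, the following small Python helper computes the average number of steps per operation for one finite execution, given a hypothetical per-operation step count; the amortized step complexity of an implementation is the maximum of this quantity over all its finite executions.

```python
def amortized_steps(nsteps):
    """nsteps maps each operation invoked in an execution e to the
    number of steps it performs in e (nsteps(op, e) in the text)."""
    return sum(nsteps.values()) / len(nsteps)
```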
Polylogarithmic amortized step complexity max register
The pseudo-code of our unbounded max register is presented in Algorithm 1. Lines in black font constitute a lock-free version of the algorithm, which we describe and analyze in this section. Lines in a lighter color add a helping mechanism that makes the algorithm wait-free. For presentation simplicity, we defer the description of this mechanism to Sect. 3.3. We proceed with a description of Algorithm 1. An UnboundedMaxReg_m object M consists of an infinite number of shared bounded MaxReg_m max registers, denoted max_j, for all j ∈ N_0. Register max_j is used for representing values in the range [m · j, m · (j + 1) − 1]. Hence, the subscript m in the type UnboundedMaxReg_m refers to the bound m of the bounded max registers used by objects of this type. Each bounded max register max_j is associated with a shared switch_j bit which is stored in a read-write register. All max registers and their corresponding switches are initialized to 0. Each process i has a local variable last_i, storing the index j of the bounded register max_j that will be accessed next by i's Read operation. last_i is initialized to 0 for each process i.

Algorithm 1
Unbounded Max Register UnboundedMaxReg_m, code for process i.

Shared variables:
  switch_j ∈ {0, 1}: a 1-bit register for each j ∈ N_0, initially all 0
  max_j: a MaxReg_m object for each j ∈ N_0, initially all 0
  H[n]: helping array of integer triplets, entry i written by process i, initially all (−1, 0, −1)
Local variables of process i:
  last_i ∈ N_0: smallest index j such that process i has not yet accessed max_j, initially 0
  hCount: the number of times i wrote to H[i], initially 0

function Write(v)
 2:  k ← ⌊v/m⌋
 3:  v′ ← v mod m
 4:  if switch_k = 0 then
 5:    WriteMax(max_k, v′)
 6:    if k > 0 then
 7:      …                                 ▹ helping
 8:      if switch_{k−1} = 0 then
 9:        hCount ← hCount + 1             ▹ helping
10:        …                               ▹ helping (write to H[i])
11:        Write(switch_{k−1}, 1)
12:  if last_i < k then last_i ← k
13:  return

function Read()
14:  c ← 0                                 ▹ helping, c is a local variable
15:  while switch_{last_i} = 1 do
16:    last_i ← last_i + 1
17:    c ← c + 1                           ▹ helping
18:    if (c mod (n + 2)) = 0 and (hval ← GetHelp(c)) > 0 then return hval   ▹ helping
19:  v′ ← ReadMax(max_{last_i})
20:  return m · last_i + v′

(Lines marked ▹ helping belong to the helping mechanism of Sect. 3.3; "…" marks helping code omitted here.)

The Write function To write value v, process i first computes the index k of the bounded max register to write to and the residue v′ to be written to it (Lines 2–3).
Here and in what follows, the residue v′ of v is the remainder of the division of v by m. Next, i checks in Line 4 whether max_k is obsolete. We say that a (bounded) max register is obsolete if its corresponding switch is set, indicating that values were already written to max registers with higher indexes and thus max_k should no longer be accessed. If max_k is obsolete, i does not need to write to it, so it proceeds to Line 12 to increase its last index, if required, and returns. Otherwise, max_k is not obsolete, so i writes the residue v′ to it (Line 5). If the max register written to is not the first (Line 6), then i ensures that the previous max register is obsolete (Lines 8–11), updates its last index (Line 12), if required, and returns.

The Read function Process i scans the switches in increasing order in Lines 15–16, increasing the value of its last index in the process, until it finds the first non-obsolete bounded max register (which might never happen). If it does, it reads the maximum residue previously written to that max register (Line 19), adds to the residue the multiple of m corresponding to the index of that max register, and returns the sum (Line 20).
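The Write and Read procedures just described can be condensed into a single-process sequential Python sketch. This model ignores concurrency entirely: each max_j is a plain integer standing in for an m-bounded max register, and `last` mimics one process's last_i index. It is meant only to illustrate the index/residue arithmetic and the role of the switches, not the concurrent algorithm itself.

```python
from collections import defaultdict

class UnboundedMaxRegSketch:
    def __init__(self, m):
        self.m = m
        self.maxregs = defaultdict(int)   # max_j: one m-bounded max register per index
        self.switch = defaultdict(int)    # switch_j bits; 1 means max_j is obsolete
        self.last = 0                     # this process's last_i index

    def write(self, v):
        k, res = divmod(v, self.m)        # index k and residue v'
        if self.switch[k] == 0:           # max_k not yet obsolete
            self.maxregs[k] = max(self.maxregs[k], res)   # WriteMax(max_k, v')
            if k > 0 and self.switch[k - 1] == 0:
                self.switch[k - 1] = 1    # make the previous register obsolete
        self.last = max(self.last, k)     # update the last index if required

    def read(self):
        while self.switch[self.last] == 1:    # scan past obsolete registers
            self.last += 1
        return self.m * self.last + self.maxregs[self.last]
```

For instance, with m = 4, after writes of 1, 2 and 5, register max_1 holds residue 1, switch_0 is set, and a read returns 4 · 1 + 1 = 5.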

Linearizability
The correctness of Algorithm 1 is guaranteed only in executions in which the max register's value is increased in bounded increments. This requirement is formalized by the following definition.

Definition 1 (ℓ-bounded-increment execution)
Let M be an UnboundedMaxReg object and let e be an execution. e is an ℓ-bounded-increment execution for M if, for every Write operation on M invoked in e with input value v > ℓ, there exists a Write operation on M with input value v′ satisfying v − ℓ ≤ v′ < v that precedes it.

Section 4 presents an n-process unbounded counter implementation that uses UnboundedMaxReg objects. As we prove, all the executions of that implementation are n-bounded-increment executions for all the underlying unbounded max registers.
Let m ≥ n, let M be an UnboundedMaxReg_m object implemented by Algorithm 1, and let e be a finite n-bounded-increment execution for M. The next lemma is a direct consequence of Definition 1.

Lemma 1 Let op be a Write operation on M whose input v satisfies ⌊v/m⌋ = k > 0. Then there exists a Write operation op′ that precedes op and whose input v′ is such that ⌊v′/m⌋ = k − 1.
Proof Let op_0 be a Write operation on M and let v_0 be its input value. Let us assume that ⌊v_0/m⌋ = k > 0. Since e is an n-bounded-increment execution for M, there exists a Write operation op_1 on M that precedes op_0 and whose input v_1 satisfies v_0 − n ≤ v_1 < v_0 (Definition 1). Repeating this argument yields a sequence op_0, op_1, op_2, … of Write operations, each preceded by the next, whose input values decrease by at most n between consecutive operations. Hence, there must exist a Write operation op′ that precedes op_0 and whose input v′ is such that ⌊v′/m⌋ = k − 1: for example, op′ can be taken as the operation op_i with the smallest index i in the sequence such that v_i < m · k. Indeed, since v_{i−1} ≥ m · k and v_i ≥ v_{i−1} − n ≥ m · k − m, we have ⌊v_i/m⌋ = k − 1.

If a switch is set in e, let K be the largest index of the switches that are set in e. Since e is finite, there are finitely many operations that are invoked in e. As each operation on M sets at most one switch, K is well-defined. Otherwise, let K = −1. For each k, 0 ≤ k ≤ K, let s_k denote the step in which 1 is written to switch_k (at Line 11) for the first time. We observe that switches are set in order:

Lemma 2 For every k, 0 < k ≤ K, step s_{k−1} precedes step s_k.

Proof Let k > 0. By the code, the value of switch_k is changed to 1 by a Write operation op (at Line 11) on the unbounded max register M whose input v is such that ⌊v/m⌋ = k + 1. By Lemma 1, op is preceded by a Write operation op′ with input value v′ satisfying ⌊v′/m⌋ = k. Since op′ precedes op and switch_k is set for the first time during the execution of op, the value of switch_k is 0 when it is read by op′ (at Line 4). As k > 0, Lines 6–11 are performed by op′. In particular, if the value of switch_{k−1} is not already 1, it is changed to 1 by op′ at Line 11. It thus follows that s_{k−1} precedes s_k.
We observe that when a Write operation on M whose input v is such that ⌊v/m⌋ = k > 0 completes, the max register max_{k−1} has become obsolete before the operation terminates.

Observation 1 Let op be a completed Write operation on M whose input v satisfies ⌊v/m⌋ = k > 0. Then switch_{k−1} is set before op terminates.
Proof Let op be a completed Write operation on M and let v be its input. Let us assume that k = ⌊v/m⌋ > 0. Since ⌊v/m⌋ = k, switch_k is read by op on Line 4. If this read returns 1, the observation follows by Lemma 2. Otherwise, since k > 0, Lines 6–11 are executed during op. In particular, if switch_{k−1} is not already set (Line 8), 1 is written to it by op in Line 11. Hence switch_{k−1} is set before the end of op.
We say that a bounded max register max_k is active during the interval in which its associated register switch_k is not set, but switch_{k−1} is. We define intervals I_0, …, I_{K+1} in which the bounded max registers max_0, …, max_{K+1} are active:
- Interval I_0 starts with the beginning of e and ends immediately before s_0.
- For k, 1 ≤ k ≤ K, interval I_k begins with s_{k−1} and ends immediately before s_k.
- Interval I_{K+1} begins with s_K and ends with the last step of e.
By Lemma 2, these intervals are well defined, in the sense that their beginning precedes their end. Note also that I_0, …, I_{K+1} form a partition of e. We observe that each Read or Write operation on M accesses at most one of the bounded max registers max_0, max_1, … (in Line 5 for a Write operation, and in Line 19 for a Read operation). For k ≥ 0, let A_k be the set of operations on M in e that access the bounded max register max_k (by performing a WriteMax in Line 5 in the case of a Write operation or a ReadMax in Line 19 in the case of a Read). Let also B be the set of Write operations in e that perform a read of switch_k at Line 4, for some k ≥ 0, where that read returns 1.
As e is a finite n-bounded-increment execution for M, the sets A_k are empty for large enough values of k:

Lemma 3 For every k ≥ K + 3, A_k = ∅.

Proof The proof is similar to the proof of Lemma 2. Let k ≥ K + 3 and let us assume towards a contradiction that there is an operation op ∈ A_k. Since the largest index of a switch that is set is K, and for a Read operation to access the bounded max register max_k, switch_{k−1} has to be set, op is not a Read operation. Hence, op is a Write operation, and as it accesses the bounded max register max_k, its input value v satisfies ⌊v/m⌋ = k. By Lemma 1, op is preceded by a Write operation op′ whose input v′ satisfies ⌊v′/m⌋ = k − 1. When op′ terminates, it follows from Observation 1 that switch_{k−2} is set. Since k − 2 ≥ K + 1, a switch with index strictly larger than K is set in e, contradicting the definition of K.
Before defining linearization points, we show that each operation in A_k performs (at least) some of its steps while the bounded max register max_k is active. In what follows, I_op denotes the execution interval of operation op.

Lemma 4
For every k, 0 ≤ k ≤ K + 1, and for every operation op ∈ A_k, I_k ∩ I_op ≠ ∅.
Proof Let k, 0 ≤ k ≤ K + 1, and let op be an operation in A_k. The proof is divided into three cases according to the value of k.

k = 0. For op to access the bounded max register max_0, it has to read 0 from switch_0 (at Line 4 for a Write operation or at Line 15 for a Read operation). This step occurs after the beginning of e and before switch_0 is set, as once the value of a switch is changed to 1, it never changes. It thus follows that I_op ∩ I_0 ≠ ∅.

1 ≤ k ≤ K. This case is similar to the previous one.

k = K + 1. op has to read 1 from switch_K and 0 from switch_{K+1} in order to access the bounded max register max_{K+1}. This latter read occurs between s_K and the end of e. If op is a Write operation that accesses max_{K+1} and does not terminate, its execution interval extends to the end of e and thus intersects I_{K+1}. Otherwise, as seen in the previous case, when op terminates switch_K has been set (by op itself or by another operation). Hence, I_op contains s_K or a step performed after s_K, and thus I_op ∩ I_{K+1} ≠ ∅.
To show that M is linearizable in e, we rely on a linearization μ of the ReadMax and WriteMax operations performed on the bounded max registers max_0, …, max_{K+1}. As linearizability is composable and a linearizable bounded max register can be implemented from read-write registers [1], μ exists.
We next define the linearization points of the operations in ⋃_{0≤k≤K+1} A_k ∪ B. Each operation in A_k, 0 ≤ k ≤ K + 1, is linearized within the interval I_k, following the order in μ of the ReadMax or WriteMax operation it performs on max_k. Each operation in B is linearized at the step in which it reads 1 from a switch at Line 4. By Lemma 4, the linearization points of all operations op in A_k, for k, 0 ≤ k ≤ K + 1, are well defined since the intersection between the execution interval of op and I_k is not empty. Hence, each operation in ⋃_{0≤k≤K+1} A_k ∪ B is linearized within its execution interval.

Lemma 5 The set ⋃_{0≤k≤K+1} A_k ∪ B contains every completed operation in e.
Proof If all operations invoked in e are in ⋃_{0≤k≤K+1} A_k ∪ B, then clearly the lemma is true. Assume, then, that there are operations invoked in e that are not in the set ⋃_{0≤k≤K+1} A_k ∪ B, and let op be such an operation. We show that op does not complete in e. If op accesses a bounded max register, it belongs to A_{K+2} by Lemma 3. op cannot be a Read operation, as a Read that accesses max_{K+2} must read 1 from switch_{K+1} (Line 15), but this switch is never set in e. op is thus a Write operation. By Observation 1, when a Write operation accessing max_{K+2} terminates, switch_{K+1} has been set. As switch_{K+1} is never set in e, op does not terminate.
It remains to examine the case in which op does not access any bounded max register. If op is a Write operation, since it is not in set B, it has not read a switch in e, or has read 0 from some switch but has not yet performed a WriteMax to the corresponding bounded max register when e ends. In both cases, op does not terminate in e. If op is a Read operation, it has only read 1 from the switches it has accessed in e, or it has read 0 from some switch but has not yet performed a ReadMax to the corresponding bounded max register when e ends. Hence op does not terminate in e.
Finally, we show that the linearization is consistent with the sequential specification of a max register.

Lemma 6
Let op ∈ ⋃_{0≤k≤K+1} A_k be a Read operation that returns a value v.
- If v = 0, there is no Write operation with an input ≠ 0 that is linearized before op.
- If v ≠ 0, the largest input value of the Write operations linearized before op is v.
Proof Let op be a Read operation that returns v. Let u, k be integers such that v = k · m + u and 0 ≤ u < m. Since op returns v, it follows from the code that it has performed a ReadMax on the bounded max register max_k that returned u (Lines 19–20). Hence, op belongs to A_k, for some k ∈ {0, …, K + 1}.
If v = 0, we have k = u = 0. Let us assume that there is a Write operation op′ linearized before op, and let v′ be its input. We first note that op′ cannot belong to B. Indeed, any operation in B reads 1 from a register switch_{k′}, for some k′ ≥ 0, and is thus linearized after this switch has been set to 1, which occurs after I_0 by Lemma 2 (recall that as op ∈ A_0, its linearization point is in the interval I_0). Hence any operation in B is linearized after op.

op′ thus belongs to A_{k′} for some k′ ≥ 0. Since every operation in A_{k′}, for every k′ > 0, is linearized after the operations in A_0, op′ must be in A_0. As op′ is linearized before op, its associated WriteMax operation on max_0 is linearized in μ before the ReadMax performed by op. Since op reads u = 0 from the max register max_0, the value written by the WriteMax of op′ is also 0. It thus follows that the input value v′ of op′ is v′ = 0 · m + 0 = 0.
We now consider the case v ≠ 0. We first show that there is a Write operation with input v that is linearized before op. We then establish that any Write operation with input strictly greater than v (if any) is linearized after op.
1. The first part is divided into two cases, depending on the value of u = v mod m.

u = 0. Note that, as v ≠ 0, k > 0. Thus op performs a ReadMax on the bounded max register max_k which returns 0. By the code, this happens after op has read 1 from the register switch_{k−1}. Let op′ be the Write operation that sets this switch to 1, and let v′ be its input value. Before writing 1 to switch_{k−1}, op′ has performed a WriteMax on the bounded max register max_k (Line 5) with some input value u′. Since the ReadMax on max_k performed by op returns 0 and follows this WriteMax in μ, u′ = 0 and hence v′ = k · m + 0 = v. Since both op and op′ perform an operation on max_k, they belong to A_k. As the WriteMax by op′ precedes the ReadMax by op in μ, op′ is linearized before op.

u ≠ 0. Recall that u is the value read from the bounded max register max_k by op. Hence, there is a WriteMax operation on max_k with input u that is linearized before this ReadMax in μ. By the code (Line 5), the WriteMax operation with input u is performed within a Write operation with input v′ = k · m + u = v. Let op′ be this operation. Since op′ accesses max_k, it belongs to A_k. Since its WriteMax is linearized in μ before the ReadMax of op, op′ is linearized before op.

2. Let op′ be a Write operation with input v′ > v, and let k′ = ⌊v′/m⌋; note that k′ ≥ k. If op′ belongs to B, it is linearized after switch k has been set to 1. By Lemma 2, this occurs after the interval I k in which op is linearized. Otherwise op′ ∈ A k′ . If k′ > k, op′ is linearized after op. If k′ = k, both op and op′ access the bounded max register max k , and are thus linearized in the interval I k . As v′ > v, the input u′ = v′ mod m written to max k by op′ is strictly larger than the value u = v mod m read by op. Hence the ReadMax to max k by op must be linearized in μ before the WriteMax performed by op′. Thus op is linearized before op′.

Lemma 7 Algorithm 1 without the helping mechanism is lock-free and Write operations are wait-free.
Proof Write operations perform a single invocation of the wait-free WriteMax operation and a constant number of additional steps, hence they are wait-free. A Read operation may loop forever in Lines 15-16, searching for a non-obsolete max register, but only if Write operations keep making additional max registers obsolete (in Line 11), hence more and more Write operations complete. If no more Write operations complete, each Read operation is guaranteed to complete.

Step Complexity Analysis
The step complexity analysis provided in this section relates to the implementation of Algorithm 1 without the helping mechanism. Recall that OP(e) denotes the set of all operations that are invoked in e. Let OP R (e) (resp. OP W (e)) denote the set of all Read (resp. Write) operations in OP(e).

AmtSteps(e) = O( (w · log m + r · log m + (r + Σ_{i∈P} last i )) / (w + r) ),

where w = |OP W (e)| and r = |OP R (e)|.
Assume that max register max k , for k > 0, is accessed in e. Since e is an n-bounded-increment execution and all max j registers are m-bounded, at least m · k / n Write operations have been linearized prior to this access. Letting L = max_{i∈P} last i denote the maximum value of all last i variables at the end of e, we get that w ≥ (L − 1) · m/n. Furthermore, Σ_{i∈P} last i ≤ n · L. Thus the lemma now follows, since r > 0 and m ≥ n² hold.

Theorem 1 Algorithm 1 is a linearizable implementation of an unbounded max register with amortized step complexity of O(log m) in any n-bounded-increment execution, if m ≥ n 2 . The algorithm (without the helping mechanism) is lock-free.
Proof For any finite execution e, we define linearization points for a subset (namely, ⋃_{0≤k≤K+1} A k ∪ B) of the operations on M invoked in e. By Lemma 5, this set includes every operation on M that completes in e. By definition, each linearization point is within the execution interval of the corresponding operation. The values returned in a sequential execution in which the linearized operations are performed in the order of their linearization points are consistent with the semantics of max registers (Lemma 6). Algorithm 1 is thus a linearizable implementation of a max register. By Lemma 7, it is lock-free, and by Lemma 8, its amortized step complexity in n-bounded-increment executions is O(log m), provided that m ≥ n².

The helping mechanism
We now explain the helping mechanism that makes Algorithm 1 wait-free (presented in the colored lines of that algorithm). Helping is done via a shared array H of n entries, one per process, to which processes write triplets of values; a Read operation may obtain from H a value of M that it can return. If no suitable value is found in H, the first element of each triplet is used to determine the index of a bounded max register that has not yet been made obsolete by Write operations. The cost of looking for help is O(n), since it consists essentially of reading the n entries of the array. As looking for help is performed every Θ(n) steps, helping does not increase the amortized step complexity of Read operations.
A more detailed explanation follows. Helping is attempted by process i inside Write operations, just before i is about to make another max register obsolete. Specifically, if i just wrote to a max register max k with k > 0 (Lines 5-6), it reads the maximum residue written so far to max k−1 , computes the corresponding value of M based on it, and stores it in a local variable curMax (Line 7). If switch k−1 is 0 (Line 8), then max k−1 must be made obsolete. As we prove, in this case, curMax was indeed a value of M at some point during the execution interval of this Write operation. Process i increments hCount in Line 9, helps by writing the triplet of values to H[i] (Line 10), and then sets switch k−1 in Line 11.
The goal of the helping mechanism is to ensure that every Read operation eventually completes. Every n + 2 iterations of the while loop of Lines 15-18, the GetHelp utility function is called, receiving an integer that is a multiple of n + 2 and indicates whether or not this is its first invocation by the current Read operation (Line 14, Lines 17-18). If GetHelp returns a positive value then, as we prove, this value was indeed M's value at some point during the execution interval of this Read operation, so the Read operation returns this value in Line 18. Otherwise, the search for a non-obsolete max register is resumed. The number of iterations, n + 2, to be performed before looking for help is chosen so that (as we prove) the worst-case complexity of Read operations is Θ(n).

Algorithm 2 The GetHelp utility function, code for process i.
The pseudo-code of GetHelp is presented in Algorithm 2, described next. In its first invocation by a Read operation op (performed by some process i), initialization is done by copying the H array to an array H i , which is used only by process i (Lines 22-24). In the first invocation, 0 is returned (Line 30), indicating that a maximum value is not yet available. Before returning 0, the loop of Lines 28-29 is performed. In each iteration of this loop, process i increases last i (in Line 29) if the first triplet-value it reads is larger than the current value of last i , establishing that the max register indexed by the current value is obsolete. This is required for bounding the worst-case step complexity of the algorithm.
In each subsequent invocation of GetHelp (Lines 25-27), if any, i checks, for each j, if j updated the second triplet-value of H[ j] at least twice since op was invoked. If this is the case then, as we prove, the last maximum value written by j was indeed M's value at some point during op's execution interval, so GetHelp returns it in Line 27 and then op returns this value in Line 18 of Algorithm 1. If the condition of Line 27 is not satisfied in any iteration, op updates last i (if required) in the loop of Lines 28-29 and returns 0 in Line 30, signifying that i was not helped.
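The publish/scan handshake between writers and helped readers can be sketched sequentially as follows (hypothetical names; the real algorithm interleaves these steps concurrently, and the triplet layout follows the description above):

```python
# Sequential sketch of the helping handshake (hypothetical names).
# Writers publish (index, count, value) triplets to H before making a
# max register obsolete; a reader is helped once some writer has
# published at least twice since the reader's initial snapshot.

N = 4  # number of processes (assumption for the sketch)
H = [(0, 0, 0) for _ in range(N)]  # shared array of triplets

def publish(i, k, value):
    """Process i records a helping triplet before setting switch k-1."""
    _, count, _ = H[i]
    H[i] = (k, count + 1, value)

def get_help(snapshot):
    """Return a published value if some process published at least twice
    since `snapshot` was taken, else 0 (mirroring GetHelp's Line 30)."""
    for j in range(N):
        _, count, value = H[j]
        if count >= snapshot[j][1] + 2:
            return value
    return 0

snap = list(H)               # reader's first GetHelp invocation copies H
publish(1, 3, 42)            # process 1 helps once ...
assert get_help(snap) == 0   # ... which is not enough yet
publish(1, 4, 57)            # ... and a second time
assert get_help(snap) == 57  # the reader can now return 57
```

The two-publications condition mirrors Line 27: a single publication may straddle the reader's snapshot, but a second one is guaranteed to lie entirely within the reader's execution interval.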

Linearizability and complexity of the full algorithm
In this section we prove that the algorithm with the helping mechanism (henceforth, the full algorithm) is linearizable. Let e be a finite execution. We partition the Read and Write operations of e as we did in Sect. 3.1, except that now a Read operation may complete without accessing a bounded max register. Indeed, a Read operation may return in Line 18 of Algorithm 1 after being helped. Let us denote the set of such operations by H. Let op r be a Read operation in H by process i that returns value u, and let k = ⌊u/m⌋. Then there is a Write operation by a process j ≠ i, concurrent with op r , that wrote u to H[j] (in Line 10 of Algorithm 1) after performing a ReadMax operation on max k (in Line 7 of Algorithm 1), and op r returns value u after reading it from H[j] (in Line 27 of GetHelp). We say that op r is associated with that ReadMax operation.
Sets B and A k , k ≥ 0 are defined as in Sect. 3.1, except that for each k ≥ 0, set A k contains in addition the operations in H whose associated ReadMax is performed on max k .
Linearization points are defined in the same way as in Sect. 3.1. Let μ be a linearization of the operations performed in e on the bounded max registers max k , k ≥ 0. Each Write operation op ∈ A k performs a WriteMax on the bounded max register max k at Line 5. We say that op is associated with this WriteMax. Similarly, each Read operation op ∈ (A k \ H) performs a ReadMax on max k at Line 19. We say that op is associated with this ReadMax. Each operation op ∈ A k is thus associated with an operation performed on max k . The linearization point λ op of each operation op ∈ A k is chosen in the interval I k ∩ I op . Furthermore, as for the lock-free algorithm, for any two operations op, op′ ∈ A k , if in μ the operation op is associated with is linearized before the one op′ is associated with, we choose λ op and λ op′ such that λ op precedes λ op′ .
It is easily verified that Lemmas 1, 2, and Observation 1 hold also for the full algorithm. Recall that K is the largest index of the switches that are set in e. The following lemmas extend Lemmas 3 and 4 to take into account the operations in H.

Lemma 9 For every k > K + 1, A k contains no operation of H.
Proof Let op r be a Read operation in H and let v be the value it returned. Let k = ⌊v/m⌋. op r is thus associated with a ReadMax operation op rm performed on the bounded max register max k . This latter operation is executed by some process j while it is performing a Write operation on M with some input w. From the code (Line 7), we have ⌊w/m⌋ = k + 1. By Lemma 1, this Write operation is preceded by another Write operation op w on M whose input w′ satisfies ⌊w′/m⌋ = k + 1 − 1 = k. Moreover, when op w terminates, we have switch k−1 = 1 (Observation 1). Hence k − 1 ≤ K and the lemma follows.

Lemma 10 Let op ∈ H and let k be the index of the bounded max register max k on which the ReadMax operation it is associated with is performed. Then I k ∩ I op ≠ ∅.
Proof Let op r be a Read operation on M that returns a value v on Line 18 after having received help, and let i be the process that performs op r . i has thus read value v from some entry H[j] of the shared array H. v is computed from the value u obtained by process j after performing a ReadMax on a bounded max register max k (Line 7). More precisely, we have v = u + k · m, and the ReadMax by j that returns u is the operation op r is associated with. We denote this ReadMax operation by op rm . By the condition on Line 27, j has updated H[j] at least once since the beginning of op r and before writing v to H[j]. By the code, op rm thus takes place between j's previous update of H[j] and its write of v to H[j]. Hence, op rm is performed in its entirety within the execution interval of op r .
Since, before writing v to H[j] and after performing op rm , j checks that switch k is still not set (Lines 8-10), op rm completes before s k . If k = 0, op rm takes place in I 0 and hence I op r ∩ I 0 ≠ ∅, as required. Let us assume that k > 0. op rm is executed while j is performing a Write op w on M with some input w, with ⌊w/m⌋ = k + 1. By Lemma 1, this Write is preceded by another Write on M with input w′ such that ⌊w′/m⌋ = k > 0. Hence, by Observation 1, s k−1 occurs before op w starts. Therefore, op rm is performed after s k−1 and before s k . It thus follows that I op r ∩ I k ≠ ∅.
In the full algorithm, operations terminate either after completing Line 12 in the case of a Write operation, Line 20 in the case of a Read operation that does not receive help, or Line 18 in the case of a Read operation that receives help. For the first two cases, Lemma 5 shows that these operations are contained in the set ⋃_{0≤k≤K+1} A k ∪ B. By definition, H is the set of the Read operations that terminate after having received help. Each operation op in H is also contained in some set A k (where k is the index of the bounded max register on which the ReadMax op is associated with is performed). By Lemma 9, for every k > K + 1, no operation in H is contained in A k . Therefore, Lemma 5 holds also for the full algorithm after the operations in H have been included in the sets A k , k ≥ 0.
We now extend Lemma 6 to Read operations that complete after having been helped. If v = 0, k = u = 0. Hence, any Write operation op w linearized before op is contained in A 0 , and the input of the WriteMax performed on max 0 by op w must be 0. It thus follows that the input of op w is 0 · m + 0 = 0.
Otherwise, v ≠ 0. We show (1) that there is a Write operation with input v that is linearized before op, and (2) that any Write operation with input v′ > v (if any) is linearized after op.
1. We consider two cases depending on the value of u = v mod m.
Case u = 0. As v ≠ 0, k > 0. By the code, op rm is performed within a Write operation op′ whose input v′ satisfies ⌊v′/m⌋ = k + 1. It follows from Lemma 1 that op′ is preceded by a Write operation op″ whose input v″ satisfies ⌊v″/m⌋ = k > 0. By Observation 1, switch k−1 has been set before op″ terminates, and thus also before op rm is performed. Let op w be the Write operation that sets switch k−1 to 1 and let v w be its input value. Note that ⌊v w /m⌋ = k, and by the code, op w writes u w = v w mod m to max k (Line 5) before setting switch k−1 to 1 (Line 11). op w is thus contained in A k and, as the WriteMax it performs on max k precedes op rm , it is linearized before op. Moreover, as op rm returns 0, u w = 0. Hence, the input of op w is v w = k · m = v, as required.
Case u ≠ 0. Since op rm returns u, and the initial value of max k is 0, there is a WriteMax operation op wm on max k with input u that is linearized before op rm . By the code, op wm is performed within a Write operation op′ (on Line 5) whose input is v′ = k · m + u = v. op′ is thus contained in A k and, as op wm is linearized before op rm , op′ is linearized before op.
2. Let op′ be a Write operation with input v′ > v, and let k′ = ⌊v′/m⌋. Note that k′ ≥ k. If op′ is contained in B, it is linearized after switch k is set. This occurs after I k (Lemma 2), which is the interval that contains the linearization point of op. Otherwise op′ ∈ A k′ . If k′ > k, op′ is linearized after op. If k′ = k, op′ performs a WriteMax on max k with input u′ = v′ mod m.
As v′ > v, u′ > u. As op rm is performed on the same bounded max register max k and returns u, op rm is linearized before that WriteMax. Therefore op′ is linearized after op.

Claim 1 If an infinite and monotonically increasing sequence of values is written to M, then some process performs Line 10 of Algorithm 1 infinitely often.
Proof If an infinite and monotonically-increasing sequence of values is written to M, then max registers are made obsolete infinitely often. Since a max register is only made obsolete in Line 11 of Algorithm 1, it is immediate from the code that Line 10 of that algorithm is performed infinitely often as well. Since the number of processes is finite, it follows that some process performs that line infinitely often.

Lemma 12 The full Algorithm 1 is wait-free.
Proof As proven in Lemma 7, the algorithm is lock-free and Write operations are wait-free. It remains to show that Read operations are wait-free as well. Assume for contradiction that there is an infinite execution e in which a Read operation op takes infinitely many steps. If there is no infinite monotonically-increasing sequence of values that is written to M then, starting from some point in e, M's value does not increase. The set of obsolete max registers stops growing, hence op eventually reaches a non-obsolete max register and completes.
Otherwise, there is such a sequence of monotonically increasing values. From Claim 1, there is some process j that performs Line 10 of Algorithm 1 infinitely often. Thus, op eventually evaluates the condition on Line 27 of Algorithm 2 as true and is therefore able to terminate.

Claim 2 Let e be an execution in which max register max k becomes obsolete. Then, after s k , the array H contains a triplet whose first value is at least k.
Proof Immediate from Lemma 2 and from the fact that each process i writes to H[i] (in Line 10) a triplet of values whose first component is k just before writing 1 to switch k−1 (in Line 11).

Claim 3 Let e be an execution, and let s and s′ be two steps in e between which at least j max registers are made obsolete. Then at least j − 1 different writes to the H array (in Line 10) are performed between the steps s and s′.
Proof From Lines 10-11, each Write operation by a process i writes to H[i] (in Line 10) immediately before writing 1 to switch k−1 (in Line 11). Thus, between two consecutive executions of Line 11 by any process i, i writes to H[i]. Let k be the smallest index of a max register that has not been made obsolete in step s or before. From Lemma 2, the steps s k , . . . , s k+j−1 in which the max registers k, . . . , k + j − 1 are made obsolete occur in this order in e, and before s′. Moreover, any process i that is about to perform Line 11 after s is about to write to switch k or to a switch with a smaller index. It follows that at least one distinct write to an entry of H is performed after s ℓ and before s ℓ+1 , for each k ≤ ℓ < k + j − 1.
Hence there are at least j − 1 such writes.

Theorem 2 The full Algorithm 1 is a wait-free linearizable implementation of an unbounded max register with amortized step complexity of O(log m) in any n-bounded-increment execution, if m ≥ n². The worst-case step complexity of its Write operations is O(log m) and that of its Read operations is O(n).

Proof For any execution e of the full algorithm, we defined linearization points for a subset of the operations that are invoked on M in e. This subset includes every operation on M that completes in e (Lemma 5, which holds for the full algorithm). Lemmas 4 and 10 ensure that each linearization point is within the execution interval of the corresponding operation. The order of the operations induced by their linearization points is consistent with the semantics of max registers (Lemmas 6 and 11). The full algorithm is thus linearizable, and it is wait-free by Lemma 12. It remains to argue about its complexity. In Algorithm 2, every iteration of each of the for loops performs a constant number of steps, hence every invocation of GetHelp performs O(n) steps. The worst-case step complexity of a Write operation is logarithmic in m, since it applies at most a single WriteMax operation (in Line 5) and at most a single ReadMax operation (in Line 7), plus a constant number of additional steps. As we choose m = n², it is also logarithmic in n. It remains to prove that the worst-case complexity of Read operations is linear.
Let op be a Read operation by process i. We establish that GetHelp is invoked at most twice by op. We do so by proving that after op invokes GetHelp twice, it completes by returning in Line 18.
In the first invocation of GetHelp, let s be the last step performed by i in the first loop (Lines 22-24). Let α denote the smallest index of a bounded max register that has not been made obsolete before s. Let α′ denote the value of last i when this first invocation returns. From Claim 2, and because in the first invocation of GetHelp last i is updated in the loop of Lines 28-29 after s, we have α′ ≥ α − 1.
After the first invocation returns, and before invoking GetHelp for the second time, i reads 1 from the n + 2 switches switch α′ , . . . , switch α′+n+1 . Let s′ denote the first step by i in the second invocation of GetHelp. Because α′ ≥ α − 1, at least n + 1 bounded max registers max α , . . . , max α+n are made obsolete between s and s′. It thus follows from Claim 3 that at least n different writes are made to the array H between s and s′.
Consequently, there exists a process ℓ ≠ i that updates H[ℓ] (in Line 10) at least twice between s and s′. It follows from the fact that process ℓ increments its second triplet-value before each such update (in Line 9) that, in iteration ℓ of the loop of Lines 26-27 of the second invocation of GetHelp, the condition of Line 27 is satisfied. Hence GetHelp returns a non-zero value and op completes in Line 18.
To conclude, we observe that in each invocation of GetHelp, O(n) steps are performed. Moreover, between two invocations of GetHelp, or between the beginning of a Read operation and its first invocation of GetHelp (if any), the Read operation performs O(n) steps. As GetHelp is invoked at most twice by any Read operation, the worst-case complexity of Read operations is O(n).

Wait-free counter with polylogarithmic amortized step complexity
Algorithm 3 presents a wait-free recursive construction of a linearizable counter that has polylogarithmic amortized step complexity in all executions, regardless of their length. The algorithm is essentially the same as the (non-recursive) counter construction of Aspnes et al. [1], except that the latter uses the max registers of [1], whose amortized step complexity is linear in sufficiently long executions, whereas ours uses our wait-free unbounded max registers. Let C j denote a counter, shared by n processes, implemented by Algorithm 3. For simplicity and without loss of generality, assume in the following that each of n and j is an integral power of 2. C j 's value is stored in an n-process wait-free unbounded max register R, which is of type UnboundedMaxReg n² . If j > 1 holds, then C j also contains two C j/2 child counters, left and right. A counter C n serves as the root of a tree of counters, and all processes can invoke Inc operations on C n . At the bottom layer of the tree, each process i is associated with a single C 1 leaf-counter on which only i can invoke Inc operations.
To read C j , process i simply invokes a Read operation on C j 's R object and returns the response (Line 11). Incrementing a C 1 object consists of simply reading R and writing to it a value larger by one (Lines 3-11). To increment a C j counter, for j > 1, process i increments either the left or the right child counter, depending on whether its C 1 leaf-counter is in the left or the right subtree of C j , reads the values of both child counters, and writes their sum to R (Lines 6-9). Observe that at most j distinct processes can invoke Inc operations on any specific C j counter.
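The recursion above can be sketched sequentially (an illustration under our own naming, not the paper's pseudocode: a plain integer plays the role of the wait-free max register R, and no concurrency is modeled):

```python
# Sequential sketch of the C_j counter tree (illustration only: a plain
# int stands in for the wait-free UnboundedMaxReg object R).

class Counter:
    def __init__(self, width):
        self.width = width      # number of processes sharing this node
        self.R = 0              # stands in for the max register R
        if width > 1:
            self.left = Counter(width // 2)
            self.right = Counter(width // 2)

    def inc(self, i):
        """Increment on behalf of process i (0 <= i < width)."""
        if self.width == 1:
            self.R += 1         # leaf: only process i increments it
        else:
            half = self.width // 2
            child = self.left if i < half else self.right
            child.inc(i % half)
            # read both children and write their sum to R (a WriteMax)
            self.R = max(self.R, self.left.read() + self.right.read())

    def read(self):
        return self.R

c = Counter(4)          # root of a 4-process counter tree
for i in range(4):
    c.inc(i)
c.inc(0)
assert c.read() == 5
```

In the concurrent algorithm the `max(...)` update is a single WriteMax, so a slow writer can never decrease R; the sequential sketch only illustrates how the tree of sums produces the counter's value.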
In the following proofs we let C denote a C n object implemented by Algorithm 3 and e be an execution of C. Algorithm 3 An n-process counter C j , code for process i.

Lemma 13 The C j counter implementation of Algorithm 3 is linearizable.
Proof The proof is by induction on j.

Base case: For j = 1, the UnboundedMaxReg object R of a C 1 counter may only be incremented by a single process. Since R's value is always increased by exactly 1, the execution is 1-bounded-increment for R, so the correctness of R follows from Theorem 2. Increment operations on C 1 are linearized when the Write operation invoked in Line 11 is linearized, and Read operations on C 1 are linearized when the Read operation invoked in Line 11 is linearized.

Induction hypothesis: For all k < j, C k is a linearizable counter and the value of its max register R is never increased by a Write operation by more than k.

Inductive Step
Lemma 14 e is a j-bounded-increment execution for C j .R.
Proof The proof is divided into two parts. We first prove the left-hand inequality of Definition 1. Let e′ be a prefix of e immediately after which process p is about to invoke a Write operation op v on C j .R with input v (in Line 9). Let I be the set of Inc operations that have completed on C j in e′. Observe that each operation op ∈ I has performed one Inc operation on either C j .left or C j .right. We partition I accordingly: I = I 0 ∪ I 1 , where for any op ∈ I, op ∈ I 0 if op performed an Inc operation on C j .left and op ∈ I 1 if op performed an Inc operation on C j .right. By IH, both C j .left and C j .right are linearizable counters. Let op 0 ∈ I 0 be the operation whose Inc operation on C j .left is linearized last among all Inc operations on C j .left performed by the operations in I 0 . Let c 0 be the value of C j .left immediately after the Inc operation on that object by op 0 . op 1 and c 1 are defined similarly. From Lines 7-9, for each r ∈ {0, 1}, after performing an Inc operation on either C j .left or C j .right, op r performs Read operations on both C j .left and C j .right before writing the sum u r of the values read to C j .R. We show that v′ := max{u 0 , u 1 } ≥ c 0 + c 1 . Indeed, assume that op 0 's Read operation on C j .right returns a value strictly smaller than c 1 (note that, otherwise, u 0 ≥ c 0 + c 1 ). Then, op 1 's Inc operation on C j .right is linearized after op 0 's Read operation on C j .right. It thus follows that op 1 's Read operation on C j .left starts after op 0 's Inc operation on C j .left has completed. We thus conclude that u 1 ≥ c 0 + c 1 .
Since both op 0 and op 1 have completed in e′, a WriteMax operation on R with value v′ ≥ c 0 + c 1 has completed in e′.
If v ≤ v′, then v − j ≤ v′ and the claim holds. Otherwise, again from Lines 7-9, the operand v of the WriteMax operation op v is the sum of the values v 0 , v 1 returned by the Read operations performed on the counters C j .left and C j .right, respectively. v 0 = c 0 + δ, for δ > 0, implies that there are δ Inc operations on C j .left that have been linearized after the Inc operation on the same counter by op 0 . By the definition of op 0 , these δ operations take place within δ Inc operations on C j that did not complete in e′. The same argument applies to v 1 . Since at most j processes may invoke Inc operations on C j , there are at most j incomplete Inc operations on C j after e′, and it follows that v = v 0 + v 1 ≤ c 0 + c 1 + j ≤ v′ + j, hence v − j ≤ v′.

We next prove the right-hand inequality of Definition 1. Let op be a WriteMax operation on C j .R with input v > j. The first part above established that there exists a WriteMax operation op′ on C j .R with input v′ that finishes before op starts, such that v − j ≤ v′. Assume that v′ ≥ v. Let O > be the set of WriteMax operations on C j .R that (1) precede op and (2) whose input is larger than or equal to v. We define a partial order ≺ on the operations in O > as follows: W′ ≺ W if W′ finishes before W starts. Let us observe that O > is non-empty and finite. The latter is because e is finite, so only finitely many operations precede op in e; the former follows from the existence of op′. Consider any minimal element of the partially ordered set O > , that is, any operation W such that no operation W′ ∈ O > precedes W. Since O > is finite, there is at least one such operation W. Let in W denote its input. Since W ∈ O > , we have in W ≥ v. Also, by applying the left-hand inequality (proved in the first part of the proof) to W, there exists an operation W′ with input in W′ that precedes W, such that in W′ ≥ in W − j ≥ v − j. As W′ ≺ W and W is a minimal element of O > , it follows that W′ ∉ O > . Since W′ precedes both W and op, we get that in W′ < v, which concludes the proof.
From Lemma 14 and Theorem 2 we conclude that C j .R is linearizable in e. Based on this, the proof proceeds similarly to the proof of [1, Lemma 4].
From IH, the counters C j .left and C j .right are linearizable. We associate with every Inc operation op on C j a value as follows. Let c 0 and c 1 respectively denote the values of C j .left and C j .right immediately after p's increment of C j 's child (corresponding to p's identifier), in Line 6, is linearized. Then we associate with op the value v = c 0 + c 1 . We linearize an Inc operation op, associated with value v, when a value v′ ≥ v is first written to C j .R in Line 9 (either by p or by another process). We linearize a Read operation on C j when it reads C j .R in Line 11.
We now prove that each linearization point is within its operation's execution interval. Consider an Inc operation op associated with value v. A value v′ ≥ v cannot be written to C j .R before op starts because, from the linearizability of C j .left and C j .right, before op starts the sum of these two counters is less than c 0 + c 1 . Since op itself writes a value of at least v to C j .R before it terminates, the linearization point occurs before op terminates. The fact that the linearization point of a Read operation on C j lies within its execution interval follows immediately from the linearizability of C j .R, established by Lemma 14. Finally, the linearization points result in a valid sequential execution, because every Read operation on C j that returns value v is preceded by exactly v Inc operations on C j .

Lemma 15 Algorithm 3 has O(log 2 n) amortized step complexity.
Proof From Algorithm 3 and the fact that C is shared by n processes, every operation on C applies a constant number of ReadMax/WriteMax operations to each of O(log n) different UnboundedMaxReg objects, as the recursive calls in Lines 7-9 and 11 unfold. Letting ncops(e) denote the number of operations that are invoked on C in e, the total number of ReadMax/WriteMax operations on all the implementation's UnboundedMaxReg objects is therefore O(log n · ncops(e)). From Theorem 2, letting m = n², it follows that the total number of steps performed in e is O(log² n · ncops(e)).
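The O(log n) count of max-register operations per Inc can be checked by unfolding the recursion. The sketch below (our own helper, under the assumption that a leaf increment performs one read and one write of R, and that each inner level adds two child reads plus one write) counts the ReadMax/WriteMax calls triggered by a single Inc:

```python
# Count the ReadMax/WriteMax operations one Inc triggers in the counter
# tree of Algorithm 3 (assumed cost model: leaf = 1 read + 1 write of R;
# each inner level = recursive child Inc + 2 child reads + 1 write).
from math import log2

def maxreg_ops_per_inc(width):
    """Max-register operations triggered by one Inc on a width-process counter."""
    if width == 1:
        return 2                          # leaf: read R, write R + 1
    return maxreg_ops_per_inc(width // 2) + 3  # child Inc + 2 reads + 1 write

# The count is 3 * log2(n) + 2, i.e. O(log n), matching the lemma's premise.
for n in (2, 8, 64, 1024):
    assert maxreg_ops_per_inc(n) == 3 * int(log2(n)) + 2
```

Multiplying this O(log n) count by the O(log m) = O(log n) amortized cost of each max-register operation (Theorem 2, with m = n²) gives the O(log² n) bound.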

Lemma 16 For both Inc and Read operations, Algorithm 3 has O(n) worst-case step complexity.
Proof A Read operation on the counter invokes a single Read on an UnboundedMaxReg n² object; hence, from Theorem 2, its step complexity is O(n). It remains to argue about the worst-case step complexity of Inc operations.
Algorithm 3 uses at its root (layer 0 of the counters tree) an unbounded max register for n processes. More generally, each tree layer i ∈ {0, . . . , log n} consists of unbounded max registers for n/2^i processes. Unfolding the recursion of Algorithm 3, an Inc operation I on the root results in the following operations.
- At each of the tree layers, a single Write operation is applied to a single UnboundedMaxReg object. From Theorem 2, the total worst-case step complexity of all these Write operations is O(log² n).
- At most two Read operations are applied to a single UnboundedMaxReg object at each of the tree layers. From Theorem 2, there is a constant c such that the number of steps taken by a Read operation on an UnboundedMaxReg object for j processes is at most c · j. Thus the total number of steps incurred by all the Read operations triggered by I is at most Σ_{j∈{0,...,log n}} 2c · 2^j = O(n).
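The closing geometric sum can be verified numerically (taking c = 1 for concreteness; `read_steps_bound` is our name for the bound):

```python
# Numeric check that sum_{j=0}^{log n} 2c * 2^j grows linearly in n (c = 1).
from math import log2

def read_steps_bound(n, c=1):
    """Upper bound on steps of all Reads triggered by one root Inc."""
    return sum(2 * c * 2 ** j for j in range(int(log2(n)) + 1))

for n in (8, 64, 1024):
    # closed form: 2c * (2^(log n + 1) - 1) = 2c * (2n - 1) < 4cn
    assert read_steps_bound(n) == 2 * (2 * n - 1)
    assert read_steps_bound(n) < 4 * n
```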

Theorem 3 Algorithm 3 is a wait-free linearizable n-process implementation of an unbounded counter with amortized step complexity of O(log 2 n) and worst-case step complexity of O(n).
Proof From Lemma 13, the algorithm is linearizable. By Lemma 12, all the UnboundedMaxReg objects used by Algorithm 3 are wait-free. Therefore, clearly from the pseudo-code, Algorithm 3 is wait-free as well. The claimed amortized and worst-case complexities follow from Lemmas 15 and 16 respectively.
A logarithmic lower bound on the amortized step complexity of implementing an obstruction-free one-time fetch&increment object from read, write and k-word compare-and-swap operations was proved by Attiya and Hendler in [7, Theorem 9]. Their proof can be easily adapted to obtain the following result:

Lemma 17 Any n-process obstruction-free implementation from read/write registers of a counter object has an execution that contains Ω(n log n) steps, in which every process performs a single Inc operation followed by a single Read operation.
Lemma 17 shows that every lock-free read/write counter implementation has an execution whose amortized step complexity is at least logarithmic in the number of processes, showing that our counter algorithm is optimal in terms of amortized step complexity up to a logarithmic factor.

Discussion
In this work, we presented the first lock-free read/write counter algorithm that provides sub-linear amortized step complexity in all executions, regardless of their length. The amortized step complexity of our algorithm is O(log 2 n), where n is the number of processes sharing the implementation. This is optimal up to a logarithmic factor, since there exists a logarithmic lower bound on the amortized step complexity of n-process one-time counters. In contrast, the amortized step complexity of the counter algorithm of [1] deteriorates as the number of Inc operations increases and eventually becomes linear in n.
It is unclear whether there exists a wait-free (or even lock-free or obstruction-free) read/write counter implementation with o(log² n) amortized step complexity. Interestingly, a similar gap between an O(log² n) upper bound and an Ω(log n) lower bound exists for the worst-case step complexity of counters [1].
The space complexity of our counter is infinite, since it uses our unbounded max registers, and each of these encapsulates an infinite number of bounded max registers. Finding a bounded-space read/write counter with sub-linear amortized step complexity is another open question. These questions are left for future work.