
Buffered Partially-Persistent External-Memory Search Trees

Gerth Stølting Brodal, Aarhus University, Denmark; Casper Moldrup Rysgaard, Aarhus University, Denmark; Rolf Svenning, Aarhus University, Denmark
Abstract

We present an optimal partially-persistent external-memory search tree with amortized I/O bounds matching those achieved by the non-persistent $B^{\varepsilon}$-tree by Brodal and Fagerberg [SODA 2003]. In a partially-persistent data structure, each update creates a new version. All past versions can be queried, but only the current version can be updated. Operations should be efficient with respect to the size $N_v$ of the accessed version $v$. For any parameter $0<\varepsilon<1$, our data structure supports insertions and deletions in amortized $\mathcal{O}(\frac{1}{\varepsilon B^{1-\varepsilon}}\log_B N_v)$ I/Os, where $B$ is the external-memory block size. It also supports successor and range reporting queries in amortized $\mathcal{O}(\frac{1}{\varepsilon}\log_B N_v + K/B)$ I/Os, where $K$ is the number of keys reported. The space usage of the data structure is linear in the total number of updates. We make the standard and minimal assumption that the internal memory has size $M\geq 2B$. The previous state-of-the-art external-memory partially-persistent search tree by Arge, Danner and Teh [JEA 2003] supports all operations in worst-case $\mathcal{O}(\log_B N_v + K/B)$ I/Os, matching the bounds achieved by the classical B-tree by Bayer and McCreight [Acta Informatica 1972]. Our data structure successfully combines buffering updates with partial persistence. The I/O bounds can also be achieved in the worst case, by slightly modifying our data structure and under the requirement that the memory size $M=\Omega(B^{1-\varepsilon}\log_2(\max_v N_v))$. For updates, where the I/O bound is $o(1)$, we assume that the I/Os are performed evenly spread out among the updates (by performing buffer overflows incrementally). The worst-case result slightly improves the memory requirement over the previous ephemeral external-memory dictionary by Das, Iacono, and Nekrich [ISAAC 2022], who achieved matching worst-case I/O bounds but required $M=\Omega(B\log_B N)$, where $N$ is the size of the current dictionary.

Keywords and phrases:
B-tree, buffered updates, partial persistence, external memory
Copyright and License:
© Gerth Stølting Brodal, Casper Moldrup Rysgaard, and Rolf Svenning; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Data structures design and analysis
Related Version:
Full Version: https://arxiv.org/abs/2503.08211 [13]
Funding:
All authors are funded by Independent Research Fund Denmark, grant 9131-00113B.
Editors:
Anne Benoit, Haim Kaplan, Sebastian Wild, and Grzegorz Herman

1 Introduction

Developing data structures for storing a set of keys from a totally ordered set subject to insertions, deletions, successor and predecessor queries, and range reporting queries is a fundamental problem in computer science. The classical solution in external memory is the B-tree by Bayer and McCreight [5], which supports all the operations in worst-case $\mathcal{O}(\log_B N + K/B)$ I/Os, where $N$ is the current size of the set, $K$ is the number of reported keys, and $B$ is the external-memory block size. While the B-tree achieves the optimal number of I/Os for queries, for any $0<\varepsilon<1$, the $B^{\varepsilon}$-tree by Brodal and Fagerberg [10] significantly improves update efficiency by attaching buffers to the internal nodes of a B-tree. This design supports updates with amortized $\mathcal{O}(\frac{1}{\varepsilon B^{1-\varepsilon}}\log_B N)$ I/Os. The $\varepsilon B^{1-\varepsilon}$ factor improvement over traditional B-trees is significant when considering typical parameters of, e.g., $\varepsilon=1/2$ and $B=1000$ [4], and the $B^{\varepsilon}$-tree has found important applications in high-performance industry software such as TokuDB [9] and BetrFS [25].
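For concreteness, with these typical parameters the improvement factor evaluates to
\[
\varepsilon B^{1-\varepsilon} \;=\; \frac{1}{2}\cdot 1000^{1/2} \;\approx\; 15.8,
\]
i.e., amortized updates become roughly a factor 16 cheaper than in a B-tree.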

Although the $B^{\varepsilon}$-tree optimizes update efficiency, it is ephemeral, like most dynamic data structures, meaning that each update overwrites the previous version, and only the current version can be queried. In many applications, maintaining access to past versions is beneficial or even essential. A persistent data structure supports such accesses, and in their seminal 1989 paper, Driscoll, Sarnak, Sleator, and Tarjan introduced general techniques for making ephemeral data structures persistent [20]. A partially-persistent data structure supports queries in all past versions of the data structure, but only the current version can be updated. Multiple authors have adapted these techniques to the external-memory model, developing partially-persistent B-trees that support updates and queries in worst-case $\mathcal{O}(\log_B N_v + K/B)$ I/Os, where $N_v$ is the size of the accessed version $v$, matching the performance of classical B-trees [3, 6, 32].

In this paper, we present the first buffered partially-persistent external-memory search tree that retains the optimal update and query performance of the (ephemeral) $B^{\varepsilon}$-tree. Our approach combines buffering techniques, which are essential for efficient updates in external memory, with a geometric view of persistence.

1.1 The External-Memory Model

For problems on massive amounts of data that do not fit in internal memory, the standard model of computation is the I/O-model by Aggarwal and Vitter [1]. In this model, all computation occurs in an internal memory of size $M$, while an infinite external memory is used for storage. Data is transferred between internal and external memory in blocks of $B$ consecutive elements, with each transfer counting as an I/O. The I/O complexity of an algorithm is defined as the total number of I/Os it performs, and the space usage is the maximum number of external-memory blocks used at any given time. The only operations we allow on stored keys are comparisons, and we follow the standard assumption that the parameters satisfy $B\geq 2$ and $M\geq 2B$. Aggarwal and Vitter proved that the optimal bound for sorting in external memory is $\mathit{sort}(N)=\Theta(\frac{N}{B}\log_{M/B}\frac{N}{B})$ I/Os [1]. An algorithm is called cache-oblivious if it is designed without explicit knowledge of $B$ and $M$ but is still analyzed in the I/O model for arbitrary values of these parameters, assuming an optimal offline cache replacement strategy [22]. Some authors make stronger assumptions on the size of the internal memory, such as the tall-cache assumption $M\geq B^{1+\delta}$, for some constant $\delta>0$. For cache-oblivious algorithms, a tall-cache assumption is necessary to achieve optimal comparison-based external-memory sorting [11].
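As a concrete (hypothetical) instance of the sorting bound, take $N=10^{9}$ elements, block size $B=10^{3}$, and memory $M=10^{6}$:
\[
\mathit{sort}(N)\;=\;\frac{N}{B}\log_{M/B}\frac{N}{B}\;=\;10^{6}\cdot\log_{10^{3}}10^{6}\;=\;2\cdot 10^{6}\ \text{I/Os},
\]
i.e., the cost of only two linear scans of the data.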

Considering the I/O-behavior of algorithms can be crucial in practice, as demonstrated by Streaming B-trees [8], the generation of massive graphs for the LFR benchmark [24, 27, 28], and the FlashAttention algorithm used in Transformer models [15].

1.2 Interface of a Partially-Persistent Search Tree

A partially-persistent search tree stores an ordered set of keys supporting the interface below (in our examples we use integers, but our data structure works for any totally ordered set); a minimal reference sketch of these semantics is given after the list. Each version is identified by a unique integer version identifier $v$, with zero being the initial version and the current version denoted by $v_c$. Further, we let $\mathcal{S}_v$ denote the set of keys contained in version $v$, and $N_v$ the size of $\mathcal{S}_v$. Initially $v_c=0$ and $\mathcal{S}_{v_c}=\emptyset$. Updates (insertions and deletions) can only be performed on the current set $\mathcal{S}_{v_c}$, and any update advances the current version identifier, i.e., each version $\mathcal{S}_v$ of the set differs from the previous version $\mathcal{S}_{v-1}$ by at most a single key. Queries can be performed on any version.

  • Insert(x) Creates $\mathcal{S}_{v_c+1}=\mathcal{S}_{v_c}\cup\{x\}$, increments $v_c$, and returns $v_c$.

  • Delete(x) Creates $\mathcal{S}_{v_c+1}=\mathcal{S}_{v_c}\setminus\{x\}$, increments $v_c$, and returns $v_c$.

  • Range(v,x,y) Reports all keys in $\mathcal{S}_v\cap[x,y]$ in increasing order.

  • Search(v,x) Returns the successor of $x$ in $\mathcal{S}_v$, i.e., $\min\{y\in\mathcal{S}_v \mid x\leq y\}$.
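The following is a minimal in-memory Python sketch of these semantics (our illustration; it simply replays the update log and makes no attempt at the I/O efficiency that is the subject of this paper):

```python
import bisect

class NaivePartiallyPersistentSet:
    """Reference semantics only: version v is reconstructed by replaying the
    first v updates of the log; keys may be any totally ordered values."""

    def __init__(self):
        self.updates = []        # update creating version i+1 is self.updates[i]
        self.v_c = 0             # current version identifier

    def insert(self, x):         # S_{v_c+1} = S_{v_c} union {x}
        self.updates.append((x, True))
        self.v_c += 1
        return self.v_c

    def delete(self, x):         # S_{v_c+1} = S_{v_c} minus {x}
        self.updates.append((x, False))
        self.v_c += 1
        return self.v_c

    def _set_at(self, v):        # replay the first v updates
        s = set()
        for key, is_insert in self.updates[:v]:
            (s.add if is_insert else s.discard)(key)
        return sorted(s)

    def search(self, v, x):      # successor: min{y in S_v | x <= y}, or None
        s = self._set_at(v)
        i = bisect.bisect_left(s, x)
        return s[i] if i < len(s) else None

    def range(self, v, x, y):    # keys of S_v in [x, y], increasing order
        return [k for k in self._set_at(v) if x <= k <= y]
```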

1.3 Previous Work

In internal memory, the fat node and node copying techniques can make any ephemeral linked data structure partially persistent with constant overhead in both time and space, if the in-degree of each node in the ephemeral structure is constant [20]. Becker, Gschwind, Ohler, Seeger, and Widmayer [6] and Varman and Verma [32] adapted these techniques to B-trees in external memory. An elegant application of partial persistence appears in the design of linear-space planar point location data structures [31]. In this setting, the underlying set consists of segments which are partially ordered (only a pair of segments intersected by a common vertical line can be compared). To adapt this approach to the external-memory setting, Arge, Danner, and Teh strengthened the partially-persistent B-tree to require only a total order on keys alive at any given version, leading to a static external-memory point-location structure [3].

A different approach to persistence is to interpret it geometrically, see Figure 1(left), modeling it as a data structure problem on a dynamic set of vertical segments. Kolovson and Stonebraker explored this perspective [26], though their reliance on R-trees led to poor performance guarantees [23]. More recently, Brodal, Rysgaard, and Svenning [12] leveraged this geometric approach to develop fully persistent B-trees, which allow both queries and modifications to all past versions in $\mathcal{O}(\log_B N_v)$ I/Os. In a fully persistent data structure, updating a version corresponds to cloning it and then applying the modification to the newly cloned version, ensuring that existing versions remain unaffected. Such behavior contrasts with retroactive data structures [17], where updates recursively propagate to cloned versions.

Concurrently with the work on persistent data structures in external memory, there were significant improvements to external-memory data structures by leveraging buffering techniques to always process multiple updates and/or queries together. These include the Buffer Tree by Arge [2], which can form the basis for external-memory sorting, priority queues, and batched dynamic algorithms [21] in amortized $\mathcal{O}(\frac{1}{B}\log_{M/B}\frac{N}{B})$ I/Os per operation. For a batched operation the answer might not be immediately returned, which is often sufficient, e.g., in many geometric plane-sweep algorithms where only the end result matters. For standard (non-batched) data structures, a line of work has investigated the update-query trade-off, beginning with the Buffered Repository Tree [14] performing updates in amortized $\mathcal{O}(\frac{1}{B}\log_B N)$ I/Os and queries in $\mathcal{O}(\log_2 N)$ I/Os. This was later generalized by the $B^{\varepsilon}$-tree [10], which corresponds to the Buffered Repository Tree for $\varepsilon\to 0$ and to the standard B-tree for $\varepsilon\to 1$. The amortized performance of the $B^{\varepsilon}$-tree was improved to high-probability [7] and worst-case [16] I/O bounds using stronger assumptions on the size of $B$ and $M$ (see Table 1).

Table 1: Overview of results on the I/O complexity of ephemeral and partially-persistent search trees. Results marked “am.” hold amortized, and results marked “rand.” are randomized and hold with high probability. All other bounds are worst case. The parameter $\varepsilon$ must satisfy $0<\varepsilon<1$. All results assume $M=\Omega(B)$; [7] further assumes $B=\Omega(\log N)$ and $M=\Omega(\max\{B\log^{\Theta(1)}N, B^{2}\})$; [16] assumes $M=\Omega(B\log_B N)$; and Theorem 2 assumes $M=\Omega(B^{1-\varepsilon}\log_2(\max_v N_v))$. For both queries and updates in [7, 16], we include the multiplicative dependency on $\frac{1}{\varepsilon}$ (which can be omitted when treating $\varepsilon$ as a constant), allowing, for example, setting $\varepsilon=\frac{1}{\log_2 B}$. All ephemeral results use space linear in $N$ and all partial-persistence results use space linear in the total number of updates.
Ephemeral
Bayer and McCreight [5] | Range: $\mathcal{O}(\log_B N+K/B)$ | Update: $\mathcal{O}(\log_B N)$
Brodal and Fagerberg [10] | Range: $\mathcal{O}(\frac{1}{\varepsilon}\log_B N+K/B)$ am. | Update: $\mathcal{O}(\frac{1}{\varepsilon B^{1-\varepsilon}}\log_B N)$ am.
Bender, Das, Farach-Colton, Johnson, and Kuszmaul [7] | Range: $\mathcal{O}(\frac{1}{\varepsilon}\log_B N+K/B)$ | Update: $\mathcal{O}(\frac{1}{\varepsilon B^{1-\varepsilon}}\log_B N)$ rand.
Das, Iacono, and Nekrich [16] | Range: $\mathcal{O}(\frac{1}{\varepsilon}\log_B N+K/B)$ | Update: $\mathcal{O}(\frac{1}{\varepsilon B^{1-\varepsilon}}\log_B N)$

Partially persistent
Becker, Gschwind, Ohler, Seeger, and Widmayer [6]; Varman and Verma [32]; Arge, Danner, and Teh [3] | Range: $\mathcal{O}(\log_B N_v+K/B)$ | Update: $\mathcal{O}(\log_B N_v)$
This paper (Theorem 1) | Range: $\mathcal{O}(\frac{1}{\varepsilon}\log_B N_v+K/B)$ am. | Update: $\mathcal{O}(\frac{1}{\varepsilon B^{1-\varepsilon}}\log_B N_v)$ am.
This paper (Theorem 2) | Range: $\mathcal{O}(\frac{1}{\varepsilon}\log_B N_v+K/B)$ | Update: $\mathcal{O}(\frac{1}{\varepsilon B^{1-\varepsilon}}\log_B N_v)$
This paper (Theorem 3) | Range: $\mathcal{O}(\frac{1}{\varepsilon}\log_B N_v+\gamma+K/B)$ | Update: $\mathcal{O}(\frac{1}{B^{1-\varepsilon}}(\frac{1}{\varepsilon}\log_B N_v+\gamma))$

where $\gamma=\mathit{sort}(B^{1-\varepsilon}\log_2 N_v)=\frac{\log_2 N_v}{B^{\varepsilon}}\log_{M/B}\frac{\log_2 N_v}{B^{\varepsilon}}$.

1.4 Contribution

Combining the two lines of research on persistence and buffered data structures has remained an open challenge for the past 20 years, likely due to their seemingly conflicting principles. Persistence requires maintaining access to past versions without affecting their structure, while buffers essentially hold updates to past versions before applying them. Our work demonstrates that these two ideas can be effectively unified by developing partially-persistent external-memory search trees that achieve bounds matching those of ephemeral $B^{\varepsilon}$-trees. In Section 2 we prove the following theorem.

Theorem 1.

Given any parameter $0<\varepsilon<1$ and $M\geq 2B$, there exist partially-persistent external-memory search trees over any totally ordered set that support Insert and Delete in amortized $\mathcal{O}(\frac{1}{\varepsilon B^{1-\varepsilon}}\log_B N_v)$ I/Os, Search in amortized $\mathcal{O}(\frac{1}{\varepsilon}\log_B N_v)$ I/Os, and Range in amortized $\mathcal{O}(\frac{1}{\varepsilon}\log_B N_v+K/B)$ I/Os. Here $N_v$ denotes the number of keys contained in version $v$, and $K$ the number of keys reported by Range. The space usage is linear in the total number of updates.

Our construction is essentially a $B^{\varepsilon}$-tree with buffers of $\mathcal{O}(B)$ delayed updates at each internal node, leaves storing $\Theta(\frac{1}{\varepsilon}B\log_B N_v)$ updates, and where partial persistence is obtained using path copying of nodes, where an internal node is only copied when its buffer is empty.

The query Search can trivially also answer a member query “$x\in\mathcal{S}_v$?” by checking if Search(v,x) returns $x$. Our data structure also supports predecessor queries along with successor queries, as well as strict predecessor and successor queries, i.e., where the returned key should be strictly smaller or larger than the query key $x$. The structure can also handle the case where $\mathcal{S}_0\neq\emptyset$, and the initial structure can be constructed using $\mathcal{O}(1+|\mathcal{S}_0|/B)$ I/Os (essentially this is Section 2.6, where a structure is constructed for a given sorted set of keys). Our data structure is stated as maintaining a set of keys, but it can easily be extended to support dictionaries storing key-value pairs (each vertical segment in Figure 1 now stores a key-value pair, where the first axis is the key; changing the value for key $x$ at version $v$ starts a new vertical segment at $(x,v)$ with the new value).

In Section 3 we describe how to convert the amortized I/O bounds of Theorem 1 to worst-case bounds under the assumption that $M=\Omega(B^{1-\varepsilon}\log_2(\max_v N_v))$ (Theorem 2), a weaker or equal assumption on the memory size than used in [7] and [16] for high-probability and worst-case bounds for $B^{\varepsilon}$-trees, respectively. Under the weakest assumption that $M\geq 2B$, we achieve the worst-case bounds in Theorem 3 with an additional term of at most $\mathcal{O}(\mathit{sort}(B^{1-\varepsilon}\log_2 N_v))$ I/Os, where $B^{1-\varepsilon}\log_2 N_v$ is an upper bound on the number of buffered updates on a root-to-leaf path in our buffered $B^{\varepsilon}$-tree that should be flushed to a leaf. For updates, where the I/O bound can be $o(1)$, we assume that the I/Os are performed evenly spread out among the updates.

Theorem 2.

Given any parameter $0<\varepsilon<1$ and $M=\Omega(B^{1-\varepsilon}\log_2(\max_v N_v))$, there exist partially-persistent external-memory search trees over any totally ordered set that support Insert and Delete in worst-case $\mathcal{O}(\frac{1}{\varepsilon B^{1-\varepsilon}}\log_B N_v)$ I/Os, Search in worst-case $\mathcal{O}(\frac{1}{\varepsilon}\log_B N_v)$ I/Os, and Range in worst-case $\mathcal{O}(\frac{1}{\varepsilon}\log_B N_v+K/B)$ I/Os. Here $N_v$ denotes the number of keys contained in version $v$, and $K$ the number of keys reported by Range. The space usage is linear in the total number of updates.

Theorem 3.

Given any parameter $0<\varepsilon<1$ and $M\geq 2B$, there exist partially-persistent external-memory search trees over any totally ordered set that support Insert and Delete in worst-case $\mathcal{O}(\frac{1}{B^{1-\varepsilon}}(\frac{1}{\varepsilon}\log_B N_v+\gamma))$ I/Os, Search in worst-case $\mathcal{O}(\frac{1}{\varepsilon}\log_B N_v+\gamma)$ I/Os, and Range in worst-case $\mathcal{O}(\frac{1}{\varepsilon}\log_B N_v+\gamma+K/B)$ I/Os, where $\gamma=\mathit{sort}(B^{1-\varepsilon}\log_2 N_v)$. Here $N_v$ denotes the number of keys contained in version $v$, and $K$ the number of keys reported by Range. The space usage is linear in the total number of updates.

Note that, for example, when $N_v=2^{\mathcal{O}(B^{\varepsilon})}$ then $B^{1-\varepsilon}\log_2 N_v=\mathcal{O}(B)$ and $\gamma=\mathcal{O}(1)$, and the I/O bounds of Theorem 3 match those of Theorem 2, with only the assumption $M\geq 2B$. This observation can be further strengthened: the I/O bounds match similarly whenever $\gamma=\mathcal{O}(\frac{1}{\varepsilon}\log_B N_v)$, which holds when $N_v=2^{B^{\varepsilon}(M/B)^{\mathcal{O}(B^{\varepsilon}/(\varepsilon\log_2 B))}}$.
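To see the first claim, note that $B^{1-\varepsilon}\log_2 N_v=B^{1-\varepsilon}\cdot\mathcal{O}(B^{\varepsilon})=\mathcal{O}(B)$ buffered updates occupy $\mathcal{O}(1)$ blocks, so
\[
\gamma\;=\;\mathit{sort}(\mathcal{O}(B))\;=\;\mathcal{O}\Bigl(\frac{B}{B}\log_{M/B}\frac{B}{B}\Bigr)\;=\;\mathcal{O}(1).
\]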

Outline of Data Structure.

Figure 1: (Left) A list of updates performed on an initially empty set and the geometric interpretation of the updates. Vertical lines illustrate the half-open intervals of versions containing a key. Note that the key 3 is contained in versions $[8,10[$, i.e., versions 8 and 9, whereas the key 2 is contained in version 6 and versions $[12,\infty[$. The topmost dashed line shows that version 11 of the set is $\mathcal{S}_{11}=\{1,4,5\}$, the dashed line segment at version 9 shows that the result of the query Range(9,2,7) is $\{3,4,5,6\}$, and the bottommost dashed arrow shows that the result of the successor search Search(4,2) is 6. (Right) A $B^{\varepsilon}$-tree of the open rectangles and, below, the geometric interpretation of the updates, split into multiple rectangles, where gray rectangles are open rectangles. The black endpoints represent updates present in the rectangles and the gray endpoints represent buffered updates present in the buffers at the internal nodes of the $B^{\varepsilon}$-tree. A query Range(v,x,y) is represented as the dashed line between two square endpoints, spanning rectangles $r_1$, $r_2$, $r_3$, and $r_4$, and a successor query Search(w,z) is represented as the dashed arrow from a square endpoint, spanning two rectangles $r_5$ and $r_6$. Black dots on vertical segments correspond to the upper endpoint of the segment in the rectangle below and the lower endpoint of the segment in the rectangle above.

Previous work on partially-persistent search trees in external memory directly adapted the general pointer-based transformations for persistence [20]. In contrast, our approach embraces the geometric interpretation of partial persistence as in [12], where the state of the data structure is embedded in a two-dimensional plane with keys on the first axis and versions on the second axis, see Figure 1(left). Under this interpretation, each update corresponds to the start or end of a vertical segment in the plane. Since partially-persistent updates are applied to the current version, they always affect the top of the plane. Successor and predecessor queries correspond to horizontal ray shooting to the right and left, respectively, and range queries to reporting the intersections between a horizontal query segment and the vertical segments.

To efficiently update and query the geometric view, we partition the plane into rectangles, each containing $\Theta(\frac{1}{\varepsilon}B\log_B\bar{N})$ vertical segments stored in lexicographic order. For now, we assume that all versions have size $\Theta(\bar{N})$, for a fixed $\bar{N}$. This assumption is lifted using global rebuilding, see Section 2.6. In the geometric persistent view, a vertical segment crossing multiple rectangles is split into multiple smaller segments, one for each rectangle, and each smaller segment is inserted into one rectangle.

At a high level, our data structure is divided into two parts, see Figure 1(right). The top part consists of all the open rectangles containing the current set $\mathcal{S}_{v_c}$, which may still be updated. The entry point of this data structure is a $B^{\varepsilon}$-tree $T$ on the key axis to facilitate buffered updates and to find the relevant rectangle(s) for updates and queries. Since updates are buffered, the geometric view stored in the rectangles may be incomplete: buffered updates (segment endpoints) are only added to a rectangle when buffers are flushed. The bottom part consists of all the finalized rectangles, i.e., rectangles which can be queried but not updated. The entry point to the bottom part is a point location data structure $P$ used to find the relevant rectangle(s) for a query. We implement $P$ as an external-memory adaptation of the classical planar point location solution using partial persistence [3, 31], more specifically a B-tree with path copying during updates.

2 The Buffered Persistent Data Structure

In this section, we describe our partially-persistent $B^{\varepsilon}$-tree structure. Versions are identified by the integers $0,1,2,\ldots$, where $v_c$ denotes the identifier of the current version. We let $\mathcal{S}_v$ denote the set at version $v$, where keys are from some totally ordered set. The initial set $\mathcal{S}_0=\emptyset$, and $\mathcal{S}_{v+1}=\mathcal{S}_v\cup\{x\}$ if the $(v+1)$'th update is Insert(x), and $\mathcal{S}_{v+1}=\mathcal{S}_v\setminus\{x\}$ if the $(v+1)$'th update is Delete(x). Note that $\mathcal{S}_{v+1}=\mathcal{S}_v$ if the $(v+1)$'th update inserts a key already in $\mathcal{S}_v$ or deletes a key not in $\mathcal{S}_v$.

2.1 Geometric Interpretation of Partial Persistence

The problem has a natural geometric interpretation in a two-dimensional space, with the first dimension representing the keys and the second dimension representing the versions, see Figure 1(left). On this two-dimensional plane, a key $x$ existing in a half-open interval of versions $[v,w[$ is represented by the vertical line segment $\{x\}\times[v,w[$, i.e., $x$ is inserted in version $v$ and deleted in version $w$. If $x\in\mathcal{S}_{v_c}$, then $w=+\infty$ ($x$ has not been deleted yet).
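The translation from an update log to this geometric view is mechanical; the following Python sketch (our illustration) computes the vertical segments of Figure 1(left) from the update sequence:

```python
def to_segments(updates):
    """Map an update log to vertical segments {x} x [v, w).
    updates[i] is a pair (key, is_insert) creating version i+1;
    keys still present in the last version get w = infinity."""
    open_at, segments = {}, []
    for v, (key, is_insert) in enumerate(updates, start=1):
        if is_insert:
            open_at.setdefault(key, v)   # inserting a present key leaves the set unchanged
        elif key in open_at:             # deleting an absent key is a no-op
            segments.append((key, open_at.pop(key), v))
    segments += [(key, v, float("inf")) for key, v in open_at.items()]
    return sorted(segments)

# E.g., if the 8th update is Insert(3) and the 10th is Delete(3), the result
# contains the segment (3, 8, 10): key 3 is contained in versions 8 and 9.
```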

2.2 Partitioning the Plane into Rectangles

We consider the sequence of versions partitioned into intervals $[v_0,v_1[,\ldots,[v_{k-1},v_k[,[v_k,\infty[$, for some versions $0=v_0<v_1<\cdots<v_k\leq v_c$. In Section 2.6 we show how to maintain the version intervals. We let $\bar{v}=v_k$ and $\bar{N}=|\mathcal{S}_{\bar{v}}|$. In the following, we consider the interval $[\bar{v},\infty[$, which contains the current version $v_c$ of the set. We allow up to $\alpha\bar{N}$ partially-persistent updates during this interval of versions, for a constant $0<\alpha<1$, i.e., all versions $v$, $\bar{v}\leq v\leq v_c$, satisfy $(1-\alpha)\bar{N}\leq|\mathcal{S}_v|\leq(1+\alpha)\bar{N}$.

Our data structure is built around four parameters:

\[
\Delta = B^{\varepsilon}
\qquad
H = 1 + \lceil\log_{\Delta}\bar{N}\rceil
\qquad
F = B^{1-\varepsilon}
\qquad
R = H\cdot 2\Delta F
\]

The basic idea is to have a $B^{\varepsilon}$-tree $T$ of degree at most $2\Delta-1$ (and degree at least $\Delta$ if only insertions can be performed, and degree at least 1 if deletions are allowed), where leaves (open rectangles) store between $4R$ and $10R$ updates (see Section 2.4) and all leaves are at the same level of $T$. In Section 2.5 we prove that $H$ is an upper bound on the height of $T$ (the number of nodes on a root-to-leaf path, excluding the leaves). Each internal node of $T$ will have a buffer of at most $2\Delta F=\Theta(B)$ updates yet to be applied to the leaves of the subtree rooted at the node. Note that $R$ is an upper bound on the total number of buffered updates along a root-to-leaf path in $T$. The essential property of the parameters is that $R/B=\Theta(H)=\Theta(\frac{1}{\varepsilon}\log_B\bar{N})$.
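For concreteness, the parameter choices can be computed as follows (a Python sketch; the rounding to integers and the example values are our own, hypothetical choices):

```python
import math

def parameters(B, eps, N_bar):
    """Parameters of Section 2.2: Delta = B^eps, H = 1 + ceil(log_Delta N_bar),
    F = B^(1-eps), and R = H * 2 * Delta * F."""
    Delta = max(2, round(B ** eps))             # fan-out parameter
    H = 1 + math.ceil(math.log(N_bar, Delta))   # height bound of T
    F = max(1, round(B ** (1 - eps)))           # batch size moved per overflow
    R = H * 2 * Delta * F                       # max buffered updates on a root-to-leaf path
    return Delta, H, F, R

# Hypothetical instance: B = 1024, eps = 1/2, N_bar = 10**9.
B, eps, N_bar = 1024, 0.5, 10**9
Delta, H, F, R = parameters(B, eps, N_bar)
assert Delta * F == B     # hence R = 2*H*B, i.e. R/B = Theta(H)
```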

The geometric plane defined in Section 2.1 is partitioned into a set of axis-aligned rectangles $[x,y[\times[v,w[$, such that the number of updates in each rectangle is $\Theta(R)$. For each rectangle we store a list of the updates in the rectangle in lexicographical order, first by the key and secondly by the version of the update. Note that equal keys are grouped consecutively in the list. To allow efficiently locating a rectangle for a given version and key, we store a list indexed by version identifier, where for each version $v$ we store a pointer to the root of a B-tree $P_v$ over the rectangles left-to-right containing $\mathcal{S}_v$ (see Section 2.4). The tree $P_v$ has degree $\Theta(B)$ and does not store buffered updates. Further, we require that each rectangle contains $\Omega(R)$ keys which are present in all versions the rectangle spans. We denote such a key as spanning. If $\bar{N}=\mathcal{O}(R)$, all updates are stored in a single list.

New updates are buffered to achieve I/O-efficient update bounds. The topmost rectangles, which cover the current version $v_c$, are all open, with all other rectangles being closed. Crucially, new updates are always performed in the current version. We maintain the invariant that for a buffered update, i.e., an update not yet having been flushed to the corresponding rectangle, the rectangle must be open.

For the open rectangles, we store a $B^{\varepsilon}$-tree $T$, such that recent updates to the open rectangles are buffered. We let the maximum degree of an internal node in $T$ be $2\Delta-1$. Each internal node of $T$ contains a buffer of up to $2\Delta F$ updates, sorted lexicographically by key and version. Additionally, each update records whether it is an insertion or a deletion. Consider a full buffer, i.e., one containing at least $2\Delta F$ updates, where each update should be flushed to one of the at most $2\Delta-1$ children. Then there must exist a subset of at least $F$ updates that should be flushed to the same child.

The setup is illustrated in Figure 1(right). The vertical black and gray lines represent the version intervals containing a key. The black lines represent updates present in the list of updates contained in that rectangle, while the gray lines and endpoints represent updates contained in buffers of $T$, which are illustrated at the top of the figure.

2.3 Handling Queries and Updates

When performing Search(v,x), first the rectangle $r$ covering the point $(x,v)$ in the plane must be found. By using the B-tree $P_v$ associated with version $v$, $r$ can be found using $\mathcal{O}(H)$ I/Os. If $r$ is closed, then all updates inside $r$ are contained in the sorted list of updates stored in $r$, and these can be scanned in $\mathcal{O}(R/B)$ I/Os. If $r$ is open, then the result of the successor query may be affected by buffered updates, which are not stored in $r$. The rectangle $r$ must therefore be actualized, by merging all updates in buffers on the path from the root of $T$ to $r$ with the updates in $r$. The details of this operation are described in Section 2.4, where the actualize operation is shown to take amortized $\mathcal{O}(H)$ I/Os. After $r$ is actualized, the query continues by scanning the updates of $r$. If the result of the successor query is not contained in the rectangle $r$, then by the spanning requirement, the result of the query must be in the neighboring rectangle to the right, which similarly is actualized if it is open. In total, the operation spends amortized $\mathcal{O}(H+R/B)=\mathcal{O}(\frac{1}{\varepsilon}\log_B\bar{N})$ I/Os. Note that the operation can easily be modified to support member, predecessor, and strict predecessor or successor queries.
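For illustration, the scan of a single closed (or freshly actualized) rectangle can be sketched as follows; the update-list layout follows Section 2.2, and we assume, as our own convention, that keys present at the rectangle's first version appear in the list as insertions at that version:

```python
from bisect import bisect_left

def rectangle_successor(updates, v, x):
    """Successor of x at version v within one rectangle (sketch).
    `updates` is sorted lexicographically by (key, version); each entry is a
    tuple (key, version, is_insert)."""
    i = bisect_left(updates, (x,))            # first update with key >= x
    while i < len(updates):
        key, present = updates[i][0], False
        while i < len(updates) and updates[i][0] == key:
            if updates[i][1] <= v:            # latest update to `key` at version <= v
                present = updates[i][2]
            i += 1
        if present:                           # key is contained in version v
            return key
    return None   # the successor, if any, lies in the neighboring rectangle to the right
```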

A Range(v,x,y) query is performed very similarly to a Search query. The query may, however, touch more than two rectangles. Note that for the at most two rectangles containing the endpoints of the query we do not necessarily report all keys they contain at version $v$. These rectangles can be found using amortized $\mathcal{O}(H+R/B)$ I/Os by the argument above. For each intermediate rectangle accessed (and possibly actualized if it is open), the spanning requirement ensures that a constant fraction of the updates scanned result in reported keys. Since accessing a rectangle takes amortized $\mathcal{O}(H+R/B)$ I/Os, and each intermediate rectangle contains $\Theta(R)=\Theta(BH)$ keys at version $v$, amortized $\mathcal{O}(1/B)$ I/Os are spent for each key reported for an intermediate rectangle. In total, a Range query reporting $K$ keys takes amortized $\mathcal{O}(H+R/B+K/B)=\mathcal{O}(\frac{1}{\varepsilon}\log_B\bar{N}+K/B)$ I/Os.

Each update, either an Insert or a Delete, is applied to the current version $v_c$ of the set. The $B^{\varepsilon}$-tree $T$ contains all buffered updates to the open rectangles, which cover the current version $v_c$. For an update operation, a tuple with the update and $v_c$ is added to the root buffer of $T$, which is stored in internal memory. In Section 2.4 it is shown that adding the update to the root buffer and handling possible buffer overflows takes amortized $\mathcal{O}(H/F)=\mathcal{O}(\frac{1}{\varepsilon B^{1-\varepsilon}}\log_B\bar{N})$ I/Os.

2.4 Flushing Buffers

To argue about the amortized cost of flushing the content of buffers down the tree $T$, we let the potential of each buffered update be $1/F$ multiplied by the height of the buffer the update is stored in, with the root buffer being at the largest height. One unit of released potential can cover $\mathcal{O}(1)$ I/Os. When adding an update to the tree, the root buffer is always stored in internal memory, and therefore no I/Os are needed to access it. However, the potential increases by at most $H/F$, and the operation therefore uses amortized $\mathcal{O}(H/F)$ I/Os.

Buffer Overflows.

Each buffer at an internal node of $T$ contains at most $2\Delta F$ updates. If a buffer comes to contain more than $2\Delta F$ updates, then a buffer overflow is performed. Since each node of $T$ has at most $2\Delta-1$ children, at least $F$ updates from the buffer must be destined for the same child. These updates can be moved together to the buffer of that child.

A buffer overflow can happen in two cases: either when an update is placed into the root buffer as the result of an update operation, or when updates are placed into a buffer because the parent buffer is overflowing. Moving exactly $F$ updates out of a buffer is always sufficient to make an overflowing buffer non-overflowing again. A buffer overflow therefore only moves down a single path of $T$.

An overflowing buffer can be stored in $\mathcal{O}(1)$ blocks, since $2\Delta F+F=\mathcal{O}(B)$, and therefore the $F$ updates to move can be found in $\mathcal{O}(1)$ I/Os. If the overflowing updates are moved to a child buffer, they can be inserted via a merge in $\mathcal{O}(1)$ I/Os. As $F$ updates are moved one level down the tree, the potential decreases by 1, which is enough to cover the $\mathcal{O}(1)$ I/O cost of the overflow operation.

If the child is not an internal node of $T$ but an open rectangle, then merging the $F$ overflowing updates into the list of updates in the rectangle takes $\mathcal{O}(R/B)=\mathcal{O}(H)$ I/Os. As buffer overflows only move down a single path of the tree, $\Omega(H)$ overflows must have occurred before the overflow reaches the open rectangle. Merging the overflow into the list of updates in the open rectangle therefore does not increase the asymptotic amortized number of I/Os performed.
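A sketch of one overflow step follows (Python; `child_of`, `children`, `is_open_rectangle`, and `merge_updates` are hypothetical names, and for brevity the sketch does not keep buffers sorted):

```python
def flush_overflowing_buffer(node, F, buffer_cap):
    """One buffer-overflow step (Section 2.4): when the buffer exceeds its
    capacity 2*Delta*F, some child (out of at most 2*Delta - 1) is owed at
    least F updates by pigeonhole; moving exactly F of them restores the
    invariant, so an overflow proceeds down a single path of T."""
    pending = {}                                  # updates grouped by receiving child
    for u in node.buffer:
        pending.setdefault(node.child_of(u.key), []).append(u)
    j = max(pending, key=lambda c: len(pending[c]))
    batch = pending[j][:F]                        # at least F exist; move exactly F
    moved = set(map(id, batch))
    node.buffer = [u for u in node.buffer if id(u) not in moved]
    child = node.children[j]
    if child.is_open_rectangle:                   # leaf: O(R/B)-I/O sorted merge
        child.merge_updates(batch)
    else:
        child.buffer.extend(batch)                # a real implementation merges, keeping sort order
        if len(child.buffer) > buffer_cap:
            flush_overflowing_buffer(child, F, buffer_cap)
```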

Actualizing.

When actualizing an open rectangle, all buffered updates to that open rectangle must be moved into the rectangle. Note that all relevant updates are in the buffers on the root-to-leaf path in $T$ to the open rectangle. Each of these buffers contains at most $2\Delta F=\mathcal{O}(B)$ buffered updates. The total number of buffered updates on the path is at most $H\cdot 2\Delta F=R$. For each node on the path from the root down to the rectangle the following is done. Let $U$ be the relevant updates from the levels above in sorted order. Initially, $U$ is empty. To extend $U$ at each level top-down, $U$ is merged with the relevant updates of the next buffer. This requires $\mathcal{O}(1+|U|/B)$ I/Os, by scanning $U$ and the buffer. Since the $|U|$ updates are moved one level down, they release potential $|U|/F\geq|U|/B$, which can cover $\Theta(|U|/B)$ I/Os, i.e., the amortized cost for actualizing one level of the tree is $\mathcal{O}(1)$ I/Os. As there are $\mathcal{O}(H)$ levels of the tree, the at most $R$ relevant updates can be found in sorted order in amortized $\mathcal{O}(H)$ I/Os. They can then be merged with the updates in the open rectangle in $\mathcal{O}(R/B)$ I/Os. In total, the actualize operation requires amortized $\mathcal{O}(H+R/B)=\mathcal{O}(H)$ I/Os.
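The top-down merge can be sketched as follows (Python; `relevant_to` and the node and rectangle fields are hypothetical names):

```python
from heapq import merge

def actualize(path, rect):
    """Actualize (Section 2.4): collect, level by level, the buffered updates
    on the root-to-leaf path that are routed towards `rect`, then merge them
    into the rectangle's sorted update list."""
    U = []                                   # relevant updates from levels above, sorted
    for node in path:                        # top-down over the O(H) internal nodes
        rel = [u for u in node.buffer if node.relevant_to(u, rect)]
        node.buffer = [u for u in node.buffer if not node.relevant_to(u, rect)]
        U = list(merge(U, rel))              # one O(1 + |U|/B)-I/O merge per level
    rect.updates = list(merge(rect.updates, U))   # final O(R/B)-I/O merge
```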

Finalizing.

Each open rectangle is allowed to receive between $R$ and $2R$ updates before it is finalized, converting it into a closed rectangle. When finalizing an open rectangle, all buffered updates to the rectangle are removed from $T$ and merged with the rectangle, to ensure that all buffered updates in $T$ are only to open rectangles. We now argue that open rectangles receive at most $2R$ updates in total when we finalize a rectangle as soon as $R$ updates have been added to it. A finalize operation can be triggered by an actualize operation or by a buffer overflow. In both cases the number of updates in the rectangle before the operation is at most $R-1$. An actualize operation may add at most $R$ buffered updates to a rectangle, i.e., at most $2R-1$ total updates are placed in a finalized rectangle. If the rectangle receives an update from a buffer overflow, then the overflow must have been triggered by an update in the root buffer. The buffered updates added to the rectangle can only be the at most $R$ updates in buffers on the path, plus the new update, so in total at most $(R-1)+R+1=2R$ updates are placed in the open rectangle. Thus, by finalizing a rectangle as soon as it has received at least $R$ updates, it will receive between $R$ and $2R$ updates.

Spanning Requirement.

We require that the first version of a rectangle contains $[4R,8R[$ keys. When finalizing a rectangle, $[R,2R]$ updates have been performed, and therefore at least $2R$ of the initial keys are still present, that is, they span all versions of the rectangle. This ensures that the $\Omega(R)$ spanning-keys requirement is met. When finalizing a rectangle, $[2R,10R[$ keys are contained in the rectangle at version $v_c$. New open rectangles must be created to span the key range of the closed rectangle, containing the keys present at version $v_c$. If $[4R,8R[$ keys are present at version $v_c$, then a single rectangle suffices. If $[8R,10R[$ keys are present at version $v_c$, then the range is split into two rectangles at the median key, both with $[4R,5R]$ keys, and $T$ must be updated as described below. Otherwise, at version $v_c$ the rectangle contains $[2R,4R[$ keys. In this case a sibling rectangle is finalized, to allow for a merge of two rectangles. The sibling rectangle holds $[2R,10R[$ keys at version $v_c$. Concatenating the keys present at version $v_c$ from the two closed rectangles gives a sorted list of $[4R,14R[$ keys for a new open rectangle. A split may need to be performed if there are $8R$ or more keys, i.e., the result is one or two new open rectangles.

Updating the $B^{\varepsilon}$-Tree $T$.

After finalizing open rectangles, the $B^{\varepsilon}$-tree must be updated accordingly. If an open rectangle was split, then a new child is added to the parent node in $T$ of the updated rectangle. If this increases the degree of the node to $2\Delta$, it is split by distributing its children into two new nodes, each with degree $\Delta$. Its buffer is also partitioned so that each buffered update is placed in the buffer containing its relevant child. Splitting the node further introduces a new child to the parent of the node. Note that this may cascade up the tree, but only on the path towards the root. If a merge of the rectangle occurs, then a child is deleted from the parent. When merging rectangles, the merged rectangles must be siblings in the tree. The rectangle is merged with the left or right neighbor rectangle that has the nearest common ancestor with the rectangle in $T$, to ensure that the key ranges of existing nodes only increase. This may cause the degree of nodes to drop below $\Delta$. Notably, we do not merge internal nodes of $T$, as this could create a large buffer that requires multiple flushes in different directions, known as flushing cascades [7]. We instead allow nodes to have a degree down to one, where deleting the last child of a node results in deleting the path of consecutive degree-one nodes from the child towards the root. As we show in Section 2.5, this does not affect the asymptotic height of the tree.

Updating the rectangles and the $B^{\varepsilon}$-tree $T$ upon finalizing therefore requires scanning $\mathcal{O}(R)$ keys and traversing a constant number of paths of length $\mathcal{O}(H)$ in $T$, which takes $\mathcal{O}(R/B+H)=\mathcal{O}(R/B)$ I/Os. As $\Omega(R)$ updates must be applied to a rectangle before finalizing it, this does not increase the asymptotic amortized cost of an update operation. Queries may also finalize rectangles, but already use amortized $\mathcal{O}(H)$ I/Os, so this causes no asymptotic query overhead.

Recall that we maintain a list that for any version $v$ stores a pointer to the root of a B-tree $P_v$ over the rectangles left-to-right for version $v$. If the set of rectangles is unchanged from version $v_c-1$ to $v_c$, we have $P_{v_c}=P_{v_c-1}$. If a rectangle is finalized during version $v_c$, the set of rectangles changes and we need to create a new $P_{v_c}$ from $P_{v_c-1}$. Since only $\mathcal{O}(1)$ rectangles are removed and created in the new version, we can create $P_{v_c}$ from $P_{v_c-1}$ partially persistently by standard B-tree updates and rebalancing using naive path copying [31], using $\mathcal{O}(\log_B(N_v/R))=\mathcal{O}(H)$ I/Os and $\mathcal{O}(H)=\mathcal{O}(R/B)$ blocks of space.
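Path copying itself can be sketched as follows (Python; `copy_node`, `child_index`, and `apply_leaf_change` are hypothetical helpers, and B-tree rebalancing is omitted):

```python
def path_copy_update(old_root, key, apply_leaf_change):
    """Partially-persistent B-tree step: copy only the O(H) nodes on the
    search path for `key`; all untouched subtrees are shared between the
    trees P_{v_c - 1} and P_{v_c}."""
    new_root = copy_node(old_root)                       # shallow node copy
    node = new_root
    while not node.is_leaf:
        i = node.child_index(key)                        # route by separator keys
        node.children[i] = copy_node(node.children[i])   # copy the path only
        node = node.children[i]
    apply_leaf_change(node)   # e.g., replace a finalized rectangle by its successors
    return new_root           # stored as P_{v_c} in the version-indexed root array
```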

Space Usage.

When finalizing a rectangle, $\Omega(R)$ updates must have occurred in that rectangle. New rectangles are then created, which in total copies $\mathcal{O}(R)$ updates, and one path of the tree is copied. As the height of the tree is at most $H=\mathcal{O}(R/B)$, and the updates of the rectangles are stored in lists, the newly allocated space is $\mathcal{O}(R/B)$ blocks, which can be amortized over the $\Omega(R)$ updates required for the finalization to happen. In addition to the updates, initially $\bar{N}$ keys are stored across $\mathcal{O}(\bar{N}/R)$ rectangles in lists, and an initial balanced $B^{\varepsilon}$-tree is built on these initial rectangles, causing an initial space usage of $\mathcal{O}(\bar{N}/B)$ blocks. As the structure allows for at most $\alpha\bar{N}$ updates, the total space usage is $\mathcal{O}(\bar{N}/B)$ blocks.

2.5 Bounding the Tree Height

In this section we show that $H$ is an upper bound on the height of the $B^{\varepsilon}$-tree $T$.

We define the weight of a node at height $i$ in $T$ to be the number of updates on keys in the key range of the node, and let $w_i$ be a lower bound for the weight of a node at height $i$. The rectangles are at height 0 of the tree, with the nodes of the tree starting at height 1. The updates comprise both the $\bar{N}$ initial keys and the up to $\alpha\bar{N}$ additional updates. It holds that the weight of a node is the sum of the weights of its children. By induction on the number of updates we show that $w_i\geq B\Delta^i$ for all nodes at all heights, except for the root. First note that the inequality holds for $i=0$, as any rectangle contains at least $4R\geq B$ updates when it is created. Initially, $\bar{N}$ updates are distributed into at most $\bar{N}/(4R)$ rectangles, where the number of updates in each rectangle is at least $4R\geq B$. Each internal node initially has degree $[\Delta,2\Delta[$, except for the root, which has degree $[2,2\Delta[$. By induction on the tree height $i$, it holds that the initial tree satisfies $w_i\geq B\Delta^i$, except for the root. Each update affects a root-to-leaf path of the tree. If the tree is not updated, then the weights of the nodes on the path can only grow, and therefore the inequality holds. If rectangles are merged, then one rectangle disappears together with all the ancestors having only this single rectangle as a leaf. The other rectangle and its ancestors up to the least common ancestor of the two merged leaves get their key ranges expanded. It follows that the surviving nodes after merging rectangles can only have their key ranges increase, and therefore the inequality holds. If a split occurs at any node at height $i$, then the degree of the node before the split is $2\Delta$. The node is split into two nodes at height $i$, each with $\Delta$ children. The weight of each of the two nodes is therefore at least $\Delta\cdot w_{i-1}\geq\Delta\cdot B\Delta^{i-1}=B\Delta^i$. Therefore, it holds that $w_i\geq B\Delta^i$.

Since the number of updates is at most $(1+\alpha)\bar{N}$, we have $B\Delta^i\leq(1+\alpha)\bar{N}$ for all nodes at height $i$, except for the root. Since by definition $2\Delta\leq B$ and $\alpha<1$, we have $\Delta^{i+1}\leq 2\bar{N}$, i.e., $i\leq\log_{\Delta}(2\bar{N})-1\leq\log_{\Delta}\bar{N}$. The height of $T$ is then at most the largest value of $i$ satisfying this inequality, plus one for the root, i.e., the height of $T$ is at most $1+\log_{\Delta}\bar{N}\leq 1+\lceil\log_{\Delta}\bar{N}\rceil=H$.

2.6 Global Rebuilding

The data structure above allows for an initial set of $\bar{N}$ keys to receive up to $\alpha\bar{N}$ persistent updates, for a constant $0<\alpha<1$. For any version $v$, we have $(1-\alpha)\bar{N}\leq N_v\leq(1+\alpha)\bar{N}$, i.e., $N_v=\Theta(\bar{N})$. Therefore, the asymptotic costs of all operations also hold with $\bar{N}$ replaced by $N_v$.

To allow for more than $\alpha\bar{N}$ updates, we create multiple copies of the data structure above using global rebuilding [29, 30]. Whenever the current data structure reaches $\alpha\bar{N}$ updates, a new data structure is created with initial set $\mathcal{S}_{v_c}$ and $\bar{N}_{\mathrm{new}}=|\mathcal{S}_{v_c}|$ (and new $H$ and $R$ parameters), with a new set of rectangles and a new $B^{\varepsilon}$-tree $T$, where all buffers are empty. We compute $\mathcal{S}_{v_c}$ by performing Range($v_c,-\infty,+\infty$) in amortized $\mathcal{O}(\frac{1}{\varepsilon}\log_B\bar{N}+K/B)=\mathcal{O}(\bar{N}/B)$ I/Os. This actualizes all open rectangles, which can now be closed. The new data structure can be built using $\mathcal{O}(\bar{N}/B)$ I/Os by a single scan of the sorted list containing $\mathcal{S}_{v_c}$. In the old data structure $\alpha\bar{N}$ updates have been performed before this rebuild happens. By amortizing the rebuild cost over these updates, the amortized cost of each update is increased by $\mathcal{O}(1/B)$ I/Os, i.e., the asymptotic amortized cost of an update is not increased. As the space usage of the new data structure is $\mathcal{O}(\bar{N}/B)$ blocks, a similar argument can be used to amortize the space usage over the updates, maintaining a space usage linear in the total number of updates. This concludes the proof of Theorem 1.
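For illustration, the epoch-switching logic can be sketched as follows (Python; `ds` and all its members are hypothetical names for the structure above, and numeric keys are assumed for the infinite sentinels):

```python
MINUS_INF, PLUS_INF = float("-inf"), float("inf")

def after_update(ds):
    """Global-rebuilding trigger (Section 2.6): the O(N_bar/B)-I/O rebuild is
    paid for by the alpha * N_bar preceding updates, i.e. O(1/B) I/Os each."""
    ds.updates_in_epoch += 1
    if ds.updates_in_epoch >= ds.alpha * ds.N_bar:
        keys = ds.range(ds.v_c, MINUS_INF, PLUS_INF)  # actualizes all open rectangles
        ds.start_new_epoch(sorted_keys=keys)          # new T, P_v, rectangles; empty buffers
        ds.updates_in_epoch = 0
```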

3 Worst-Case Bounds

In this section, we describe how to achieve worst-case I/O guarantees instead of amortized, under progressively weaker assumptions on the internal memory size $M$. Previous approaches to improving the amortized performance of ephemeral $B^{\varepsilon}$-trees, both in the randomized [7] and worst-case [16] setting, assumed at least that $M=\Omega(\frac{1}{\varepsilon}B\log_B\bar{N})=\Omega(HB)$, which allows all buffers on a path to be sorted in internal memory, i.e., $\mathcal{O}(\mathit{sort}(HB))=\mathcal{O}(H)$ I/Os. First, in Section 3.1, we show that if $M=\Omega(HB)$, our persistent structure can be deamortized without asymptotic overhead, by applying incremental global rebuilding to handle a changing $\bar{N}$ and performing the flushing along a root-to-leaf path incrementally. Then, in Section 3.2, we relax the assumption to $M=\Omega(B^{1-\varepsilon}\log_2\bar{N})$ (Theorem 2). We apply the subtracting game studied by Dietz and Raman [18] to argue that the buffer of a node can store at most $\mathcal{O}(F\log_2\Delta)$ updates towards each child, if we always recursively flush $F$ updates to the child with the most pending updates whenever a node receives $F$ updates from its parent. This represents an improvement by a factor $B^{\varepsilon}/\log_2 B$ on the assumption on the size of the internal memory. Finally, in Section 3.3, we show worst-case results when only assuming $M\geq 2B$, which introduces a small additive overhead on all operations (Theorem 3). We employ the zeroing game by Dietz and Sleator [19, Theorem 5] to avoid a multiplicative overhead for Range queries, by incrementally splitting and merging the open rectangle with the most updates.

3.1 Large Internal Memory Assumption

We first consider the case where $M=\Omega(HB)$. When actualizing a rectangle (see Section 2.3), the buffers at the $\mathcal{O}(H)$ nodes along the root-to-leaf path, with a total size of $\mathcal{O}(HB)$, are merged to produce a sorted list of updates to apply to the rectangle. As shown in Section 2.4, this can be done in amortized $\mathcal{O}(H)$ I/Os by merging the buffers top-down. In the worst case, this requires $\mathcal{O}(H^2)$ I/Os. By instead merging the buffers using an external-memory sorting algorithm, the worst-case number of I/Os can be improved to $\mathcal{O}(\mathit{sort}(HB))$. Previous approaches to improving the amortized performance of $B^{\varepsilon}$-trees in the randomized [7] and worst-case [16] settings both assumed at least that $M=\Omega(HB)$, in which case the sorting term trivially disappears by performing the sorting internally after reading the $H$ buffers into internal memory. The remaining challenge was handling flushing cascades, which occur when merging internal nodes of $T$ results in large buffers requiring many flushes in different directions. Our structure avoids this issue by never merging internal nodes, instead maintaining the height of $T$ using global rebuilding. For the remainder of this section, we assume a large internal memory of size $M=\Omega(HB)$ and describe how to achieve worst-case guarantees by incrementally performing the amortized work.

Queries.

Finding and actualizing a relevant rectangle for a query takes $\mathcal{O}(H+\mathit{sort}(HB))=\mathcal{O}(H)$ I/Os when $M=\Omega(HB)$. The worst case for a Search and a Range query is therefore $\mathcal{O}(H)$ and $\mathcal{O}((1+K/R)H)=\mathcal{O}(H+K/B)$ I/Os, respectively. Note that for a Range query, each rectangle that intersects the query, except for the leftmost and rightmost ones, reports $\Omega(R)$ keys due to the $\Omega(R)$ spanning keys in each rectangle.

Updates.

When performing an update, it may be the $\alpha\bar{N}$'th update, which triggers a global rebuild of the structure based on a new $\bar{N}$, using $\mathcal{O}(\bar{N}/B)$ I/Os, as described in Section 2.6. However, by performing the global rebuilding incrementally [29, 30] over the next $\Theta(\bar{N})$ updates, this does not increase the asymptotic worst-case number of I/Os of each update. While the new structure is being initialized, updates continue to arrive, and these must be applied to it before it can take over. By performing updates to the new structure at a sufficiently fast rate compared to the live structure, the two stay within a constant factor of each other in size until the new structure takes over. Therefore, only the I/O cost of an update without global rebuilding needs to be considered.

Updates are inserted in the root buffer of the $B^{\varepsilon}$-tree, as described in Section 2.4. By keeping the root buffer in internal memory this uses no I/Os. If the root buffer overflows, it may cause buffer overflows along a root-to-leaf path down to an open rectangle, which may then be finalized by performing an actualize operation followed by a path copy. The update therefore requires $\mathcal{O}(H)$ I/Os in total under the large internal memory assumption. However, each time the root buffer overflows, $F$ updates are removed from it, meaning this occurs at most every $F$'th update. Thus, when the root buffer overflows, we incrementally apply the update to the structure over the next $F$ updates, ensuring that $\mathcal{O}(H/F)$ I/Os are performed per update in the worst case. To not interfere with the incremental work, we place new updates in a separate buffer while it is in progress and merge them with the root buffer when it is finished. If a path copy has occurred, the root pointers of the $F$ most recent versions must be updated to the new root. Since they are stored together in an array indexed by their version identifier, this takes $\mathcal{O}(1)$ I/Os. If a query occurs while an update is being performed incrementally, we complete the update before executing the query. This does not increase the asymptotic worst-case number of I/Os for queries.

3.2 Smaller Buffers on All Paths Using the Subtraction Game

In the previous section, we showed that there is no overhead on worst-case queries and updates compared to the amortized bounds if the internal memory can hold all keys on a root-to-leaf path towards the same open rectangle. To lower the possible number of such keys, we slightly change the flushing strategy described in Section 2.4, where we only performed a flush when a buffer overflowed. Instead, for every $F$'th update, we flush along an entire root-to-leaf path, always flushing towards the child where most of the updates are going. We still flush at most $F$ keys, which preserves the property that internal nodes of $T$ contain at most $2\Delta F$ updates. In the following, we show that this flushing strategy guarantees that all buffers contain $\mathcal{O}(F\log_2\Delta)$ updates going towards the same child, and therefore also towards the same leaf. This implies that the assumption $M=\Omega(HF\log_2\Delta)=\Omega(B^{1-\varepsilon}\log_2\bar{N})$ is sufficient to achieve no overhead for worst-case queries and updates. This is a factor $\Theta(B^{\varepsilon}/\log_2 B)$ improvement over the previous smallest assumption on $M$ [16].

We can view each node as playing the subtracting game studied by Dietz and Raman [18] on the numbers of updates $x_1,x_2,x_3,\ldots,x_{2\Delta-1}$ going towards each of its at most $2\Delta-1$ children. When at most $F$ updates are flushed towards a node, and $\delta_i$ new updates are going towards the $i$'th child, the variable $x_i$ is increased by $\delta_i$. We flush towards the child $j$ where most of the updates are going, which sets the variable $x_j=\max\{x_j-F,0\}$. Following [18, Theorem 3], scaled by a factor of $F$, this guarantees $x_i=\mathcal{O}(F\log_2\Delta)$ for any $i$.
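The guarantee is easy to probe empirically with a self-contained simulation (Python; we use a random adversary for illustration, whereas the bound of [18] holds against any adversary):

```python
import random

def subtracting_game(children=31, F=32, rounds=20000, seed=42):
    """Scaled subtracting game: each round at most F unit updates arrive,
    then F are flushed from the child with the largest backlog.  The largest
    backlog ever seen stays O(F * log(children))."""
    x = [0] * children
    rng = random.Random(seed)
    worst = 0
    for _ in range(rounds):
        for _ in range(F):
            x[rng.randrange(children)] += 1     # adversary distributes F updates
        j = max(range(children), key=x.__getitem__)
        x[j] = max(x[j] - F, 0)                 # flush towards the fullest child
        worst = max(worst, max(x))
    return worst   # compare against F * log2(children)
```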

We also need to consider how merging and splitting nodes in $T$ impacts the subtracting games. Only leaves of $T$, corresponding to open rectangles, are merged. When an internal node is split, it corresponds to evenly distributing the $x_i$ variables from one game over two new games, except for one variable that is split into two new variables, each with a smaller or equal value. When a leaf, i.e., an open rectangle, is merged or split, the one or two rectangles involved are first actualized, which sets their variables to zero, a stronger operation than subtracting. Thus, the variable for a new rectangle is always zero, and variables on root-to-leaf paths to actualized rectangles may be decremented. In all cases and for all games, variables are either decremented without adding to the game, or a copy of an existing game is created where all variables in the copy are equal to or smaller in value than before. This concludes the proof of Theorem 2.

3.3 Improving Worst-Case Range Queries Using the Zeroing Game

In this section, we consider the small-memory setting with $M\geq 2B$, to overcome the theoretical limitation of the memory assumptions made in Sections 3.1 and 3.2. Actualizing a rectangle by merging the relevant updates on a root-to-leaf path to a rectangle requires $\mathcal{O}(H+\gamma)$ total I/Os, where $\gamma=\mathit{sort}(B^{1-\varepsilon}\log_2\bar{N})$. The construction from the previous section directly results in worst-case Search queries using $\mathcal{O}(H+\gamma)$ I/Os and updates using $\mathcal{O}(\frac{1}{F}(H+\gamma))$ I/Os. However, since Range queries are performed by repeatedly searching for the $\Theta(1+K/R)$ rectangles intersecting the query, the worst-case number of I/Os is $\Theta((1+K/R)(H+\gamma))=\Theta(H+\gamma+\frac{K}{B}(1+\frac{\varepsilon\log_2 B}{B^{\varepsilon}}\log_{M/B}\frac{\log_2\bar{N}}{B^{\varepsilon}}))$, notably with a multiplicative non-constant overhead on the reporting term. In this section, we describe how to guarantee Range queries in worst-case $\mathcal{O}(H+\gamma+K/B)$ I/Os when $M\geq 2B$.

The worst-case I/O cost of a Range query can be improved by merging all the buffered updates to the open rectangles intersecting the query in a top-down, level-by-level fashion, that is, by essentially actualizing all the open rectangles intersected by the Range query simultaneously. We denote the updates already present in the rectangles the partial output list. Rather than applying the buffered updates to the rectangles, we merge them with the partial output list to obtain the final output. A given query Range(v,x,y) reports $\Omega(R)$ keys from each intermediate rectangle, i.e., all rectangles intersecting the query except for the two that contain the endpoints $x$ and $y$. Thus, in $\mathcal{O}((1+K/R)H+(R+K)/B)=\mathcal{O}(H+K/B)$ I/Os we can find all the relevant rectangles and compute the partial output list. To collect the buffered updates in $T$ for the intermediate open rectangles in sorted order, we merge the updates down level-by-level. We only move down the updates to versions earlier than or equal to $v$, since only these can affect the query result. Once obtained, these updates are merged with the partial output list using linear I/Os to report the output of the Range query.

The buffered updates are stored in $T$, which contains the open rectangles at version $v_c$; however, the Range query is on the rectangles present at version $v$. Let $T_v$ denote the $B^{\varepsilon}$-tree on open rectangles at version $v$, i.e., the state of $T$ when version $v$ was created. From $T_v$ to $T$ the tree may have changed, but no later updates are relevant for the query. Thus, the total number of relevant updates does not increase from $T_v$ to $T$, and each update remains on the root-to-leaf path towards the open rectangle to which it is relevant. The relevant updates in $T$ can be collected in sorted lists ordered by level by traversing each root-to-leaf path in $T$ towards open rectangles intersecting the query, using $\mathcal{O}((1+K/R)H)$ I/Os. To bound the I/Os to move the updates down level-by-level, we show that the total number of updates is $\mathcal{O}(B^{1-\varepsilon}\log_2\bar{N}+K/H)$. To this end, we need the additional invariant that the buffer of a degree-one node is empty, which we show how to obtain below. Let $T'_v$ be the subtree of $T_v$ consisting of all nodes on root-to-leaf paths to rectangles intersecting the query. Then split $T'_v$ into two root-to-leaf paths $p_x$ and $p_y$ to $x$ and $y$, respectively, along with all the subtrees hanging off $p_x$ or $p_y$. For a node on $p_x$ (symmetrically $p_y$) of degree at least two, there may be one or more subtrees $T_{\mathrm{sub}}$ hanging off the node. Since only the nodes of $T_{\mathrm{sub}}$ with degree at least two have non-empty buffers, if $T_{\mathrm{sub}}$ has $\ell$ leaves, the number of buffered updates in $T_{\mathrm{sub}}$ is at most $\mathcal{O}((\ell-1)B)$. Thus, the number of buffered updates in $T'_v$, excluding degree-one nodes on $p_x$ and $p_y$, is $\mathcal{O}(K/H)$. A node of degree one in $T'_v$ on $p_x$ or $p_y$ may have a large degree in $T_v$, but since it only has one child in the direction of the query, due to the subtracting game it stores at most $\mathcal{O}(F\log_2\Delta)$ relevant updates. The number of degree-one nodes on $p_x$ and $p_y$ is at most $2H$, and they together contribute $\mathcal{O}(B^{1-\varepsilon}\log_2\bar{N})$ buffered updates, which we locate and sort separately using $\mathcal{O}(H+\gamma)$ I/Os. For the remaining $\mathcal{O}(K/H)$ buffered updates, we merge them level-by-level using $\mathcal{O}(H+K/B)$ I/Os. In total, the worst-case number of I/Os to perform a Range query is $\mathcal{O}(H+\gamma+K/B)$.

Empty Buffers for Degree-One Nodes.

To ensure that each node of degree one has an empty buffer, we alter the buffer capacity of nodes to scale with the degree. Let the capacity of the buffer of a node with degree $d\geq 2$ be at most $F\cdot\min\{2d,2\Delta\}$, with nodes of degree one having a buffer capacity of 0. When flushing a node, as the maximum number of updates in the buffer scales with the degree, at least $F$ updates going to the same child can be found when overflowing. When splitting a node, it must have degree $2\Delta$, resulting in the two new nodes having degree $\Delta$, which therefore does not decrease the buffer capacity, and flushing is not needed. When a child of a node is removed due to merging rectangles, the degree of the node is decreased by one. If the degree remains at least two, at most two flushes are required to get the buffer capacity within bounds. Otherwise, if the degree drops from two to one, at most four flushes are needed. To avoid cascading merges of rectangles, we do not finalize a rectangle once it receives a certain number of updates. Instead, we finalize the rectangle that has received the most updates, provided it has received at least $R$ updates. This last condition ensures a bound on the space usage. Including the initial flush from the root buffer, when a rectangle is finalized, at most $5F$ updates have been flushed into open rectangles.

Let $U_i$ denote the number of updates to the $i$'th rectangle, excluding the initial insertions. We extend the data structure with an array over all open rectangles, where index $j$ stores a blocked linked list of all rectangles for which the number of updates is $U_i=j$. Each rectangle keeps a doubly-linked pointer between the rectangle and its location in the array of lists. This allows for moving a rectangle to a new entry in the array when it receives updates, as well as finding a rectangle which has received the most updates by scanning the array.

To show that $U_i$ is bounded by $\mathcal{O}(R)$, we apply the zeroing game of Dietz and Sleator [19], using the variables $x_i=\max\{0,(U_i-R)/(5F)\}$ if rectangle $i$ is open and $x_i=0$ if it is closed. For open rectangles, $x_i$ counts the number of units of $5F$ updates received beyond the first $R$ updates. This ensures that the variables are incremented by at most 1 in total in each round, when at most 5 flushes of size at most $F$ are flushed into the open rectangles. When finalizing a rectangle, it becomes closed, which ensures that $x_i=0$, matching the zeroing step. We bound the total number of rectangles, and therefore also the number of variables, by $\bar{N}$. Following [19, Theorem 5] and the refined analysis in [13, Theorem 4], we have that for all $i$, $x_i\leq\log_2\bar{N}+1$ at all times. Consequently, it follows that $U_i\leq 5F(\log_2\bar{N}+1)+R$. It can be shown that $5F(\log_2\bar{N}+1)\leq 2R$ for $\bar{N}\geq 4$, by simplifying the inequality using $R=2\Delta F(1+\lceil\log_{\Delta}\bar{N}\rceil)$ and showing $\frac{5}{4}(1+\frac{1}{\log_2\bar{N}})\leq\frac{\Delta}{\log_2\Delta}$, for all $\Delta\geq 2$ and $\bar{N}\geq 4$. It therefore holds that each rectangle receives at most $U_i\leq 3R$ updates, due to the zeroing game. Thus, when finalizing a rectangle, at least $R$ updates have been performed. Including the updates from the buffers on the path towards the rectangle, the total number of updates applied is between $R$ and $4R$. By ensuring that each rectangle contains $[8R,16R[$ initial keys, the rebalancing operations are possible, and the spanning requirement remains satisfied.
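To spell out the last step: dividing $5F(\log_2\bar{N}+1)\leq 2R$ by $F$ and using $R/F=2\Delta(1+\lceil\log_{\Delta}\bar{N}\rceil)\geq 2\Delta\frac{\log_2\bar{N}}{\log_2\Delta}$, it suffices that
\[
\frac{5}{4}\Bigl(1+\frac{1}{\log_2\bar{N}}\Bigr)\;\leq\;\frac{\Delta}{\log_2\Delta}.
\]
For $\bar{N}\geq 4$ the left-hand side is at most $\frac{5}{4}\cdot\frac{3}{2}=1.875$, while over all $\Delta\geq 2$ the right-hand side is minimized at $\Delta=e$ with value $e\ln 2\approx 1.884$, so the inequality holds.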

An update therefore performs at most 5 flushes using $\mathcal{O}(H)$ I/Os, along with locating and finalizing a single rectangle in $\mathcal{O}(R/B)=\mathcal{O}(H)$ and $\mathcal{O}(H+\gamma)$ I/Os, respectively. Performing this operation incrementally allows updates to spend worst-case $\mathcal{O}(\frac{1}{F}(H+\gamma))$ I/Os. If a query happens while an incremental update is being performed, the incremental update is completed, using at most $\mathcal{O}(H+\gamma)$ I/Os, which does not increase the total cost of the query. When a Search query happens, similarly to the new Range query, we do not apply the relevant updates on the path to the open rectangle, to avoid queries interfering with the zeroing game. This concludes the proof of Theorem 3.

4 Discussion and Open Problems

Global rebuilding, as described in Section 2.6, allows for constructing a partially-persistent set from any sorted set in a linear number of I/Os, without recreating the set by a sequence of insertions. Symmetrically, it is possible to purge all versions of the set older than some threshold without performing all updates again. This problem was first motivated by Becker, Gschwind, Ohler, Seeger, and Widmayer [6]. As our data structure consists of multiple independent data structures covering disjoint version intervals, all data structures which only cover versions to be purged can be removed efficiently in a linear number of I/Os. For both use cases, the space usage is asymptotically linear in the size of the oldest stored set and the number of updates performed.

Further, global rebuilding allows for a crude fully persistent data structure, which supports efficient buffered updates and queries, but where cloning past versions requires a linear number of I/Os. The fully persistent data structure by Brodal, Rysgaard, and Svenning [12] allows cloning past versions in worst-case $\mathcal{O}(1)$ I/Os. They do, however, not buffer updates, which are therefore amortized and a factor $\mathcal{O}(\varepsilon B^{1-\varepsilon})$ slower than in our data structure. Our data structure is therefore better when there are many updates but few clones of past versions. Further, our data structure is simpler. It remains an open problem to design buffered fully-persistent search trees that remain efficient for clone operations.

In Section 3 we showed how to achieve worst-case bounds matching those of ephemeral $B^{\varepsilon}$-trees when $M = \Omega(B^{1-\varepsilon}\log_2 N)$. This improves the required lower bound on $M$ by a factor $\Theta(B^{\varepsilon}/\log_2 B)$ over the worst-case results for $B^{\varepsilon}$-trees by Das, Iacono, and Nekrich [16]. It remains an open problem to show a worst-case I/O lower bound depending on $M$, or, alternatively, to find a structure with worst-case I/O guarantees matching the amortized I/O bounds for $M = 2B$.

References

  • [1] Alok Aggarwal and Jeffrey S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, September 1988. doi:10.1145/48529.48535.
  • [2] Lars Arge. The buffer tree: A technique for designing batched external data structures. Algorithmica, 37(1):1–24, September 2003. doi:10.1007/s00453-003-1021-x.
  • [3] Lars Arge, Andrew Danner, and Sha-Mayn Teh. I/O-efficient point location using persistent B-trees. ACM Journal of Experimental Algorithmics, 8:1.2–es, December 2004. doi:10.1145/996546.996549.
  • [4] Lars Arge, Darren Erik Vengroff, and Jeffrey Scott Vitter. External-memory algorithms for processing line segments in geographic information systems. Algorithmica, 47:1–25, 2007. doi:10.1007/s00453-006-1208-z.
  • [5] Rudolf Bayer and Edward M. McCreight. Organization and maintenance of large ordered indices. Acta Informatica, 1:173–189, 1972. doi:10.1007/BF00288683.
  • [6] Bruno Becker, Stephan Gschwind, Thomas Ohler, Bernhard Seeger, and Peter Widmayer. An asymptotically optimal multiversion B-tree. The VLDB Journal, 5(4):264–275, 1996. doi:10.1007/s007780050028.
  • [7] Michael A. Bender, Rathish Das, Martín Farach-Colton, Rob Johnson, and William Kuszmaul. Flushing without cascades. In Shuchi Chawla, editor, Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 650–669. Society for Industrial and Applied Mathematics, 2020. doi:10.1137/1.9781611975994.40.
  • [8] Michael A. Bender, Martin Farach-Colton, Jeremy T. Fineman, Yonatan R. Fogel, Bradley C. Kuszmaul, and Jelani Nelson. Cache-oblivious streaming B-trees. In Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 81–92. Association for Computing Machinery, 2007. doi:10.1145/1248377.1248393.
  • [9] Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, and Yang Zhan. An introduction to B-trees and write-optimization. ;login:, 40(5), 2015. URL: https://www.usenix.org/publications/login/oct15/bender.
  • [10] Gerth Stølting Brodal and Rolf Fagerberg. Lower bounds for external memory dictionaries. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 546–554. Society for Industrial and Applied Mathematics, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644201.
  • [11] Gerth Stølting Brodal and Rolf Fagerberg. On the limits of cache-obliviousness. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, pages 307–315. Association for Computing Machinery, 2003. doi:10.1145/780542.780589.
  • [12] Gerth Stølting Brodal, Casper Moldrup Rysgaard, and Rolf Svenning. External memory fully persistent search trees. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1410–1423. Association for Computing Machinery, 2023. doi:10.1145/3564246.3585140.
  • [13] Gerth Stølting Brodal, Casper Moldrup Rysgaard, and Rolf Svenning. Buffered partially-persistent external-memory search trees. CoRR, abs/2503.08211, 2025. doi:10.48550/arXiv.2503.08211.
  • [14] Adam L. Buchsbaum, Michael Goldwasser, Suresh Venkatasubramanian, and Jeffery R. Westbrook. On external memory graph traversal. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 859–860. Society for Industrial and Applied Mathematics, 2000. URL: http://dl.acm.org/citation.cfm?id=338219.338650.
  • [15] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 16344–16359. Curran Associates, Inc., 2022. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/67d57c32e20fd0a7a302cb81d36e40d5-Paper-Conference.pdf.
  • [16] Rathish Das, John Iacono, and Yakov Nekrich. External-memory dictionaries with worst-case update cost. In Sang Won Bae and Heejin Park, editors, 33rd International Symposium on Algorithms and Computation, ISAAC 2022, December 19-21, 2022, Seoul, Korea, volume 248 of LIPIcs, pages 21:1–21:13. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.ISAAC.2022.21.
  • [17] Erik D. Demaine, John Iacono, and Stefan Langerman. Retroactive data structures. ACM Trans. Algorithms, 3(2):13–es, May 2007. doi:10.1145/1240233.1240236.
  • [18] Paul F. Dietz and Rajeev Raman. Persistence, randomization and parallelization: On some combinatorial games and their applications (abstract). In Frank K. H. A. Dehne, Jörg-Rüdiger Sack, Nicola Santoro, and Sue Whitesides, editors, Algorithms and Data Structures, Third Workshop, WADS ’93, Montréal, Canada, August 11-13, 1993, Proceedings, volume 709 of Lecture Notes in Computer Science, pages 289–301. Springer, 1993. doi:10.1007/3-540-57155-8_256.
  • [19] Paul F. Dietz and Daniel Dominic Sleator. Two algorithms for maintaining order in a list. In Alfred V. Aho, editor, Proceedings of the 19th Annual ACM Symposium on Theory of Computing, 1987, New York, New York, USA, pages 365–372. Association for Computing Machinery, 1987. doi:10.1145/28395.28434.
  • [20] James R. Driscoll, Neil Sarnak, Daniel D. Sleator, and Robert E. Tarjan. Making data structures persistent. Journal of Computer and System Sciences, 38(1):86–124, 1989. doi:10.1016/0022-0000(89)90034-2.
  • [21] Herbert Edelsbrunner and Mark H. Overmars. Batched dynamic solutions to decomposable searching problems. Journal of Algorithms, 6(4):515–542, 1985. doi:10.1016/0196-6774(85)90030-6.
  • [22] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-oblivious algorithms. ACM Transactions on Algorithms, 8(1), January 2012. doi:10.1145/2071379.2071383.
  • [23] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, pages 47–57. Association for Computing Machinery, 1984. doi:10.1145/602259.602266.
  • [24] Michael Hamann, Ulrich Meyer, Manuel Penschuck, Hung Tran, and Dorothea Wagner. I/O-efficient generation of massive graphs following the LFR benchmark. ACM Journal of Experimental Algorithmics, 23, August 2018. doi:10.1145/3230743.
  • [25] William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh, Michael A. Bender, Martin Farach-Colton, Rob Johnson, Bradley C. Kuszmaul, and Donald E. Porter. BetrFS: Write-optimization in a kernel file system. ACM Transactions on Storage, 11(4), November 2015. doi:10.1145/2798729.
  • [26] Curtis P. Kolovson and Michael Stonebraker. Indexing techniques for historical databases. In Proceedings of the Fifth International Conference on Data Engineering, February 6-10, 1989, Los Angeles, California, USA, pages 127–137. IEEE Computer Society, 1989. doi:10.1109/ICDE.1989.47208.
  • [27] Andrea Lancichinetti and Santo Fortunato. Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Physical Review E, 80:016118, July 2009. doi:10.1103/PhysRevE.80.016118.
  • [28] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78:046110, October 2008. doi:10.1103/PhysRevE.78.046110.
  • [29] Mark H. Overmars. The design of dynamic data structures, volume 156. Springer Science & Business Media, 1983. doi:10.1007/BFb0014927.
  • [30] Mark H. Overmars and Jan van Leeuwen. Dynamization of decomposable searching problems yielding good worst-case bounds. In Peter Deussen, editor, Theoretical Computer Science, 5th GI-Conference, Karlsruhe, Germany, March 23-25, 1981, Proceedings, volume 104 of Lecture Notes in Computer Science, pages 224–233. Springer, 1981. doi:10.1007/BFb0017314.
  • [31] Neil Sarnak and Robert E. Tarjan. Planar point location using persistent search trees. Communications of the ACM, 29(7):669–679, July 1986. doi:10.1145/6138.6151.
  • [32] Peter J. Varman and Rakesh M. Verma. An efficient multiversion access structure. IEEE Transactions on Knowledge and Data Engineering, 9(3):391–409, 1997. doi:10.1109/69.599929.