Abstract 1 Introduction 2 Definitions 3 Informal Technical Roadmap References

Decentralized Data Archival: New Definitions and Constructions

Elaine Shi111Authors are listed in random order. ORCID Computer Science Department and Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA Rose Silver ORCID Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA Changrui Mu ORCID Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA
Abstract

We initiate the study of a new abstraction called incremental decentralized data archival (𝗂𝖣𝖣𝖠). Specifically, imagine that there is an ever-growing, massive database such as a blockchain, a comprehensive human knowledge base like Wikipedia, or the Internet archive. We want to build a decentralized archival system for such datasets to ensure long-term robustness and sustainability. We identify several important properties that an 𝗂𝖣𝖣𝖠 scheme should satisfy. First, to promote heterogeneity and decentralization, we want to encourage even weak nodes with limited space (e.g., users’ home computers) to contribute. The minimum space requirement to contribute should be approximately independent of the data size. Second, if a collection of nodes together receive rewards commensurate with contributing a total of m blocks of space, then we want the following reassurances: 1) if m is at least the database size, we should be able to reconstruct the entire dataset; and 2) these nodes should actually be committing roughly m space in aggregate – specifically, when m is much larger than the data size, these nodes cannot store only one copy of the database, and be able to impersonate arbitrarily many pseudonyms and get unbounded rewards.

We propose new definitions that mathematically formalize the aforementioned requirements of an 𝗂𝖣𝖣𝖠 scheme. We also devise an efficient construction in the random oracle model which satisfies the desired security requirements. Our scheme incurs only O~(1) audit cost, as well as O~(1) update cost for both the publisher and each node, where O~() hides polylogarithmic factors. Further, the minimum space provisioning required to contribute is as small as polylogarithmic.

Our construction exposes several interesting technical challenges. Specifically, we show that a straightforward application of the standard hierarchical data structure fails, since both our security definition and the underlying cryptographic primitives we employ lack the desired compositional guarantees. We devise novel techniques to overcome these compositional issues, resulting in a construction with provable security while still retaining efficiency. Finally, our new definitions also make a conceptual contribution, and lay the theoretical groundwork for the study of 𝗂𝖣𝖣𝖠. We raise several interesting open problems along this direction.

Keywords and phrases:
Decentralized Data Archival
Copyright and License:
[Uncaptioned image] © Elaine Shi, Rose Silver, and Changrui Mu; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Security and privacy Mathematical foundations of cryptography
Related Version:
Full Version: https://eprint.iacr.org/2025/969 [37]
Acknowledgements:
We thank Ashrujit Ghoshal for helpful discussion on techniques for proving time-space tradeoff. We also thank Hao Chung for helpful discussons on data availability protocols.
Funding:
This work is in part supported by NSF awards 2212746, 2044679, a Packard Fellowship, and a generous gift from the late Nikolai Mushegian. CM is additionally supported by the National Research Foundation, Singapore, under its NRF Fellowship program, award number NRF-NRFF14-2022-0010. RS is additionally supported by the Carnegie Mellon University CyLab Presidential Fellowship.
Editor:
Shubhangi Saraf

1 Introduction

We consider the problem of building a decentralized data archival system for evolving databases, henceforth called incremental Decentralized Data Archival (𝗂𝖣𝖣𝖠). One primary motivation comes from blockchains. Today, running an Ethereum archival node that backs up the historical transaction logs requires 2TB to 12TB of storage, and the space requirement will continue to grow. A key challenge is how to incentivize nodes to archive the historical logs. In particular, consensus participants only need to maintain the up-to-date state (only 100-200GB today) to remain functional. As a result, the consensus rewards alone do not provide sufficient incentive for storing the entire transactional log. Besides blockchains, 𝗂𝖣𝖣𝖠 schemes can also be used to build a decentralized backup of the Internet archive (e.g., archive.org, hundreds of petabytes in size), or an encyclopedia of human knowledge (e.g., Wikipedia, 10+ TB including all history and media).

In an 𝗂𝖣𝖣𝖠 scheme, each node will store a (carefully chosen) shard of the dataset, and this shard can evolve over time as the database grows. Furthermore, a node can get remunerated for its contribution through periodical audits. Informally speaking, if a node passes the audit, it means that it has not only committed the purported amount of space S, but is also using this space to store actual data and not junk.

Desiderata.

A dream 𝗂𝖣𝖣𝖠 scheme should satisfy the following desiderata:

  1. 1.

    Permissionless and low barrier to entry. We want an open (i.e., permissionless) system, where anyone can join and contribute, using the spare disk space they have on their home computers, without requiring special hardware provisioning. Specifically, this means that 1) the entire data size n can be significantly larger than the any node’s available space S; and 2) as the database grows, the nodes need not provision new disk space to continue participation. Philosophically, a low barrier to entry encourages more users to contribute, thus leading to increased decentralization, heterogeneity, and robustness.

  2. 2.

    Approximate best-possible recoverability. The strongest recoverability guarantee one can hope for is the following: if any subset of nodes can successfully pass the audit and moreover, their total claimed space is sufficient to hold the entire dataset, then it is possible to reconstruct the dataset in full. However, if each node stores some random shard of the dataset, then strict best-possible recoverability may be too strong to achieve. Therefore, we relax the notion to an approximate version by allowing for ϵ slack, for an arbitrarily small constant ϵ(0,1). Specifically, we require that if any μ nodes–each claiming to have committed S space–can all pass the audit, and μS(1+ϵ)n, then we can successfully reconstruct the entire dataset.

    The ϵ-best-possible-recoverability notion can also be further generalized to require that if any μ nodes each with purported S space can pass the challenge, then we can extract roughly min(n,(1ϵ)μS) amount of useful entropy (assuming a randomly chosen database).

  3. 3.

    Replication security. To get α times the fair rewards, a node must commit at least (αϵ)S blocks of space, where ϵ(0,1) is an arbitrarily small constant. This definition ensures that a powerful node with ample space cannot store only one copy of the database, and be able to impersonate arbitrarily many nodes and request unbounded rewards. Instead, we want to ensure that if a node is obtaining rewards commensurate with αn copies of the database, it has indeed dedicated roughly αn amount of space. Inspired by prior works [23, 22, 34], our replication security notion implies an ϵ-Nash equilibrium for rational nodes, that is, a node cannot earn noticeably more rewards if it deviates from honest behavior.

The combination of the above requirements necessitates a strong security definition. In particular, due to the permissionless nature, it could be that the μ nodes that can pass the audit (in the approximate best-possible recoverability or the replication security notions) are in fact pseudonyms controlled by the same adversary. To handle this challenge, our formal definitions implicitly require that security must hold even if all participating storage nodes are adversarially controlled.

Such an adversary can mount powerful attacks to gain an advantage, and our scheme must defend against such attacks. For example, if each node’s shard assignment is dependent on its identity (e.g., public key), then an adversary can choose a set of identities to maximally overlap the blocks assigned to the adversarial identities. This may allow the adversary to claim k times of the fair reward without committing k times the required space. The adversary can also maliciously select identities to censor specific blocks, thus undermining availability.

Why prior works fail to solve 𝗶𝗗𝗗𝗔.

Although our 𝗂𝖣𝖣𝖠 abstraction may bear superficial resemblance to known abstractions such as proofs of retrievability (PoR) [30, 32, 35, 21, 11, 4, 36, 38, 13], data availability sampling (DAS) [27, 28, 14, 3, 26], proofs of replication (PoRep) [22, 22, 34]222In this paper, we use PoR to refer to proofs of retrievability, and we use PoRep or “replication encoding” to refer to proofs of replication., and verifiable information dispersal (VID) [24, 33, 7, 41, 12, 29], none of these known abstractions are adequate for solving the 𝗂𝖣𝖣𝖠 problem. To the best of our knowledge, all prior works make one or more of the following (implicit) assumptions:

  • The node is being audited for storing the entire database (or block), and not a piece of the database – an assumption made by prior works on PoR [30, 32, 35, 21, 11, 4, 36, 38, 13], DAS [27, 28, 14], and PoRep [22, 22, 34].

  • A separate instance of the scheme is applied on a per-block basis – an assumption made by prior works on DAS [27, 28, 14], and VID [24, 33, 7, 41, 12, 29]. In Section 1.2, we argue why this “separate instance per block” paradigm cannot satisfy our requirements.

  • A threshold fraction of the players must be honest – an assumption typically made by the VID line of work [24, 33, 7].

  • The database is static. The aforementioned approach of applying a separate instance per-block can also be viewed as having many small, static databases.

We provide a more detailed comparison with the related work in Section 1.2.

1.1 Our Results and Contributions

New definitions.

To the best of our knowledge, we are the first to initiate a formal treatment of the 𝗂𝖣𝖣𝖠 problem. We provide formal definitions that mathematically capture the aforementioned intuitive requirements of approximate best-possible recoverability and replication security. Importantly, as mentioned, our definitions implicitly require that security hold even when the adversary controls all participating pseudonyms. In particular, we stress that “honest majority” style assumptions are not a great fit in a decentralized/permissionless setting when an adversary can generate arbitrarily many pseudonyms.

Efficient construction.

We construct an efficient 𝗂𝖣𝖣𝖠 scheme in the random oracle model that supports append-only updates. Our scheme achieves low barrier to entry: the minimal space provisioning required to contribute is only polylogarithmic. Furthermore, the amortized update cost is logarithmic (or slightly more than logarithmic) for each node as well as the publisher. Importantly, we prove the security of our construction even when all the pseudonyms in the system are controlled by the same adversary.

Below, let λ be the security parameter, let B be the number of bits per block, let S be the space provisioning per storage node, let N be the maximum database size, and let n0 be the initial database size. Note that S, N, and n0 are measured in the number of blocks.

Theorem 1 (Main theorem).

Assume the random oracle model. Let ϵ(0,1) be an arbitrarily small constant. Further, suppose B𝗉𝗈𝗅𝗒(λ)logλ and S𝗉𝗈𝗅𝗒(λ)logλ for some suitable polylogarithmic function, and n0max(S,λ). There exists an 𝗂𝖣𝖣𝖠 scheme that satisfies ϵ-best-possible recoverability as well as ϵ-replication security, with the following performance bounds where the costs are amortized over Nn0 number of updates, and the ω(1) term represents an arbitrarily small super-constant function in λ:

  • Amortized per-node download bandwidth: each update incurs BO(1+Sloglogλ/N) download bandwidth which is simply O(B) for S=O(N/loglogλ);

  • Amortized per-node computation: each update incurs O(BlogN)ω(1) node computation for some arbitrarily small super-constant function ω(1).

  • Publisher computation: the publisher pays BeO(1)/ϵlogN computation per update to maintain its data structure;

  • Audit cost: BlogλlogNω(1).

We stress that under our security requirements, it is inevitable that each node must incur at least constant cost per update. Since otherwise, if 1% of the nodes are not aware of the new update, the adversary can erase the other 99% of nodes and cause this new block to be lost, even if the remaining 1% nodes actually have enough space to store the entire dataset. In this sense, our per-update costs are (nearly) optimal.

Our scheme can also be easily extended to a setting where nodes have heterogeneous space provisioning, provided that the minimum space per node S is at least polylogarithmic. In this non-uniform space setting, a node with kS need not incur k times the update and audit costs – it still enjoys the same update and audit costs as stated above. See Section 7.1 of the full version for more details.

Technical highlight.

We first show how to combine techniques from PoR and PoRep in a non-blackbox fashion to get a decentralized data archival scheme for a static database (see Section 3.1 here and Section 5 of the full version). The most naïve way to get a dynamic scheme is to rerun the static scheme upon every update, but the cost per update would be linear in the data size. The key question is how to make the scheme dynamic without this prohibitive cost.

At first sight, it might be tempting to think that directly applying the standard hierarchical data structure of Bentley and Saxe [9] can turn any static scheme into a dynamic one. Recall that the hierarchical data structure divides the dataset into logarithmically many levels, of size 1,2,4,,n, respectively, where the smaller levels contain the fresher data blocks. It then runs a separate instance of the static scheme per level. Each level of size 2 needs updating only every 2 steps, and thus the amortized update cost is logarithmic.

Unfortunately, we observe this approach does not generically work for any cryptographic scheme. In our case, there are two reasons why a straightforward application of the hierarchical data structure fails:

  1. 1.

    Our security definition lacks compositional guarantees. Our notion of ϵ-best-recoverability includes an implicit accounting requirement: data recovery is only possible if the total storage provisioned by all nodes that pass the audit slightly exceeds the size of the dataset. However, when we have multiple instances of the static scheme – each corresponding to one of the logarithmically many levels – this condition is not necessarily met in a synchronized fashion across all levels. As a result, even if each instance individually satisfies ϵ-best-recoverability, their combined system may fail to satisfy the same notion globally.

  2. 2.

    The underlying cryptography lacks compositional guarantees. We make use of a PoRep scheme to compute a replication encoding for each level in the hierarchical data structure. To formally prove security, we need the underlying PoRep scheme to satisfy a certain form of adaptive sequential composability, that is, we want the PoRep’s security to hold even the adversary may choose some instance’s data in a way that depends on another instance’s replication encoding. Unfortunately, known PoRep constructions [23, 22, 34] do not provide the desired compositional guarantees.

To solve the first challenge, we devise a new space allocation scheme to allocate a node’s space among the multiple levels – see Section 3.2 for details. For the second challenge, we are not aware of any approach to extend the proofs in earlier works [23, 22, 34] to satisfy the desired adaptive sequential composition. Specifically, existing techniques for proving space-time tradeoffs through direct incompressibility arguments are highly involved and apply to extremely limited settings [19, 20]. While some other proof techniques [39, 18, 1, 25, 2] have been shown to prove space-time tradeoffs, they do not produce meaningful results in our setting. Instead, we devise a method to side-step the lack of composition of the underlying PoRep. Specifically, we modify our construction and force a storage node to locally recompute the replication encodings of all levels upon every update, using the new digest of the entire database to seed the underlying random oracle used in the PoRep scheme. We show that with this modification, we can prove security even when the underlying PoRep scheme does not provide the desired compositional guarantees.

Unfortunately, this modification also incurs significant extra costs. Specifically, each storage node would now have to pay at least BS cost per update to recompute the replication codes of all levels, where B is the block size and S is the blocks of space allocated by a node. In Section 3.4, we devise some additional algorithmic tricks to avoid this extra cost blowup, and bring the amortized update cost back down, to O~(1)B.

Philosophical discussions: separate archival and retrieval services.

In our definitions, we adopt the same philosophy suggested in a line of prior works [38, 7]. Specifically, we do not aim to support efficient read (by index) in the 𝗂𝖣𝖣𝖠 abstraction itself. This is a deliberate choice, as efficient (authenticated) read can easily be handled with a separate (possibly distributed) retrieval service provider who may simply store a cleartext copy of the dataset along with the Merkle openings [7, 38], allowing users request any specific block. A separate reward system can be used to incentivize the retrieval provider. Moreover, the retrieval service can be designed to optimize efficiency without worrying about redundancy and robustness. Philosophically, by decoupling the problem of providing retrieval from that of data archival, this definitional approach expands the design space and allows for more efficient constructions. For example, our construction can leverage the more efficient erasure codes rather than locally decodable codes.

Open questions.

Our work raises several natural open questions. First, although we manage to side-step the underlying PoRep’s lack of composition, our work nonetheless leaves open the following interesting question: How can we design a PoRep scheme with the desired adaptive sequential composition property? Likely the reason why the prior works [23, 22, 34] never considered this natural compositional notion is exactly because they (implicitly) considered a setting with a static database, where every node has ample space to store the entire database. A related open question is whether we can design an 𝗂𝖣𝖣𝖠 scheme that is itself composable. Another theoretically interesting question is whether the random oracle needed for the shard selection can be avoided. On the practical front, it would be interesting to devise a practical variant of our construction. Specifically, for the blockchain context, instantiating the publisher without trust using efficient Incrementally Verifiable Computation (IVC) would be highly relevant – see Section 7.2 of the full version for details. Our paper focuses on append-only updates, so a natural direction is to extend the scheme to support other types of updates.

Roadmap.

In the following subsection, we will describe related work in more detail. In Section 2, we will define the iDDA problem and corresponding security definitions more formally. In the last section of this article, we will present an informal technical roadmap describing our constructions.

1.2 Related Work

We now explain in more detail why our 𝗂𝖣𝖣𝖠 abstraction is different in nature from other abstractions that have been studied in the past.

Why “one instance per block” cannot satisfy our requirements.

While the prior works on Data Availability Sampling (DAS) [27, 28, 14, 3, 26] and and Verifiable Information Dispersal (VID) [24, 33, 7, 41, 12, 29] may bear superficial resemblance to our problem at first sight, these works are of fundamentally different nature, partly because they (implicitly) assume that a separate instance of the scheme will be applied on a per-block basis.

We argue that the “separate instance per block” approach cannot satisfy our requirements. Under this approach, either almost all nodes must store some (encoded) fragments of every block, or only a subset of the nodes are responsible for a block. The former case necessitates per-node storage that is linear in the dataset size, thus violating the “low barrier to entry” requirement; whereas the latter case is prone to a selective censorship attack if all nodes responsible for a block become corrupted. Another way to see this is that even if μ nodes can pass the audit and their purported total space exceeds the data size, we still may not be able to recover the dataset, since the identities of these nodes can be adversarially chosen such that none of them is responsible for storing a particular block.

Next, we review the prior works on PoR, DAS, PoRep, and VID one by one, and give more reasons why all of them are of fundamentally different nature from our 𝗂𝖣𝖣𝖠 abstraction, despite bearing some superficial resemblance at first sight.

Proofs of retrievability and data-availability sampling.

In proofs of retrievability (PoR) [30, 32, 35, 21, 11, 4, 38, 13, 36], an untrusted node can prove that it is indeed correctly storing the data it is asked to store, and that no data loss has occurred. The security definition requires that if a node can successfully pass the audit, then we can extract the entire dataset by rewinding the node and feeding it with many different challenges. A couple works have explored how to extend PoR schemes to support a dynamically evolving database [38, 13].

Data availability sampling [27, 28, 14, 10, 17] can be viewed as a strengthening of PoR. Specifically, in PoR, we assume that the committer of the data is trusted, whereas in DAS, the committer can be adversarial. In comparison with PoR, DAS additionally requires that even when the commitment to the data is adversarially chosen, if a node can pass the audit, we must be able to extract some dataset consistent with the commitment by rewinding the node and feeding it with many different challenges.

Due to historical reasons, PoR was studied typically with the cloud setting in mind, where a client outsources a dataset to a powerful but untrusted cloud server capable of storing the entire dataset. By contrast, the DAS abstraction was proposed in a blockchain context, where blocks are proposed by untrusted block producers in the underlying the consensus protocol. Lightweight consensus clients want to ensure that the data block is available, without necessarily downloading the entire block. It is implicitly assumed that a separate DAS instance will be applied to each block being produced. Because block producers are untrusted, it is crucial that security hold even when the committer is malicious.

Clearly, neither PoR nor DAS solve our 𝗂𝖣𝖣𝖠 problem for two main reasons. First, the approach of spawning a separate instance of the scheme per block inherently requires each storage node to maintain space that is linear in the size of the database. This directly violates our “low barrier to entry” requirement. Second, neither PoR nor DAS provide replications security. Specifically, a node with ample space can store just one copy of the data, and yet pretend to be arbitrarily many pseudonyms and earn unbounded rewards.

Verifiable information dispersal.

In verifiable information dispersal (VID) [24, 33, 7], a potentially malicious party encodes some data string and distributes the encoded fragments among a set of nodes. At the end, the nodes can interact to determine whether the original block is available. VID’s security relies on a threshold number of honest nodes, making it unsuitable for a permissionless setting in which the adversary can control arbitrarily many pseudonyms. Like DAS, the VID abstraction was also proposed in a blockchain consensus context, where the committer is a possibly malicious block producer. Consequently, it is typically assumed that a separate VID instance is applied to disperse each block, thus leading to a linear space requirement per node.

Proofs of replication.

Proofs of replication (PoRep) [23, 22, 34] guarantee that if a node can pass the audit purporting to have allocated αn amount of space where n is the data size, then the following are guaranteed: 1) the node must indeed be consuming (αϵ)n amount of space; and 2) the space is indeed used to store useful data. Existing works on PoRep [23, 22, 34] typically consider a static database and assume that the storage node can store the entire database. PoRep (over unencoded data) also does not guarantee extraction of the entire data when the storage provider can pass the audit.

Polynomial commitment.

In our paper, we commit to some data string by computing an erasure code and a vector commitment over the erasure code. We use a constant redundancy for the erasure code (dependent on ϵ). This way, the publisher can simply cache all the opening proofs cheaply. This way, when an update occurs or upon first joining, the storage node only needs to download some data from the publisher, and the publisher need not perform any online computation.

Alternatively, we can use a polynomial commitment scheme with arbitrarily large redundancy (e.g., as large as the underlying field size) [31, 40, 8]. Unfortunately, this approach has the drawback that the publisher must compute the opening proofs on the fly, and in a non-batched setting, the computation cost is at least linear in the database size using known approaches [31, 40, 8]. In comparison, our approach has only BO~(1) amortized publisher cost per update.

2 Definitions

In this section, we formally define the 𝗂𝖣𝖣𝖠 problem and corresponding security definition. Throughout the rest of the paper, we use Σ to denote some finite alphabet. We will treat the database 𝖣𝖡Σn as containing n blocks, where each block belongs to Σ. We often use the notation B:=log2|Σ| to denote the block size. We treat B as a global parameter, so we do not carry it around in the definitions below.

2.1 Syntax of the iDDA Problem

Consider an evolving database whose length increases as new blocks get added over time. An incremental Decentralized Data Archival (𝗂𝖣𝖣𝖠) scheme is a suite of algorithms and protocols involving a data publisher, a set of storage nodes, and an auditor. We now describe the syntax below:

  • 𝖼𝗋𝗌𝐒𝐞𝐭𝐮𝐩(1λ): a setup algorithm that takes as input the security parameter 1λ and outputs a common reference string 𝖼𝗋𝗌.

  • (ϕ,𝗌𝗍0)𝐈𝐧𝐢𝐭(𝖼𝗋𝗌,S,𝖣𝖡): an initialization algorithm that is executed once upfront by the data publisher. The algorithm takes as input the common reference string 𝖼𝗋𝗌, a parameter S that denotes the blocks of space provisioned by each storage node, and an initial database 𝖣𝖡Σn0 of length n0. The algorithm outputs a public digest denoted ϕ and updates the data publisher’s internal state to 𝗌𝗍0.

  • (𝗌𝗍0,𝗌𝗍𝑖𝑑)𝐉𝐨𝐢𝐧(𝗌𝗍0,𝑖𝑑): a protocol between the data publisher whose starting state is 𝗌𝗍0 and a storage node with unique identifier 𝑖𝑑. At the end of the protocol, the storage node receives state 𝗌𝗍𝑖𝑑, and the data publisher’s state 𝗌𝗍0 is updated.

  • (ϕ,𝗌𝗍0,{𝗌𝗍𝑖𝑑}𝑖𝑑𝖨𝖣𝗌𝖾𝗍)𝐔𝐩𝐝𝐚𝐭𝐞((𝗌𝗍0,𝗎𝗉𝖽),{𝗌𝗍id}𝑖𝑑𝖨𝖣𝗌𝖾𝗍): a protocol between the data publisher whose current state is 𝗌𝗍0 and who receives an update 𝗎𝗉𝖽 as input, and a set of storage nodes whose identities form the set 𝖨𝖣𝗌𝖾𝗍 and whose current states are {𝗌𝗍id}𝑖𝑑𝖨𝖣𝗌𝖾𝗍. At the end of the protocol, the data publisher’s internal state 𝗌𝗍0 and the nodes’ internal states {𝗌𝗍𝑖𝑑}𝑖𝑑𝖨𝖣𝗌𝖾𝗍 are updated, and everyone receives an updated public digest ϕ.

  • b𝐀𝐮𝐝𝐢𝐭((𝖼𝗋𝗌,ϕ,𝑖𝑑),𝗌𝗍𝑖𝑑): a protocol between a storage node with identity 𝑖𝑑 and state 𝗌𝗍𝑖𝑑 and an auditor whose input is (𝖼𝗋𝗌,ϕ,𝑖𝑑). At the end of the protocol, the auditor outputs either 𝖺𝖼𝖼𝖾𝗉𝗍 or 𝗋𝖾𝗃𝖾𝖼𝗍, and the storage node outputs nothing, and its internal state is unchanged.

Without loss of generality, we may assume that the security parameter λ is included in 𝖼𝗋𝗌, in the data publisher’s internal state, and in each storage node’s internal state. In general, the auditor can be a different entity from the publisher. For example, as mentioned in Section 7.2 of the full version of the paper, in a blockchain context, the auditor is likely the blockchain such that nodes can get rewarded for passing the audit. On the other hand, the publisher is a separate service provider that maintains hash digests of some data structure built over the blockchain data. Section 7.2 of the full version describes how to remove the trust in the publisher using IVC.

In practice, it might be desirable to periodically invoke the 𝐀𝐮𝐝𝐢𝐭 protocol, and randomize the time at which it is invoked. This can thwart attacks where the adversary restores its state right before the 𝐀𝐮𝐝𝐢𝐭 and deletes the state again afterwards – see the end of Section 2.3 for more discussions.

Correctness.

We now define the correctness property of an 𝗂𝖣𝖣𝖠 scheme. At a high level, correctness means that even if there are corrupt nodes in the system, honest nodes can always pass the audit with probability 1. More formally, for any λ, S, any initial 𝖣𝖡, any adversary 𝒜, the following experiment outputs 1 with probability 1:

  • Initialize. Call 𝖼𝗋𝗌𝐒𝐞𝐭𝐮𝐩(1λ), and ϕ,𝗌𝗍0𝐈𝐧𝐢𝐭(𝖼𝗋𝗌,S,𝖣𝖡).

  • Queries. The adversary can adaptively issue the following queries, where each query is one of the following:

    • Corrupt. 𝒜 declares some identity 𝑖𝑑 as corrupt. If 𝑖𝑑 has already joined earlier, give its private state 𝗌𝗍𝑖𝑑 to 𝒜.

    • Join. 𝒜 specifies an identity 𝑖𝑑. If the identity 𝑖𝑑 has previously been corrupted, the publisher performs the 𝐉𝐨𝐢𝐧 protocol with 𝒜 who acts on behalf of 𝑖𝑑. Otherwise, spawn an honest identity 𝑖𝑑, and have the publisher perform the 𝐉𝐨𝐢𝐧 protocol with 𝑖𝑑. If 𝑖𝑑 has previously been spawned, simply ignore the existing state and respawn the node.

    • Update. 𝒜 specifies an updated block 𝗎𝗉𝖽, a set of identities 𝖨𝖣𝖲𝖾𝗍. It is required that 𝖨𝖣𝖲𝖾𝗍 is chosen among the identities that have joined, and moreover, all honest identities that have been joined must belong to 𝖨𝖣𝖲𝖾𝗍. Now, the 𝐔𝐩𝐝𝐚𝐭𝐞 protocol is invoked with 𝗎𝗉𝖽, involving the publisher and 𝖨𝖣𝖲𝖾𝗍. Note that 𝒜 will act on behalf of any corrupt identity in 𝖨𝖣𝖲𝖾𝗍.

  • Challenge. 𝒜 specifies an identity 𝑖𝑑 that has joined and remains honest. Now, the honest 𝑖𝑑 engages in an 𝐀𝐮𝐝𝐢𝐭 protocol with the auditor who receives the up-to-date ϕ. The experiment outputs 1 if the auditor accepts; else it outputs 0.

 Remark 2 (Additional desirable properties of our construction).

While the above definitions aim to be more general, the constructions proposed in this paper enjoy some additional nice properties: 1) our security notions defined below are respected even when the adversary can see the publisher’s state 𝗌𝗍0; 2) during the 𝐉𝐨𝐢𝐧 protocol, the newly joining node simply downloads some portion of the publisher’s state 𝗌𝗍0, and the 𝐉𝐨𝐢𝐧 protocol does not alter 𝗌𝗍0; and 3) during the 𝐔𝐩𝐝𝐚𝐭𝐞 protocol, the publisher updates its state 𝗌𝗍0 based on the incoming block, and each storage node then downloads some necessary parts of the new 𝗌𝗍0.

In particular, these desirable properties make it possible for us to instantiate the publisher without any trust by relying on an IVC scheme, provided that there is a trusted hash digest of the original dataset (e.g., coming from the blockchain’s consensus layer) – see Section 7 of the full version for more details.

We now proceed to the security definitions. Our security definition has two components: approximate best-possible recoverability and replication security. We describe both components in detail below.

2.2 Security Definition: Approximate Best-Possible Recoverability

2.2.1 Intuition of the Definition

Below, we will first define a security game denoted 𝖱𝖾𝖼𝗏𝖤𝗑𝗉𝗍 that allows an adversary 𝒜 to interact with the 𝗂𝖣𝖣𝖠 scheme. After obtaining the common reference string 𝖼𝗋𝗌, the adversary 𝒜 is allowed to pick an initial database of its choice. Then, at any point in time, the adversary can 1) ask the challenger to run the 𝐉𝐨𝐢𝐧 protocol with any identity of its choice; and 2) ask the challenger to run the 𝐔𝐩𝐝𝐚𝐭𝐞 protocol (with all identities that have joined), supplying any new database update of its choice.

At the end of the security game, we enter a challenge phase. During the challenge phase, the adversary specifies μ challenge identities, and the auditor will run the 𝐀𝐮𝐝𝐢𝐭 protocol with these challenge identities. If all μ identities succeed in passing the audit, we want to extract a portion of the database whose size is roughly commensurate with the total space of all the μ nodes, that is, roughly (1ϵ)μS blocks of information. To capture this intuition, our definition below involves a compressor algorithm 𝒞𝒜 and an extractor 𝒜. Intuitively, the compressor’s job is to output the missing n(1ϵ)μS blocks of information – denoted 𝖣𝖡short – that 𝒜 may not know. Now, upon receiving this missing information 𝖣𝖡short, the extractor 𝒜 can extract the entire database. Both algorithms are allowed to interact with the oracle 𝒜, including rewinding 𝒜 and supplying it with fresh randomness on every invocation.

2.2.2 Formal Definition

Formally, given parameters λ,S,μ, define the following experiment 𝖱𝖾𝖼𝗏𝖤𝗑𝗉𝗍𝒜(1λ,S,μ) between an adversary 𝒜 and a challenger:

Experiment 𝖱𝖾𝖼𝗏𝖤𝗑𝗉𝗍𝒜(𝟏λ,S,μ):

  • Initialization. Run 𝖼𝗋𝗌𝐒𝐞𝐭𝐮𝐩(1λ). The adversary 𝒜(1λ,S,μ,𝖼𝗋𝗌) specifies an initial database 𝖣𝖡Σn0 of length n0, and the challenger runs ϕ,𝗌𝗍0𝐈𝐧𝐢𝐭(𝖼𝗋𝗌,S,𝖣𝖡) and sends ϕ to 𝒜.

  • Queries. The adversary 𝒜 can adaptively make Join and Update queries. In response, the challenger acts on behalf of an honest data publisher and always keeps track of the publisher’s latest state. More precisely:

    • Join: 𝒜 asks the challenger to run the 𝐉𝐨𝐢𝐧 protocol with itself. During this protocol, 𝒜 acts on behalf of the newly joining node, and it can act arbitrarily. This means that 𝒜 can also arbitrarily choose the identity of the newly joining node.

    • Update: 𝒜 chooses some update 𝗎𝗉𝖽 and asks the challenger to run the 𝐔𝐩𝐝𝐚𝐭𝐞 protocol with all the identities that have joined so far. The publisher uses its latest state and 𝗎𝗉𝖽 as input. 𝒜 acts on behalf of all the storage nodes and can behave arbitrarily during the protocol.

  • Challenge. At some point, suppose that the database size has increased to nn0 following the 𝐔𝐩𝐝𝐚𝐭𝐞 queries. Now, for j=1 to μ sequentially: the adversary 𝒜 specifies 𝑖𝑑j, and the challenger, acting on behalf of the auditor, initiates an 𝐀𝐮𝐝𝐢𝐭 instance using (𝖼𝗋𝗌,ϕ,𝑖𝑑j) as input. We say that the challenger accepts if it accepts in all 𝐀𝐮𝐝𝐢𝐭 instances.

Going forward, we will treat the above security experiment as occurring in the following two phases, and likewise we will consider the adversary to be of the form 𝒜=(𝒜1,𝒜2):

  • In the first phase, the challenger interacts with 𝒜1 for the Initialization and for all Queries. We denote 𝗌𝗍𝒜 to be 𝒜’s internal state at the end of this interaction.

  • In the second phase, the experiment enters the 𝐂𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐞 phase, and the challenger interacts with 𝒜2(𝗌𝗍𝒜).

We are now ready to introduce the definition of approximate best-possible recoverability. Throughout, given algorithms and 𝒪, we use the notation 𝒪 to mean that the algorithm has oracle access to 𝒪.

Definition 3 (Approximate best-possible recoverability).

Let ϵ(0,1) where ϵ is possibly a function in the other parameters. We say that an 𝗂𝖣𝖣𝖠 scheme satisfies ϵ-best-possible-recoverability iff there exist a compression algorithm 𝒞, an extractor algorithm , and a quasi-polynomial function q(), such that

  • 𝒞’s output 𝖣𝖡short is at most max(B(n(1ϵ)Sμ),0) bits long, and ’s running time is upper bounded by q(λ,t𝒜) where t𝒜 denotes 𝒜’s maximum running time.

  • for any non-uniform deterministic polynomial-time algorithm 𝒜=(𝒜1,𝒜2), any S and μ, there exists a negligible function 𝗇𝖾𝗀𝗅() such that for every λ,

    Pr[(𝖣𝖡ext𝖣𝖡)(b=1)|b,ϕ,𝖣𝖡,𝗌𝗍𝒜,𝑡𝑟𝖱𝖾𝖼𝗏𝖤𝗑𝗉𝗍𝒜(1λ,S,μ)ρ${0,1}|ρ| 𝖣𝖡short𝒞𝒜(1λ,S,μ,𝑡𝑟;ρ)𝖣𝖡ext𝒜2(𝗌𝗍𝒜)(1λ,n,S,μ,ϕ,𝖣𝖡short;ρ) ]𝗇𝖾𝗀𝗅(λ),

    where ρ denotes the random coins consumed by the extractor , which is also shared with the compressor 𝒞.

In the above, we use the notation b,ϕ,𝖣𝖡,𝗌𝗍𝒜,𝑡𝑟𝖱𝖾𝖼𝗏𝖤𝗑𝗉𝗍𝒜(1λ,S,μ) to mean the following: execute the experiment 𝖱𝖾𝖼𝗏𝖤𝗑𝗉𝗍𝒜(1λ,S,μ) with 𝒜, and

  • let b denote whether the challenger accepts at the end;

  • let ϕ denote the digest and let 𝖣𝖡 be the correct database as we enter the 𝐂𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐞 phase – specifically, ϕ and 𝖣𝖡 can be computed from the initial database 𝖣𝖡 and the sequence of updates submitted by 𝒜 during the Initialization and Query phases;

  • let 𝗌𝗍𝒜 be 𝒜’s internal state as we enter the 𝐂𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐞 phase; and

  • let 𝑡𝑟 be all of the random coins consumed by the challenger.

Intuitively, the definition means that if the adversary 𝒜 is able to successfully pass the audit and get remuneration on behalf of μn(1ϵ)S storage nodes, it must have knowledge of at least (1ϵ)μS blocks of useful information. This information, when combined with an additional n(1ϵ)μS blocks of information output by the compressor 𝒞𝒜, is sufficient for reconstructing the entire database.

Important special case: recover entire database.

When μSn/(1ϵ), the definition implies that we must be able to extract the entire database from the μ identities that can pass the audit. In this special case, the compressor 𝒞𝒜’s output is forced to be empty by the definition.

2.2.3 Discussions: Can We Achieve Polynomial-Time Extractability?

Definition 3 allows the extractor to be quasi-polynomial time. One meaningful question is whether we can get polynomial-time extraction – if so, we can simply run the extractor to recover the dataset in practice. We stress that although many works in the proof-of-retrievability [30] and data availability sampling [27] literature claimed to achieve polynomial-time extraction, their extractability notions are much weaker than ours (Definition 3). If we also adopt the same relaxations to match the nature of the definitions in prior work, we can easily achieve polynomial-time extraction too without rewinding the adversary – details are provided in the appendix of the full version. For this reason, we do not explicitly define reconstruction in the honest algorithms, since one can just run this extractor to reconstruct.

However, since our paper is aiming to lay the definitional groundwork of decentralized data archival, we choose to take a step back, elucidate the definitional subtleties, and rethink what is the right notion. We believe that our Definition 3 would actually be the desired notion – while our current proof needs a quasi-polynomial time extractor, we leave it as an interesting open question whether it is possible to achieve polynomial-time extraction under our notion. In particular, we believe that adapting the techniques of Attema et al. [5] to our setting is a promising direction for improving the extractor to expected polynomial time.333It is also meaningful to compare with the knowledge soundness extractors in the proof systems literature. The extraction in proofs of retrievability [30], data availability sampling [27], and in our work can be viewed as a special case of the knowledge extractor of Kilian [15]. In particular, since our proofs do not involve proving computation, the Probabilistic Checkable Proofs (PCP) in Kilian is replaced with a simpler erasure code. Note strictly polynomial-time extraction for Kilian is not known, and there are reasons to believe that it is impossible [16].

We now explain why the relaxations adopted in the prior literature on proofs of retrievability and data availability sampling are not ideal, even though they make polynomial-time extraction possible. Take the work of Hall-Andersen et al. [27] as an example. The differences in the notions are explained below:

  • The soundness notion of Hall-Andersen et al. [27] (see page 8 of their paper) says that if we invoke the adversary in the audit protocol polynomially many times, then the following holds with probability close to 1: if the adversary succeeds in all of the sessions, then we can extract the dataset from the transcripts collected from all these invocations (and there is no need to rewind the adversary). Their definition is weak in the following sense. Consider an occasionally successful adversary that succeeds in passing the audit with constant probability. Then, the probability that the adversary succeeds in all polynomially many invocations is negligibly small. In other words, their soundness definition does not provide extractability from such an occasionally successful adversary. To mitigate this drawback, they provide a slightly strengthened definition called subset soundness. Even subset soundness makes a strong assumption on the adversary – it essentially assumes that with probability 1, the adversary will succeed in at least out of L invocations of audit, where and L are a-priori known polynomials.

  • Our notion (Definition 3) follows the definitional paradigm of the knowledge soundness property in the standard zero-knowledge literature. We do not make any a-priori assumption on the success probability of the adversary. To explain our definition, below we focus on the special case where we want to extract the entire database, and we assume that μS(1+ϵ)n, i.e., the number of adversarial nodes μ is large enough such that the combined space of μ nodes can store the entire dataset. Our definition says that the following holds with probability close to 1: as long as all μ adversarial nodes succeed in passing the audit once, we can extract the entire dataset by rewinding the adversary. We stress that in our definition, the rewinding is necessary because the transcript length of a single invocation (with all μ nodes) is not long enough to encode the entire dataset, and we cannot simply extract from the transcript with just one invocation.

Another way to understand the comparison is that Hall-Andersen et al. [27]’s definition essentially bakes the extractor into the definition (by invoking the adversary in an audit polynomially many times), and they achieve polynomial-time extraction simply by making a strong assumption on the adversary, i.e., assuming that the adversary will succeed enough times after an a-priori known polynomial number of invocations. In the appendix of the full version, we explain how our proofs can be easily extended to show polynomial-time extraction under the same relaxations as in Hall-Andersen et al. [27].

Besides Hall-Andersen et al. [27], other works in this body of literature [30] make some different and non-standard assumptions on the adversary to achieve polynomial-time extraction.

2.3 Security Definition: Replication Security

Let 𝒜=(𝒜1,𝒜2) denote the adversary’s algorithm, where 𝒜1 participates in the
𝐈𝐧𝐢𝐭𝐢𝐚𝐥𝐢𝐳𝐚𝐭𝐢𝐨𝐧 and 𝐐𝐮𝐞𝐫𝐲 phases of the 𝖯𝗈𝖱𝖾𝗉𝖤𝗑𝗉𝗍 to be defined, and 𝒜2 participates in the 𝐂𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐞 phase. Now, we define the following security experiment.

Experiment 𝖯𝗈𝖱𝖾𝗉𝖤𝗑𝗉𝗍𝒜(𝟏λ,S,μ):

  • Initialization. Same as in the 𝖱𝖾𝖼𝗏𝖤𝗑𝗉𝗍 experiment earlier, where 𝒜1(1λ,S,μ) interacts with the challenger.

  • Queries. Same as in the 𝖱𝖾𝖼𝗏𝖤𝗑𝗉𝗍 experiment earlier, where 𝒜1 continues to interact with the challenger. At the end of the query phase, 𝒜1 outputs some state 𝗌𝗍𝒜 to be passed to 𝒜2.

  • Challenge. 𝒜2 receives 𝗌𝗍𝒜 as input, and outputs μ challenge identities 𝑖𝑑1,,𝑖𝑑μ. The challenger picks a random 𝑖𝑑{𝑖𝑑1,,𝑖𝑑μ}, and invokes an 𝐀𝐮𝐝𝐢𝐭 instance with 𝒜2, which acts on behalf of identity 𝑖𝑑. The adversary is said to win this game if it passes the audit.

Definition 4 (Replication security).

Let ϵ(0,1). We say that a decentralized data archival scheme satisfies ϵ-replication-security iff for any non-uniform deterministic polynomial-time adversary 𝒜=(𝒜1,𝒜2) such that 𝒜1 is restricted to outputting a state 𝗌𝗍𝒜 of space at most μαBS, then the probability that 𝒜 wins the above 𝖯𝗈𝖱𝖾𝗉𝖤𝗑𝗉𝗍𝒜(1λ,S,μ) game is at most αϵ.

Intuition.

Intuitively, replication security says that if an adversary wants to get α times the fair reward in expectation, it must be consuming roughly α times the space, and this must hold even when the original data itself is compressible. We now elaborate on how to understand the definition. Suppose the adversary dedicates αμS blocks of space where S is the space of an honest node. We expect that in every period, the adversary should get roughly αμ times the fair reward. However, the adversary can allocate αμS blocks of space among its μ nodes in various ways: it can allocate S space for αμ nodes and 0 space for the remaining nodes, but it can also equally allocate αS space to each of the μ nodes. Regardless, the adversary should not get noticeably more than αμ times the fair reward. Our replication security definition captures this intuition by randomly sampling a challenge node among the μ specified nodes, which is effectively an “averaging argument” to capture the idea that “no matter how the adversary allocates its space, it should get no more than αμ times the fair reward in expectation”. Another reason why we randomly sample a challenge node in the definition rather than challenging all of them is because in practice, it may make sense to randomly sample the time at which each node is challenged. If the challenge times are known ahead of time, then the adversary can reconstruct the storage right before the challenge and delete the space afterwards.

Our replication security notion implies an ϵ-Nash-equilibrium, that is, a node with a fixed amount of space cannot get ϵ fraction more rewards than behaving honestly and contributing all of its space towards archiving the dataset. Because honest behavior is an equilibrium, it implies that that in the equilibrium state, any node claiming rewards commensurate with contributing αn blocks of space is actually storing α copies of the database.

3 Informal Technical Roadmap

In this section, we give an informal description of the ideas behind our construction. A formal description is provided in the full version of the paper.

To understand the technicalities, let us begin with a couple strawman solutions, and gradually work our way towards the final solution.

3.1 Inefficient Strawman

We begin with a strawman scheme that indeed satisfies the desired security notions, despite being inefficient.

Underlying static construction.

First, if the database were static, we can employ the following idea.

  • Preprocess and publish. The trusted data publisher computes an erasure code over the original database 𝖣𝖡, resulting in 𝖣𝖡¯. It then computes and publishes the Merkle digest denoted ϕDB of 𝖣𝖡¯.

  • Store. Every storage node with roughly S blocks of space uses a random oracle G(𝑖𝑑,) to sample S indices. It then downloads 𝖣𝖡¯[S] along with the relevant Merkle opening proofs, and computes a replication encoding of 𝖣𝖡¯[S] henceforth called the node’s shard. Specifically, we will use the PoRep scheme of Pietrzak [34]. The resulting replication encoding has S blocks, and it provides the guarantee that the encoding is incompressible if the node later wants to pass the audit within bounded time.

    The node additionally computes a corresponding Merkle digest ϕshard of its replication-encoded shard, and a succinct correct proof π attesting to the fact that ϕshard is computed correctly w.r.t. G(𝑖𝑑,) and ϕDB. Finally, the node stores the following information:

    1. 1.

      its shard, the shard’s Merkle digest ϕshard, as well as a proof of correctness π of the digest ϕshard, and

    2. 2.

      the Merkle openings of all (replication-encoded) blocks in the shard.

  • Audit. An auditor samples κ=ω(logλ) random challenge indices among [S] and asks the storage node to open the replication-encoded blocks at these positions. The node responds with the challenged positions, along with ϕshard, π, and the Merkle opening proofs of the opened positions w.r.t. ϕshard. The auditor accepts if π verifies and all Merkle opening proofs verify.

Throughout the paper, we assume that the block size B is at least polylogarithmically large, such that the space required for storing metadata (e.g., digests and opening proofs) is absorbed by the block storage.

 Remark 5 (Regarding the succinct proof of correctness).

The aforementioned succinct proof of correctness can be computed using a Succinct Non-interactive ARgument of Knowledge (SNARK) scheme. The SNARK is undesirable not only due to the extra computational costs, but also because we need a SNARK proof over computations that involve random oracle queries. Recent work [6] has shown that constructing such a relativized SNARK is impossible. We discuss how to get rid of the SNARK proof in Section 3.4.

Making it dynamic: inefficient approach.

Now, if the database is evolving over time, the most naïve approach is to rerun the static scheme from scratch with every update. The resulting scheme indeed satisfies ϵ-best-possible recoverability as well as ϵ-replication-security for an arbitrarily small ϵ(0,1). Unfortunately, for each update, the scheme would incur O~(BN) cost for the publisher and O(BS) cost for each storage node, where N is the maximum data size.

We ask the natural question: can we reduce the cost per update to O~(1)B ?

3.2 How to Apply a Hierarchical Data Structure: the Space Allocation Problem

To answer the above question, the first idea that comes to mind is to apply the hierarchical data structure of Bentley and Saxe [9], which turns a static data structure into a dynamic one. This approach also draws inspiration from prior works on dynamic proofs of retrievability [38, 13].

Unfortunately, the hierarchical data structure does not generically work for any cryptographic scheme. In particular, as explained below, both our security definitions and our underlying cryptographic building blocks lack the appropriate compositional guarantees needed for a blackbox application of the hierarchical data structure. To see this, it helps to go over a couple strawman ideas.

Review: publisher’s hierarchical data structure.

With Bentley and Saxe’s techniques [9], the publisher can maintain a hierarchy of levels numbered 0,1,,L=O(logN), respectively, where each level {0,1,,L} is either an erasure code of 2 blocks, or empty.

Every time a new data block arrives, we find the smallest empty level denoted , and merge all smaller levels as well as the newly arriving block into level , by recomputing an erasure code of these blocks. The levels smaller than now become empty. Further, suppose that the data publisher publishes a Merkle digest of each level. Assuming that the block size B is larger than the size of a Merkle opening proof, and the erasure code has constant rate, then the amortized publisher cost for maintaining this hierarchical data structure can be as small as O(BlogN) if we use a special updatable erasure code proposed in prior work [38].

The space allocation question.

The idea is for each storage node with approximately S blocks of space to choose s random erasure-coded blocks per non-empty level for its shard, such that s=S. As before, this selection can be made with the help of a random oracle G(𝑖𝑑,). During the audit, the auditor will challenge μ=ω(logλ) random indices per level.

The challenging question is: how do we allocate the local space S among the levels? To understand the subtleties here, it helps to look at a couple naïve approaches:

  • (Idea 1) Uniform sampling rate: prone to selective censorship. The first idea is to use a uniform sampling rate. Specifically, let p=S/n where S is the local space provisioning and n is the current database size – for a growing database, we can recompute the sampling rate p whenever the database size has reached (1+o(1))nprev where nprev denotes the database size when p was last calculated. Now, a node would sample every block in any level with uniform probability p, so that in expectation, it will get S blocks for its shard.

    Unfortunately, this natural approach is fundamentally flawed. Under this approach, a node would end up sampling Θ(S) blocks in expectation for the largest level. However, for any constant-sized level (say, level 0), the fraction of nodes required to store some block of that level is only Θ(S/n). This means that if the adversary selectively deletes the small fraction of identities assigned to store some block of level 0, it can completely wipe out the data belonging to level 0! Another way to think of the same attack is the following. As mentioned, it could be that the adversary controls all the pseudonyms that are contributing and requesting rewards. In this case, if the adversary chooses only identities that are not assigned any block of level 0 (e.g., through rejection sampling), it can successfully wipe out level 0 altogether while still being able to pass the audits.

  • (Idea 2) Same number of blocks per level: causes space waste. To fix the problem with idea 1, we can increase the sampling rate of the smaller levels. A natural idea is to sample the same number of blocks per level regardless of the level’s size. In other words, a node samples s=S/L^(n) blocks for every non-empty level, where L^(n) denotes the number of non-empty levels under a size-n database. Unfortunately, this approach suffers from a different problem partly because the smaller levels effectively replicate their data much more abundantly than the larger levels, leading to a waste problem. In particular, ϵ-best-possible recoverability requires that when μ nodes pass the audit and the total space provisioned satisfies μS(1+ϵ)n, we should be able to extract the entire dataset. However, when μ is chosen to ensure recoverability of the largest level L, i.e., μs(1+ϵ)2L, the total space provisioned μS could exceed the data size by a logarithmic factor, that is, Sμ=Ω(nlogn). In other words, this approach would require a total space provisioning of Θ(nlogn) rather than (1+ϵ)n to recover a database of size n.

More fundamentally, Idea 2 fails partly because our security definition itself lacks compositional guarantees. Specifically, Idea 2 can in fact be shown to satisfy ϵ-best-possible recoverability for each individual level, as long as we select the parameter S and the rate of the erasure code satisfy some mild assumptions. Unfortunately, it is not the case that if each individual level satisfies ϵ-best-possible recoverability, the union of all levels would also satisfy the same notion. Partly, this is because the number of nodes μ needed for the space provisioning per level to roughly match the level’s size can vary across the levels!

A hybrid space allocation scheme.

We resolve the aforementioned challenges by using a hybrid of the aforementioned ideas. Specifically, we divide the levels into two categories: the biggest 2loglogn levels numbered L2loglogn+1 to L are called the large levels, and all remaining levels are called the small levels. In our formal description later, we generalize the parameter 2loglogn to more general forms, but for clarity, we simply use 2loglogn in the informal roadmap. By renaming, we may assume that each node has (1+o(1))S blocks of space rather than S for some suitable sub-constant function o(1). We will allocate the (1+o(1))S blocks of space among the levels as follows:

  • Large levels. We dedicate S blocks of space in aggregate to the large levels. Among the large levels, we adopt a uniform sampling rate. In other words, a large level i+1 occupies twice as much space as level i.

  • Small levels. We dedicate So(1) blocks of space in aggregate to the small levels. Further, this space is divided evenly across the small levels.

This hybrid approach achieves the best of both worlds. First, by having a higher sampling rate in the smaller levels, we prevent the aforementioned selective censorship attack. Second, since nearly all the space is allocated to the large levels, the space waste caused by the smaller levels is restricted to So(1). Moreover, the o(1) factor loss can be absorbed into the ϵ-slack already permitted in our security definition.

One remaining technicality is how to formally argue resilience against the selective censorship attack mentioned earlier. In particular, if for some level s, each storage node samples only s=1 block to store, then an adversary can carefully select a set of identities that cause 2/3 of the (erasure-coded) blocks to be lost. Specifically, the adversary can use rejection sampling to select only identities that are assigned a block from the first 1/3. Intuitively, when s is larger, erasing 2/3 of the blocks appears much harder, since only a negligible fraction of 𝑖𝑑s avoid hitting any specific 1/3 of the blocks.

To formalize this intuition, we consider an adversary that can make at most polynomially many queries to G(𝑖𝑑,). After these queries, it selects a subset of μ queried identities denoted 𝑖𝑑1,,𝑖𝑑μ to minimize the union G(𝑖𝑑1,)G(𝑖𝑑μ,). We prove that with all but negligible probability, |G(𝑖𝑑1,)G(𝑖𝑑μ,)|min(2,(1ϵ)sμ), as long as i) the number of blocks sampled s is at least polylogarithmically large, and ii) the redundancy (i.e., inverse rate) of the erasure code is a suitably large constant. In particular, this implies that as long as (1ϵ)sμ2, then the union of any μ nodes’ shards are sufficient for reconstructing level which contains 2 blocks – even when the identities are adversarially chosen.

Of course, our actual proof is more complicated than the above. In the actual proof, the above combinatorial reasoning is embedded in some extraction argument that takes into account the fact that even when a node passes the audit, it may not be storing all blocks belonging to its shard. We defer the details to Section 9 of the full version.

The careful reader may also observe that a straightforward instantiation of our hybrid space allocation idea would incur roughly O~(S)B amortized cost per update for a storage node. However, later in Section 3.4, we will discuss some additional tricks that bring this cost down to O~(1)B.

3.3 Handling Compositional Challenges

Lack of composition in the underlying replication code.

Although the new hybrid space allocation appears to address the aforementioned issues, we are not aware of any method to formally prove its security. Specifically, one important challenge we face is that the underlying replication encoding scheme [34] lacks compositional guarantees.

Pietrzak [34] only proved the incompressibility (subject to answering challenges quickly) of his replication encoding scheme in a standalone setting. More specifically, imagine that there is a single database, and we use the digest of the database to seed the hash function used to compute the replication code. Then, the resulting replication code is incompressible subject to answering challenges quickly.

In our application, there is a separate instance of the replication encoding per level. The most straightforward approach is for each level to use the level’s own digest (along with the storage node’s 𝑖𝑑) to seed its own hash function. With this approach, however, the adversary in our security experiment would be able to choose the data contents of the smaller levels to depend on the replication codes of the larger levels. This is because in our security experiment, the adversary chooses the updates adaptively over time. So when it chooses the blocks that go into the smaller levels, the replication codes of the larger levels are already available.

Unfortunately, Pietrzak’s proofs [34] fall apart in such a setting when multiple instances are composed and some instances’ data can depend on other instances’ replication code. Upon a closer examination, Pietrzak’s proof has the following blueprint – henceforth let 𝒜=(𝒜1,𝒜2) where 𝒜1 is the adversary that outputs some database 𝖣𝖡 and an adversarial replication encoding denoted 𝗌𝗍𝒜 of 𝖣𝖡, and 𝒜2(𝗌𝗍𝒜) is the adversary interacting with the auditor.

  • First, he shows that if 𝒜2(𝗌𝗍𝒜) can answer challenges quickly, then there is a winning strategy to an underlying pebbling game. Specifically, by placing some initial labels on the depth-robust graph, the adversary can pebble almost all vertices in a small number of steps.

  • Second, he analyzes the underlying pebbling game and argues that for any winning strategy, the number of initial pebbles must be large.

  • Finally, he shows that if 𝗌𝗍𝒜 is short, and the number of initial pebbles is large, then one can construct an encoding scheme to compress the random oracle H(ϕDB,) where ϕDB is the digest of the challenge database, thus contradicting Shannon’s theorem that random strings are incompressible. Intuitively, every initial pebble corresponds to a location 𝒜2(𝗌𝗍𝒜) can predict in H(ϕDB,). As a result, we need not record these predicted positions in the encoded string, thus achieving compression. The encoded string consists of the short 𝗌𝗍𝒜, H(ϕDB,) at all non-predicted locations, and a small amount of metadata needed for extracting from 𝒜2(𝗌𝗍𝒜) the predicted locations.

The subtlety in the proof lies with the decoder. For technical reasons, the decoder needs to know the database to successfully extract H(ϕDB,) at the predicted locations. In a standalone setting, before running 𝒜2 to perform decoding, the decoder first runs 𝒜1 till the point it first submits ϕDB to extract the database – henceforth this is called the preparation stage. If the digest ϕDB is also computed from a random oracle, then except with negligible probability, 𝒜1 cannot have queried H(ϕDB,) yet before revealing the database. In our composed setting, the same proof strategy fails, since 𝒜1 will submit the different level’s data incrementally. This means that before submitting the data in level 1, 𝒜1 may already start querying H(ϕ0DB,) yet where ϕ0DB is the digest of the 0-th level. Unfortunately, the decoder has no way of answering queries to H(ϕ0DB,) yet in the preparation stage, without having run 𝒜2 to perform the decoding.

Although the literature also comes with some other replication encoding candidates [23, 22], to the best of our knowledge, no known scheme provides the compositional guarantees we desire.

Our idea.

One way to solve the problem is to devise a replication coding scheme with the desired compositional guarantees, where the data of some instances may depend on the replication code of other instances. Unfortunately, we are not aware of any existing tools that can be used to achieve this: existing techniques for proving space-time tradeoffs through direct incompressibility arguments are highly involved and apply to extremely limited settings [19, 20]. While some other proof techniques [39, 18, 1, 25, 2] have been shown to prove space-time tradeoffs, they do not produce meaningful results in our setting.

Fortunately, we devise a method that side-steps this problem. Whenever a new erasure-coded level is rebuilt in the hierarchical data structure, we ask a storage node to recompute its replication codes for all levels, using the union of all levels’ digests (ϕ0,,ϕL) along with the node’s 𝑖𝑑 as the seed to the random oracle.

At first sight, this approach comes with additional computational overhead on the storage node. Specifically, the computational cost per update is at least SB. However, in Section 3.4, we discuss additional tricks to asymptotically reduce the computational overhead, and achieve O~(1)B amortized cost per update.

3.4 Further Improvements

Achieving 𝑶~(𝟏)𝑩 amortized cost for a storage node.

So far, we have a candidate scheme but each storage node must pay at least SB download bandwidth and computational cost per update. We propose a couple additional tricks to bring this cost down to O~(1)B. For example, when S is roughly n (see also Section 7.1 of the full version), these tricks bring significant cost savings. We stress that it is inherently unavoidable that each node must pay at least constant cost per update subject to our definitions – in this sense, our update costs are nearly optimal. Specifically, if 1% of the nodes do nothing upon an update, it means that they have no information about the new block. Now, if the adversary selectively corrupts the 99% remaining nodes, it can selectively erase this new block from the universe even if the 1% nodes’ combined storage is already large enough to store the entire dataset.

  1. 1.

    Small {tiny, mini, small}: We further divide the small levels into tiny, mini, and small levels. For the tiny levels each containing at most κ=ω(logλ) blocks, the node simply stores the original data blocks (without any encoding), and all of the blocks are challenged during the audit. For the mini levels each containing between κ and Θ(S/log2n) blocks, the node stores all encoded blocks belonging to the level (without using the random oracle G to subsample), and κ=ω(logλ) of them will be challenged during the audit. The small levels are treated in the same way as before.

  2. 2.

    Seed with an aggregate digest only for the large levels: Instead of making all levels’ replication codes use the aggregate digest (ϕ0,,ϕL) as the seed, we have only the large levels use the aggregate digest {ϕ}large, and the small levels use the level’s own digest as the seed. Recall that the large levels occupy 1o(1) fraction of the local space, this modification does not affect our ϵ-replication security since the o(1) loss can be absorbed into the arbitrarily small constant slack ϵ(0,1).

A more detailed description and analysis of these optimizations are provided in Section 6.2 of the full version.

Removing the SNARK proof.

Recall that so far, in our underlying static construction, we rely on a SNARK proof to attest to the correctness of the shard’s digest ϕshard w.r.t. the database’s digest ϕDB and the indices selected by the shard G(𝑖𝑑,). Using a SNARK, however, comes with a couple drawbacks. First, it incurs additional costs of cryptographic computation. Second, since the replication encoding is computed using a random oracle (RO), we would end up computing a SNARK over computations that involve RO queries – this is undesirable due to impossibilities of relativized succinct arguments in the RO model.

In our final construction, we avoid this SNARK proof by checking correctness at a few challenged locations, resulting in a proof of approximate correctness (rather than strict correctness). Further, we make our approach non-interactive by having the node sample the challenges itself through a random oracle (commonly known as the Fiat-Shamir paradigm). The fact that we only have approximate correctness introduces additional technicalities in our proof. Note that in comparison, Pietrzak’s proof works only if the encoder is honest. Nonetheless, we show that approximate correctness is sufficient for proving our security notions. We defer the details to the subsequent formal technical sections.

Extensions.

In Section 7 of the full version, we discuss a couple of extensions. Specifically, we show how to support non-uniform space provisioning among nodes, ensuring that a node with k times the space need not incur k times the update and audit costs. We also show when provided with a trusted hash digest of the original dataset ϕorig (e.g., from the blockchain’s consensus layer), how to instantiate the publisher without any trust, by relying on an Incremental Verifiable Computation (IVC) scheme to incrementally compute succinct proofs that vouch for the correctness for the digests of the erasure-coded hierarchical data structure.

References

  • [1] Akshima, David Cash, Andrew Drucker, and Hoeteck Wee. Time-space tradeoffs and short collisions in merkle-damgard hash functions. In CRYPTO, 2020.
  • [2] Akshima, Siyao Guo, and Qipeng Liu. Time-space lower bounds for finding collisions in merkle-damgård hash functions. In Advances in Cryptology - CRYPTO 2022 - 42nd Annual International Cryptology Conference, CRYPTO 2022, Santa Barbara, CA, USA, August 15-18, 2022, Proceedings, Part III, volume 13509 of Lecture Notes in Computer Science, pages 192–221. Springer, 2022. doi:10.1007/978-3-031-15982-4_7.
  • [3] Mustafa Al-Bassam, Alberto Sonnino, Vitalik Buterin, and Ismail Khoffi. Fraud and data availability proofs: Detecting invalid blocks in light clients. In Financial Cryptography and Data Security: 25th International Conference, FC 2021, Virtual Event, March 1–5, 2021, Revised Selected Papers, Part II, pages 279–298, Berlin, Heidelberg, 2021. Springer-Verlag. doi:10.1007/978-3-662-64331-0_15.
  • [4] Giuseppe Ateniese, Seny Kamara, and Jonathan Katz. Proofs of storage from homomorphic identification protocols. In ASIACRYPT, 2009.
  • [5] Thomas Attema, Michael Klooß, Russell W. F. Lai, and Pavlo Yatsyna. Adaptive special soundness: Improved knowledge extraction by adaptive useful challenge sampling. IACR Cryptol. ePrint Arch., page 2038, 2024. URL: https://eprint.iacr.org/2024/2038.
  • [6] Annalisa Barbara, Alessandro Chiesa, and Ziyi Guan. Relativized succinct arguments in the ROM do not exist. IACR Cryptol. ePrint Arch., page 728, 2024. URL: https://eprint.iacr.org/2024/728.
  • [7] Jeb Bearer, Benedikt Bünz, Philippe Camacho, Binyi Chen, Ellie Davidson, Ben Fisch, Brendon Fish, Gus Gutoski, Fernando Krell, Chengyu Lin, Dahlia Malkhi, Kartik Nayak, Keyao Shen, Alex Xiong, Nathan Yospe, and Sishan Long. The espresso sequencing network: HotShot consensus, tiramisu data-availability, and builder-exchange. Cryptology ePrint Archive, Paper 2024/1189, 2024. URL: https://eprint.iacr.org/2024/1189.
  • [8] Eli Ben-Sasson, Iddo Bentov, Yinon Horesh, and Michael Riabzev. Fast reed-solomon interactive oracle proofs of proximity. In 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, July 9-13, 2018, Prague, Czech Republic, volume 107 of LIPIcs, pages 14:1–14:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPIcs.ICALP.2018.14.
  • [9] Jon Louis Bentley and James B Saxe. Decomposable searching problems i. static-to-dynamic transformation. Journal of Algorithms, 1(4):301–358, 1980. doi:10.1016/0196-6774(80)90015-2.
  • [10] Dan Boneh, Joachim Neu, Valeria Nikolaenko, and Aditi Partap. Data availability sampling with repair. Cryptology ePrint Archive, Paper 2025/1414, 2025. URL: https://eprint.iacr.org/2025/1414.
  • [11] Kevin D. Bowers, Ari Juels, and Alina Oprea. Proofs of retrievability: theory and implementation. In CCSW, 2009.
  • [12] Christian Cachin and Stefano Tessaro. Asynchronous verifiable information dispersal. In Proceedings of the 19th International Conference on Distributed Computing, DISC’05, pages 503–504, Berlin, Heidelberg, 2005. Springer-Verlag. doi:10.1007/11561927_42.
  • [13] Nishanth Chandran, Bhavana Kanukurthi, and Rafail Ostrovsky. Locally updatable and locally decodable codes. In Theory of Cryptography - 11th Theory of Cryptography Conference, TCC 2014, San Diego, CA, USA, February 24-26, 2014. Proceedings, volume 8349 of Lecture Notes in Computer Science, pages 489–514. Springer, 2014. doi:10.1007/978-3-642-54242-8_21.
  • [14] Arunima Chaudhuri, Sudipta Basak, Csaba Kiraly, Dmitriy Ryajov, and Leonardo Bautista-Gomez. On the design of ethereum data availability sampling: A comprehensive simulation study, 2024. doi:10.48550/arXiv.2407.18085.
  • [15] Alessandro Chiesa, Marcel Dall’Agnol, Ziyi Guan, Nicholas Spooner, and Eylon Yogev. Untangling the security of kilian’s protocol: Upper and lower bounds. In Theory of Cryptography: 22nd International Conference, TCC 2024, Milan, Italy, December 2–6, 2024, Proceedings, Part I, pages 158–188, Berlin, Heidelberg, 2024. Springer-Verlag. doi:10.1007/978-3-031-78011-0_6.
  • [16] Alessandro Chiesa, Ziyi Guan, and Yuetian Wu. Unpublished manuscript. Private conversation with Ziyi Guan, 2025.
  • [17] Seunghyun Cho, Eunyoung Seo, and Young-Sik Kim. Locally recoverable data availability sampling. Cryptology ePrint Archive, Paper 2025/1851, 2025. URL: https://eprint.iacr.org/2025/1851.
  • [18] Sandro Coretti, Yevgeniy Dodis, Siyao Guo, and John P. Steinberger. Random oracles and non-uniformity. In EUROCRYPT (1), pages 227–258. Springer, 2018. doi:10.1007/978-3-319-78381-9_9.
  • [19] Anindya De, Luca Trevisan, and Madhur Tulsiani. Time space tradeoffs for attacks against one-way functions and prgs. In Advances in Cryptology – CRYPTO 2010, pages 649–665, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg. doi:10.1007/978-3-642-14623-7_35.
  • [20] Yevgeniy Dodis, Siyao Guo, and Jonathan Katz. Fixing cracks in the concrete: Random oracles with auxiliary input, revisited. In EUROCRYPT (2), pages 473–495. Springer, 2017. doi:10.1007/978-3-319-56614-6_16.
  • [21] Yevgeniy Dodis, Salil P. Vadhan, and Daniel Wichs. Proofs of retrievability via hardness amplification. In Theoretical Cryptography Conference (TCC), 2009.
  • [22] Ben Fisch. Poreps: Proofs of space on useful data. IACR Cryptol. ePrint Arch., page 678, 2018. URL: https://eprint.iacr.org/2018/678.
  • [23] Ben Fisch. Tight proofs of space and replication. In Advances in Cryptology – EUROCRYPT 2019: 38th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Darmstadt, Germany, May 19–23, 2019, Proceedings, Part II, pages 324–348, Berlin, Heidelberg, 2019. Springer-Verlag. doi:10.1007/978-3-030-17656-3_12.
  • [24] Ben Fisch, Arthur Lazzaretti, Zeyu Liu, and Lei Yang. Permissionless verifiable information dispersal (data availability for bitcoin rollups). Cryptology ePrint Archive, Paper 2024/1299, 2024. URL: https://eprint.iacr.org/2024/1299.
  • [25] Nick Gravin, Siyao Guo, Tsz Chiu Kwok, and Pinyan Lu. Concentration bounds for almost k-wise independence with applications to non-uniform security. In Proceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’21, pages 2404–2423, USA, 2021. Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611976465.143.
  • [26] Yanpei Guo, Alex Luoyuan Xiong, Wenjie Qu, and Jiaheng Zhang. Data availability for thousands of nodes. Cryptology ePrint Archive, Paper 2025/865, 2025. URL: https://eprint.iacr.org/2025/865.
  • [27] Mathias Hall-Andersen, Mark Simkin, and Benedikt Wagner. Foundations of data availability sampling. IACR Commun. Cryptol., 1(4):34, 2024. doi:10.62056/A09QUDHDJ.
  • [28] Mathias Hall-Andersen, Mark Simkin, and Benedikt Wagner. FRIDA: data availability sampling from FRI. In Advances in Cryptology - CRYPTO 2024 - 44th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 18-22, 2024, Proceedings, Part VI, volume 14925 of Lecture Notes in Computer Science, pages 289–324. Springer, 2024. doi:10.1007/978-3-031-68391-6_9.
  • [29] James Hendricks, Gregory R. Ganger, and Michael K. Reiter. Verifying distributed erasure-coded data. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing, PODC ’07, pages 139–146, New York, NY, USA, 2007. Association for Computing Machinery. doi:10.1145/1281100.1281122.
  • [30] Ari Juels and Burton Kaliski. Pors: proofs of retrievability for large files. In ACM CCS, 2007.
  • [31] Aniket Kate, Gregory M. Zaverucha, and Ian Goldberg. Constant-size commitments to polynomials and their applications. In Masayuki Abe, editor, Advances in Cryptology - ASIACRYPT 2010 - 16th International Conference on the Theory and Application of Cryptology and Information Security, Singapore, December 5-9, 2010. Proceedings, volume 6477 of Lecture Notes in Computer Science, pages 177–194. Springer, 2010. doi:10.1007/978-3-642-17373-8_11.
  • [32] A. Kupcu. Efficient cryptography for the next generation secure cloud. PhD thesis, Brown University, 2010.
  • [33] Kamilla Nazirkhanova, Joachim Neu, and David Tse. Information dispersal with provable retrievability for rollups. In Proceedings of the 4th ACM Conference on Advances in Financial Technologies, AFT ’22, pages 180–197, New York, NY, USA, 2023. Association for Computing Machinery.
  • [34] Krzysztof Pietrzak. Proofs of catalytic space. In 10th Innovations in Theoretical Computer Science Conference, ITCS 2019, January 10-12, 2019, San Diego, California, USA, volume 124 of LIPIcs, pages 59:1–59:25. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPIcs.ITCS.2019.59.
  • [35] Hovav Shacham and Brent Waters. Compact proofs of retrievability. In Asiacrypt, 2008.
  • [36] Hovav Shacham and Brent Waters. Compact proofs of retrievability. J. Cryptol., 26(3):442–483, July 2013. doi:10.1007/S00145-012-9129-2.
  • [37] Elaine Shi, Rose Silver, and Changrui Mu. Decentralized data archival: New definitions and constructions. Cryptology ePrint Archive, Paper 2025/969, 2025. URL: https://eprint.iacr.org/2025/969.
  • [38] Elaine Shi, Emil Stefanov, and Charalampos Papamanthou. Practical dynamic proofs of retrievability. In ACM Conference on Computer and Communications Security (CCS), 2013.
  • [39] Dominique Unruh. Random oracles and auxiliary input. In Proceedings of the 27th Annual International Cryptology Conference on Advances in Cryptology, CRYPTO’07, pages 205–223, Berlin, Heidelberg, 2007. Springer-Verlag. doi:10.1007/978-3-540-74143-5_12.
  • [40] Riad S. Wahby, Ioanna Tzialla, Abhi Shelat, Justin Thaler, and Michael Walfish. Doubly-efficient zksnarks without trusted setup. In 2018 IEEE Symposium on Security and Privacy (SP), pages 926–943, 2018. doi:10.1109/SP.2018.00060.
  • [41] Lei Yang, Seo Jin Park, Mohammad Alizadeh, Sreeram Kannan, and David Tse. DispersedLedger: High-Throughput byzantine consensus on variable bandwidth networks. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), pages 493–512, Renton, WA, April 2022. USENIX Association. URL: https://www.usenix.org/conference/nsdi22/presentation/yang.