Abstract 1 Introduction 2 Related Work 3 Technical Preliminaries 4 Stateless Scheduler 5 Stateful Scheduler 6 Conclusion References Appendix A Appendix

On the Efficiency of Dynamic Transaction Scheduling in Blockchain Sharding

Ramesh Adhikari ORCID School of Computer and Cyber Sciences, Augusta University, GA, USA Costas Busch ORCID School of Computer and Cyber Sciences, Augusta University, GA, USA Miroslav Popovic ORCID Faculty of Technical Sciences, University of Novi Sad, Serbia
Abstract

Sharding is a technique to speed up transaction processing in blockchains, where the n processing nodes in the blockchain are divided into s disjoint groups (shards) that can process transactions in parallel. We study dynamic scheduling problems on a shard graph Gs where transactions arrive online over time and are not known in advance. Each transaction may access at most k shards, and we denote by d the worst distance between a transaction and its accessing (destination) shards (the parameter d is unknown to the shards). To handle different values of d, we assume a locality sensitive decomposition of Gs into clusters of shards, where every cluster has a leader shard that schedules transactions for the cluster. We first examine the simpler case of the stateless model, where leaders are not aware of the current state of the transaction accounts, and we prove a O(dlog2smin{k,s}) competitive ratio for latency. We then consider the stateful model, where leader shards gather the current state of accounts, and we prove a O(logsmin{k,s}+log2s) competitive ratio for latency. Each leader calculates the schedule in polynomial time for each transaction that it processes. We show that for any ϵ>0, approximating the optimal schedule within a (min{k,s})1ϵ factor is NP-hard. Hence, our bound for the stateful model is within a poly-log factor from the best possibly achievable. To the best of our knowledge, this is the first work to establish provably efficient dynamic scheduling algorithms for blockchain sharding systems.

Keywords and phrases:
Blockchain, Blockchain Sharding, Dynamic Transaction Scheduling
Copyright and License:
[Uncaptioned image] © Ramesh Adhikari, Costas Busch, and Miroslav Popovic; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Computing methodologies Distributed algorithms
; Theory of computation Scheduling algorithms
Related Version:
Extended Version: https://arxiv.org/abs/2508.07472 [4]
Funding:
This paper is supported by NSF grant CNS-2131538.
Editor:
Dariusz R. Kowalski

1 Introduction

Blockchains are known for their special features, such as fault tolerance, transparency, non-repudiation, immutability, and security, and have been used in various applications and domains [16]. However, a drawback of blockchains is that the size of the blockchain network may impact the latency and throughput of transaction processing. To append a new block in a blockchain network, the participating nodes reach consensus, which is a time and energy-consuming process [2]. Moreover, each node is required to process and store all transactions, which leads to scalability issues in the blockchain system. Sharding protocols have been proposed to address the scalability and performance issues of blockchains [15, 21, 11, 1], which divide the overall blockchain network into smaller groups of nodes called shards that allow for processing transactions in parallel. In the sharded blockchain, independent transactions are processed and committed in multiple shards concurrently, which improves the blockchain system’s throughput. However, most of the existing sharding protocols [15, 13, 21, 1] do not provide formal analysis for the scheduling time complexity (i.e. how fast the transactions can be processed).

We consider a blockchain system consisting of n nodes, which are further divided into s shards, where each shard consists of n/s nodes. Shards are connected in a graph network Gs with a diameter D, and each shard holds a subset of the objects (transaction accounts). We assume that transactions are distributed across the shards, and each transaction accesses at most k accounts. A transaction Ti initially is in one of the shards, which is called the home shard for Ti. For simplicity, we consider each shard has one transaction at a time, and when that transaction gets processed (either commit or abort), a new transaction will be generated at the home shard. Similar to other sharding systems [11, 1, 2], each transaction Ti is split into subtransactions, where each subtransaction accesses an account. A subtransaction of Ti is sent to the destination shard that holds the respective account. We assume that the maximum distance between the home shard of a transaction and the respective destination shards in Gs is at most dD. (The parameter d is not known to the system.)

All home shards process transactions concurrently. A problem occurs when conflicting transactions try to access the same account simultaneously. In such a case, the conflict prohibits the transactions from being committed concurrently and forces them to serialize [2]. Our proposed scheduling algorithms coordinate the home shards and destination shards to process the transactions (and respective subtransactions) in a conflict-free manner in polynomial time. Each destination shard maintains a local blockchain of the subtransactions that are sent to it. The global blockchain can be constructed (if needed) by combining the local blockchains at the shards [1].

We consider online dynamic transaction scheduling problem instances where transactions are not known a priori. Moreover, transactions may arrive online and continuously over time, which are generated by electronic devices or some crypto app that resides on shards. Our proposed schedulers determine the time step for each transaction Ti𝒯 to process and commit. The execution of our scheduling algorithm is partially synchronous, where communication delay is upper bounded by a system parameter. The goal of a scheduling algorithm is to efficiently process all transactions while minimizing the total execution time (makespan). Unlike previous sharding approaches [11, 13, 21], our scheduling algorithms are lock-free, namely, they do not require locking mechanisms for concurrency control.

We use the notion of competitive ratio [8] to determine the performance of our scheduling algorithms. The competitive ratio typically measures how well a given online algorithm performs compared to the best possible offline algorithm for a specific sequence of operations. However, in our model, the transactions generated in the future depend on the execution history. Hence, we define the competitive ratio to capture the volatile transaction history.

Contributions.

To our knowledge, this is the first work to present provably efficient online transaction scheduling algorithms for blockchain sharding systems. We summarize our contributions as follows (also see Table 1):

  • Stateless Scheduling Model: We first provide transaction scheduling algorithms for the stateless model, where a leader shard that is responsible for coordinating transaction execution, does not require knowledge of the current state of the accessed accounts. In this model, we provide two scheduling algorithms:

    • Single-Leader Scheduler: In this scheduling algorithm, one of the shards acts as the leader and all other shards send their transaction information to this leader, which determines the global transaction schedule. Our algorithm works in a partially synchronous communication model, but for the sake of performance analysis purposes, we assume a synchronous model. Let the shard network be represented as a general graph Gs, where each transaction accesses at most k objects (shards). The maximum distance between home shards, accessed shards, and leader is denoted with d. Then, the single-leader scheduler achieves an O(dmin{k,s}) competitive ratio with respect to the optimal scheduler. In the special case where Gs is a clique with unit distances (i.e., d=1), the competitive ratio becomes O(min{k,s}).

    • Multi-Leader Scheduler: A drawback of the single-leader case is that the distance d involves also the position of the leader. On the other hand, in the multi-leader case, d only involves distances between home and respective destination shards. In this scheduler, multiple leaders process the transactions, which distribute the scheduling load among multiple shards. The multi-leader approach allows for a better adaptation to the value d without requiring knowledge of d and without involving distances to the leaders in the definition of d. This approach uses a hierarchical clustering technique [10] to cluster the shard network, which enables the independent scheduling and commitment of transactions within different clusters. This scheduler achieves a competitive ratio of O(dlog2smin{k,s}).

  • Stateful Scheduling Model: We next consider a stateful model where the leader shard requires knowledge of the account states. Namely, a leader shard receives the transactions from the home shards (where transactions are initially generated), and then the leader shard first gathers the current state of the accounts from their corresponding account shards before scheduling and pre-committing the transactions. After receiving the state, the leader pre-commits the transactions locally and forwards the pre-committed batch to the destination shards. In this model, the single-leader scheduler achieves a competitive ratio of O(min{k,s}) and the multi-leader scheduler achieves a competitive ratio of O(logsmin{k,s}+log2s). Note that these competitive ratios do not depend on d (in contrast to the stateless model), which is the benefit of the stateful approach.

  • Approximation Hardness: We also show that for any ϵ>0, obtaining competitive ratio (min{k,s})1ϵ is NP-hard. Hence, our bound for the stateful single-leader scheduler is asymptotically the best we can achieve in polynomial time, and the bound for the stateful multi-leader scheduler is within a poly-log factor of the best achievable.

  • Safety and Liveness Analysis: We formally analyze the correctness of our proposed schedulers by proving both safety and liveness for the single-leader and multi-leader algorithms.

Paper Organization.

The rest of this paper is structured as follows. Section 2 provides related works. Section 3 describes the preliminaries for this study and the sharding model. Section 4 presents a stateless scheduling model with single-leader and multi-leader scheduling algorithms. In Section 5, we provide the stateful single-leader and multi-leader scheduling algorithms. Finally, we give our conclusions in Section 6. Due to space limitations, some proofs are deferred to the appendix, while additional correctness analyses (safety and liveness) of our algorithms are provided in the extended version available on arxiv [4].

2 Related Work

To solve the scalability issue of blockchain, various sharding protocols [15, 21, 11, 17, 20, 6] have been proposed. These protocols have shown promising enhancements in the transaction throughput of blockchain by processing transactions in parallel in multiple shards. However, none of these protocols have specifically explored the theoretical analysis of online transaction scheduling problems in a sharding environment. To process transactions in parallel in the sharding model, some research work [13, 11] has used two-phase locking for concurrency control. However, locks are expensive because when one process locks shared data for reading/writing, all other processes attempting to access the same data set are blocked until the lock is released, which lowers system throughput. Moreover, locks, if not handled and released properly, may cause deadlocks. Our scheduling algorithms do not use locks, as concurrency control is managed by scheduling non-conflicting transactions in parallel. In [1] the authors propose lockless blockchain sharding using multi-version concurrency control. However, they lack a performance analysis, and they do not explore the benefits of locality and optimization techniques for transaction scheduling.

Table 1: Comparison of our proposed online transaction scheduling algorithm’s competitive ratio with related works [2, 3, 14]. The used notations are as follows: s represents a total number of shards, k denotes the maximum number of shards (objects) accessed by each transaction, d denotes the worst distance between any transaction (home shard) and its accessed objects (destination shard), b denotes the burstiness and c1 represents some positive constant. (Note that the bounds in [2] are the actual transaction latencies).
      Proposed Results       Related works
Stateless Model Stateful Model In [2] In [3, 14]
Problem Dynamic Transaction Dynamic Transaction Dynamic Transaction Batch Transaction
Focus Performance Perofrmance Stability Performance
Single Leader O(dmin{k,s}) O(min{k,s}) 36bdmin{k,s} O(kd)
Multi-Leader O(dlog2smin{k,s}) O(logsmin{k,s}+log2s) 2c1bdlog2smin{k,s} O(kdlogdlogs)
Com. Model Partial-synchronous Partial-synchronous Synchronous Synchronous

In a recent work [2] (see Table 1), the authors provide a stability analysis of blockchain sharding considering adversarial transaction generation. Their focus is on stability, not on performance, where they want to maintain a bounded pending transaction queue size and latency. They consider adversarial transaction generation, where at any time interval of duration t, the number of generated transactions using any object is bounded by ρt+b, where ρ1 is the transaction injection rate per unit time and b>0 is a burstiness injection parameter. They consider stateless scheduling model, and for the single leader scheduler where the shards are connected in the clique graph with unit distance they provide the stable transaction rate ρmax{118k,118s}, for which they show the number of pending transactions at any round is at most 4bs (which is the upper bound on queue size in each shard), and the latency of transactions is bounded by 36bmin{k,s}, If this single leader scheduler is considered in the general graph where the transaction and its accessing object are d far away, then their latency becomes 36bdmin{k,s}. Similarly, for a multi-leader scheduler, they provide a stable transaction injection rate ρ1c1dlog2smax{1k,1s}, where c1 is some positive constant. For this scheduling algorithm, they show the upper bound on queue size as 4bs, and transaction latency as 2c1bdlog2smin{k,s}. However, they consider a synchronous communication model, which is not practical in blockchain, and they also do not provide a theoretical analysis of the optimal approximation for the scheduling algorithm, and they only consider a stateless scheduling model. All their latency bounds depend on the burstiness parameter b, which can be arbitrarily large, as it expresses a transaction injection burst of arbitrary size in any given time interval. On the other hand, our system models do not depend on any burstiness parameter, as we adopt a transaction injection model tuned for performance analysis.

In [3, 14] (see Table 1), the authors presented batch scheduling algorithms (for a given set of transactions) while they did not consider dynamic transaction generation. Moreover, their provided bounds are not tight even for batch processing. Furthermore, their algorithms work on a synchronous communication model, which might not be applicable in a real-world distributed blockchain network. The authors in [14] only consider single leader algorithms and have worse performance complexity bounds than [3] by a factor of logD, resulting in a complexity of O(kdlogD) whereas [3] achieves O(kd) approximation for batch transactions. Here, we provide efficient scheduling algorithms with theoretical analysis for dynamic transaction processing in a blockchain sharding system that works in the partially synchronous communication model.

Several works have been conducted on transaction scheduling in shared memory multi-core systems, distributed systems, and transactional memory [7, 8]. In [5, 18, 19], the authors explored transaction scheduling in distributed transactional memory systems aimed to achieve better performance bounds with low communication costs. In [7] they provide offline scheduling for transactional memory, where each transaction attempts to access an object, and once it obtains the object, it executes the transaction. In another work [8], the authors extended their analysis from offline to online scheduling for the transactional memory in a synchronous communication model. However, these works do not address transaction scheduling problems in the context of blockchain sharding. This is because, in the transactional memory model, the considered system models assume that objects are mobile, and once a transaction obtains the object, it immediately executes the transaction. In contrast, in blockchain sharding, an object is static in a shard, and there is a confirmation scheme to confirm and commit each subtransaction consistently in the respective shard.

3 Technical Preliminaries

3.1 Blockchain Sharding Model

We consider a blockchain sharding model similar to [11, 1, 2, 3], consisting of n nodes which are partitioned into s shards S1,S2,,Ss such that Si{1,,n}, for ij, SiSj=, n=i|Si|, and ni=|Si| denotes the number of nodes in shard Si. Let Gs=(V,E,w) denote a weighed graph of shards, where V={S1,S2,,Ss}, the edges E correspond to the connections between the shards, and the weight function w represents the distance between the shards. The graph Gs is complete, since each pair of shards can communicate directly, but the weights of the edges may be non-uniform.

Each shard maintains a local blockchain (which is part of the global blockchain) according to its local accounts and the subtransactions it receives and commits. We use fi to represent the number of Byzantine nodes in shard Si. To guarantee consensus on the current state of the local blockchain, we assume that every shard executes the PBFT [9] or a similar consensus algorithm. In order to achieve Byzantine fault tolerance, we assume each shard Si consists of ni>3fi nodes.

We assume that shards communicate with each other via message passing [11], and here, we are not focusing on optimizing the message size. We adopt the cluster-sending protocol described in [12] and Byshard [11], where shards run consensus (e.g., the PBFT [9] consensus algorithm within the shard) before sending a message. For communication between shards S1 and S2, a set A1S1 of f1+1 nodes in S1 and a set A2S2 of f2+1 nodes in S2 are chosen (where fi is the number of faulty nodes in shard Si). Each node in A1 is instructed to broadcast the message to all nodes in A2. Thus, at least one non-faulty node in S1 will send the correct message value to a non-faulty node in S2. (Actually, A1 needs to have size 2f1+1 to distinguish the correct message.) We consider a partial-synchronous communication model, where sending messages for transactions to their accessing shards has a bounded delay.

Suppose we have a set of shared accounts 𝒪 (which we also call objects). Similar to previous works in [11, 1, 2], we assume that each shard is responsible for a specific subset of the shared objects (accounts). To be more specific, 𝒪 is split into disjoint subsets 𝒪1,,𝒪s, where the set of accounts under the control of shard Si is represented by 𝒪i. Every shard Si keeps track of local subtransactions that use its corresponding objects in 𝒪i.

3.2 Transactions and Subtransactions

Similar to the works in [11, 2, 3], we consider transactions {T1,T2,} that are distributed across different shards. Suppose that transaction Ti is generated in a node vTi within the system, then the home shard of Ti is the shard containing vTi. In this work, we consider transactions that are continuously generated over time. For simplicity and to attain a performance analysis, we assume that each home shard contains at most one transaction at any moment of time, and after the transaction gets processed (either commits or aborts), a new transaction is generated on that home shard.

Similar to work in [11, 1, 2], we define a transaction Ti as a group of subtransactions Ti,a1,,Ti,aj. Each subtransaction Ti,al accesses objects only in 𝒪al and is associated with shard Sal. Therefore, each subtransaction Ti,al has a respective destination shard Sal. The home shard sends the transaction Ti to the leader shard S, which is responsible for processing transaction Ti. Then the leader shard of Ti sends subtransaction Ti,al to shard Sal for processing, where it is appended to the local blockchain of Sal. The subtransactions within a transaction Ti are independent, meaning they do not conflict and can be processed concurrently. An example of transactions and subtransactions is deferred to Appendix A.1.

3.3 Stateless and Stateful Scheduling Models

We define two scheduling models to schedule and process the transactions, the stateless and stateful models, which we describe as follows.

Stateless Scheduling Model.

Let’s suppose there is a designated leader shard S that coordinates the scheduling and processing of transactions. In this model, the leader shard S does not maintain the current state of accounts accessed by the transactions [2, 3, 11]. Upon receiving transactions, S constructs (or extends) a transaction conflict graph and colors the graph using an incremental greedy vertex coloring algorithm to determine the commit order for each transaction. Then the leader S splits each transaction into subtransactions based on accessed accounts and sends them to the corresponding destination shards that hold the relevant account states. Each destination shard maintains the scheduled subtransactions queue schdq and it picks one color subtransaction from the head of schdq, validates the sub-transactions (e.g., checking account balances) and sends a commit or abort vote to the leader. After collecting all votes for a transaction, the leader sends a final decision to each destination shard, which either commits or aborts the subtransactions according to the message received from the leader shard.

For example, suppose S receives transactions T1,T2,T3, each accessing accounts a,b,c, located in shards Sa,Sb,Sc respectively (see Figure 1 (a)). The leader constructs a conflict graph G𝒯 and applies a greedy vertex coloring algorithm to define a commit order. It then splits transactions into sub-transactions:

T1{T1,a,T1,b,T1,c},T2{T2,a,T2,b,T2,c},T3{T3,a,T3,b,T3,c}

Each destination shard queues the received sub-transactions in a schedule queue schdq according to the commit order received from S, and it processes one color subtransaction at a time. This means Sa picks T1,a, Sb picks T1,b and Sc picks T1,c from head of their queues, check the validity and condition of the subtransaction (such as account balance) and send either commit or abort votes to the leader shard. Then the transaction Ti and its subtransactions (T1,a,T1,b and T1,c) are committed or aborted based on the final decision received from the leader shard. Next, each destination shard processes the next color subtransactions, for instance T2,a, from Sa, T2,b from Sb, and T2,c from Sc (see Figure 1 (a)), and this process repeats.

Refer to caption
Figure 1: Illustration of Stateless (a) and Stateful (b) Scheduling Models.
Stateful Scheduling Model.

In the stateful model, the home shard where a transaction is initially generated sends its transaction information to the leader shard S. Then the leader shard S stores these transactions (i.e. 𝒯1,T2,T3) in its pending transaction queue PQ. Then, the leader shard identifies accounts accessed by transactions and requests their state from corresponding shards Sa,Sb,Sc. In other words, before processing the transactions, the leader collects the current state of all accessed accounts from the corresponding destination shards. Once the account states are gathered, the leader constructs a conflict graph on which it applies the incremental greedy vertex coloring algorithm. Then the leader shard performs local pre-commit for valid transactions (e.g., T1, T3) and aborts invalid transactions (e.g., T2). After that, S creates the pre-committed sub-transaction batches: Sa:{T1,a,T3,a},Sb:{T1,b,T3,b},Sc:{T1,c,T3,c} for each destination shard Sa, Sb, Sc. Then these pre-committed batches are sent to the respective destination shards. Since the transactions have already been validated, each destination shard can directly commit and append the received pre-committed order to its local blockchain without further interaction with the leader.

The main difference between the stateless and stateful model is that the stateful model requires the leader to be updated about account states which are at remote shards, while the stateless model does not require leader to be informed about remote accounts.

3.4 Conflicts and Competitive Ratio

Two transactions conflict if they attempt to access the same account, and at least one of the two updates the account. The subtransactions are processed sequentially at each destination shard. For this reason, we extend the notion of conflict to all transactions that access account in the same destination shard.

Definition 1 (Conflict).

Transactions Ti and Tj are said to conflict if they access accounts on the same destination shard Sk and at least one of these transactions writes (updates) the account in Sk.

Transactions that conflict should be processed in a sequential manner to guarantee atomic object update. In such a case, their respective subtransactions should be serialized in the exact same order in every involved shards. To resolve the conflict between two transactions Ti and Tj while accessing the same destination shard Sk, a scheduler must schedule them one after another in such a way that Ti commits before Tj or vice versa. To perform the schedule, we use a conflict graph such that the nodes are transactions, and an edge represents a conflict between two transactions.

We continue with the definition of competitive ratio for our scheduling models. The definition below is an adaptation of the competitive ratio used in dynamic execution in software transactional memory [8]. Since the future transactions depend on the past execution, we define the competitive ratio based on any set of transactions that may appear at any moment of time. Consider a transaction schedule S. Let 𝒯t denote the set of all pending transactions (that have not committed or aborted) at time t. Let t>t denote the time such that all transactions in 𝒯t finalize (commit or abort). Let τ denote the optimal time duration to finalize (commit or abort) all the transactions in 𝒯t if they were the only transactions in the system, processed as a batch. The approximation ratio for S at time t is rS(t)=(tt)/τ. The competitive ratio for S is rS=suptrS(t).

Definition 2 (Algorithm Competitive Ratio).

For online scheduling algorithm 𝒜, the competitive ratio r𝒜 is the maximum competitive ratio over all possible schedules 𝒮 that it produces, r𝒜=supS𝒮rS. (We also say that 𝒜 is r𝒜-competitive.)

4 Stateless Scheduler

In this section, we consider the stateless sharding model [1, 2, 11], where the leader shard is responsible for coordinating transaction processing and does not gather the current state of account information (see Section 3.3). We present two transaction scheduling algorithms: the Single-Leader Scheduler and the Multi-Leader Scheduler.

4.1 Stateless Single-Leader Scheduler

In this section, we describe and analyze the Stateless Single-Leader Scheduler, which operates under a partially synchronous communication model. We assume a designated leader shard S responsible for determining the transaction schedule. All shards send their transactions to the leader shard, which builds a transaction conflict graph and applies an incremental greedy vertex coloring algorithm to determine a schedule.

Algorithm 1 Stateless Single Leader Scheduler.

The algorithm follows an event-driven approach to schedule and process the transactions. When a new transaction Ti is generated at its home shard Si, then the home shard tags the current timestamp to the transaction Ti and sends the transaction to the leader shard S. Upon receiving Ti, the leader adds it to the local transaction set 𝒯 and extends the conflict graph G𝒯 with this new transaction (Ti). If Ti is older than any already-colored but uncommitted transactions (say Tx), the leader cancels the color of those newer transactions, notifies the relevant shards, and reprocesses them later. This ensures older transactions are prioritized, avoiding starvation. The leader then runs an incremental greedy vertex coloring algorithm [8] to assign colors to all newly received transactions, without modifying the colors of already scheduled old transactions. This ensures that the processing time of already scheduled transactions is not affected by newly generated transactions. Note that a newer transaction might receive a lower color than an older one because the new one does not conflict with any other transaction (except one old transaction), while the old transaction conflicts with others as well. To prevent this and ensure a fair execution order, we assign each new transaction a color no lower than the smallest color among pending old transactions. This approach guarantees progress because at each time step, the lowest possible color will increase over time. After coloring and determining the schedule, each transaction is then split into subtransactions Ti,j based on the destination shards it accesses, and these subtransactions are sent to the corresponding shards Sj for processing.

Each destination shard Sj maintains a local scheduled queue schdq (consisting of subtransactions that have been scheduled but not yet committed) and appends incoming subtransactions into schdq, which stores subtransactions in the order of their assigned color. To handle partial synchrony, each destination shard Sj uses a busy flag to track whether it is currently processing (in-transit and not committed yet) a subtransaction. If the shard is not busy, it picks one subtransaction from the head of the queue and validates it (e.g., checking conditions like account balance). If the subtransaction is valid, the shard Sj sends a commit vote to the leader S; otherwise, it sends an abort vote. Once the leader shard receives votes from all relevant destination shards for a transaction Ti, it decides whether the transaction should be committed or aborted. If all subtransactions vote to commit, the leader sends a confirmed commit to each destination shard; otherwise, if any one of the shard send an abort vote, it sends a confirmed abort. After the decision, the transaction Ti is removed from the conflict graph G𝒯 and the transaction set 𝒯, and the outcome (committed or aborted) is sent to the home shard of Ti.

Upon receiving the confirmed decision, each destination shard either commits the subtransaction by appending it to the local blockchain or aborts it. If the scheduled queue is not empty, the shard continues processing the next subtransaction. If the queue becomes empty, the shard marks itself as not busy. Finally, upon receiving the outcome from the leader, the home shard generates a new transaction and repeats the process. This single leader scheduling approach ensures conflict-free execution while preserving consistency and fairness in transaction processing across shards.

Performance Analysis of Single-Leader Scheduler (Algorithm 1).

Our proposed scheduling algorithm works on a partial-synchronous communication model; for the sake of performance analysis only, we consider the synchronous communication mode. In the following, we analyze the time units required to process transactions by Algorithm 1. We are focusing on the time period after the leader shard has determined the schedule for the transactions. In the synchronous case, a time unit is the time to send a message along an edge of unit weight. In the single-leader case, d is sensitive to the position of the leader and d denotes the maximum distance between any of the involved shards (home, destination shards, leader shard). In the multi-leader case, the distance to the leaders is not included in the definition of d.

Theorem 3.

[General Graph] In the General graph, where the transactions, their accessing objects, and the leader are at most d distance away from each other, Algorithm 1 has O(dmin{k,s}) competitive ratio.

Proof.

Consider a set of transactions 𝒯 generated at or before time t that are still pending (neither committed nor aborted) at time t. Let G𝒯 denote the conflict graph for 𝒯, where two transactions conflict if they have a common destination shard. Since we use greedy coloring to color G𝒯, the number of distinct colors assigned to the transactions in 𝒯 depends only on the coloring of G𝒯, and not on the colors of the transactions that have been finalized (committed or aborted) before t. (This holds since transactions in 𝒯 may use smaller colors of transactions committed before t.)

Let li denote the number of transactions in 𝒯 that use objects in shard Si. Let l=maxli. We have that l is a lower bound on the time that it takes to finalize (commit or abort) the transactions in 𝒯, since at least l subtransactions need to serialize in a destination shard.

First, consider the case where ks. We have that each transaction conflicts with at most kl other transactions. Hence G𝒯 can be colored with at most kl+1 colors. The distance between a transaction (home shard) and its accessing objects(destination shards) is at most d away, and to commit subtransactions after being scheduled, Algorithm 1 takes 3 steps of interactions (for each color) between the leader shard and the destination shard. This means each color corresponds to the 3d time units. Thus, it takes at most (kl+1)3d=O(kld) time units to confirm and commit the transactions. Hence, for transactions 𝒯, the approximation of their finalization time is O(kld/l)=O(kd).

Next, consider the case k>s. We can write 𝒯=AB, where A are the transactions which access at most s destination shards, while B are the transactions which access more than s destination shards. Each transaction in A conflicts with at most ls other transactions. Hence, the transactions in A need at most ls+1 distinct colors. The transactions in B can be serialized, requiring at most |B| distinct colors. Hence, the conflict graph GT can be colored with at most ls+1+|B| colors, which implies a schedule of length O(d(ls+|B|)) steps to finalize the transactions 𝒯. Since each transaction in B accesses more than s shards, there is a shard accessed by more than (|B|s)/s=|B|/s transactions. Thus, l>|B|/s. Hence, for transactions 𝒯, the approximation of their finalization time is O(d(ls+|B|)/l)=O(ds+d|B|/l)=O(ds+ds)=O(ds).

Therefore, combining the approximations for the cases ks and k>s, we have that the combined approximation for the finalization time for 𝒯 is O(dmin{k,s}). Since t is chosen arbitrarily, we have that the competitive ratio of Algorithm 1 is O(dmin{k,s}).

Suppose that shards are connected in a clique graph with unit distance, where every shard is connected to every other shard with unit distance. So in this case d=1. Then from Theorem 3, Algorithm 1 has an O(min{k,s}) competitive ratio for a clique graph with unit distance. Thus, we have:

Corollary 4 (Unit Distance Clique).

Algorithm 1 has an O(min{k,s}) competitive ratio for a clique graph with unit distance.

We continue to show that it is an NP-hard problem to approximate the optimal transaction schedule. Thus, the provided bound in Corollary 4, is the best we can do with a polynomial time scheduling algorithm. The result below applies to both the stateful and stateless model.

Theorem 5.

For all ϵ>0, it is an NP-hard problem to produce a transaction schedule that achieves a competitive ratio (min{k,s})1ϵ.

The proof of Theorem 5 is deferred to Appendix A.3.

4.2 Multi-Leader Scheduler

This section provides the multi-leader scheduler where multiple leaders schedule and process the transactions, distribute the congestion, and load across different leaders. The multi-leader approach allows adaptation to the value d without requiring knowledge of d. Also, here the value d depends only on the maximum distance between the home and destination shards (without involving distances to the leaders). Therefore, the value of d captures better the locality of the transactions, and the resulting schedule allows for shorter messages between home and destination shards. The concepts that we introduce for this algorithm will play a key role for the development of the stateless multi-leader algorithm.

4.2.1 Shard Clustering

In the multi-leader scheduler, shards are distributed across the network, and the distance between the home shard of the transaction and its accessing objects (destination shards) ranges from 1 to D, where D is the diameter of the shard graph. Let us suppose shards graph Gs constructed with s shards, where the weights of edges between shards denote the distances between them. We consider that Gs is known to all the shards. We define z-neighborhood of shard Si as the set of shards within a distance of at most z from Si. Moreover, the 0-neighborhood of shard Si is the Si itself.

We consider that our multi-leader scheduling algorithm uses a hierarchical decomposition of Gs which is known to all the shards and calculated before the algorithm starts. This shard clustering (graph decomposition) is based on the clustering techniques in [10] and which were later used in [18, 8, 2]. We divide the shard graph Gs into the hierarchy of clusters with H1=logD+1 layers (logarithms are in base 2), and a layer is a set of clusters, and a cluster is a set of shards. Layer q, where 0q<H1, is a sparse cover of Gs such that: (i) Every cluster of layer q has (strong) diameter of at most O(2qlogs). (ii) Every shard participates in no more than O(logs) different clusters at layer q. (iii) For each shard Si there exists a cluster at layer q which contains the (2q1)-neighborhood of Si within that cluster.

For each layer q, the sparse cover construction in [10] is actually obtained as a collection of H2=O(logs) partitions of Gs. These H2 partitions are ordered as sub-layers of layer q labeled from 0 to H21. A shard might participate in all H2 sub-layers but potentially belongs to a different cluster at each sub-layer. At least one of these H2 clusters at layer q contains the whole 2q1 neighborhood of Si.

In each cluster at layer q, a leader shard S is specifically designated such that the leader’s (2q1)-neighborhood is in that cluster. As we give an idea of layers and sub-layers, we define the concept of height as a tuple h=(h1,h2), where h1 denotes the layer and h2 denotes the sub-layer. Similar to [18, 8, 2], heights follow lexicographic order.

The home cluster for each transaction Ti is defined as follows: suppose Si is the home shard of Ti, and z is the maximum distance from Si to the destination shards that will be accessed by Ti; the home cluster of Ti is the lowest-layer (and lowest sub-layer) cluster in the hierarchy that contains z-neighborhood of Si. Each home cluster consists of one dedicated leader shard, which will handle all the transactions that have their home shard in that cluster (i.e., transaction information will be sent from the home shard to the cluster leader shard to determine the schedule). An example of hierarchical clustering is presented in Appendix A.4.

4.2.2 Stateless Multi-Leader Scheduler

We consider a hierarchical clustering of the shard graph Gs, which is assumed to be globally known by all shards. Each cluster C in this hierarchy is characterized by a unique height (q,r) which corresponds to its layer q and sublayer r, and each cluster C has its designated leader shard S. The leader shard is responsible for scheduling and coordinating the processing of all transactions whose home cluster is C. Each home shard Si maintains a local timestamp ts to tag newly generated transactions. Additionally, each destination shard Sj maintains a local scheduling queue schdq and lexicographically orders for the incoming subtransactions using the tuple (ts,q,r,color), where color is an integer assigned to the transaction by the leader shard S through vertex coloring. Algorithm 2 invokes Algorithm 1 in each cluster C to process their transactions.

Algorithm 2 works in a partially synchronous model and follows an event-driven execution by message passing. When a new transaction Ti is generated at its home shard Si, then the home shard Si determines the lowest cluster C at height (q,r) that includes both Si and all of the destination shards accessed by Ti. Moreover, the transaction is tagged with its local timestamp ts, along with the cluster identifiers q and r, and is then sent to the cluster’s leader shard S.

Upon receiving new transaction(s) Ti, the leader shard S of cluster C invokes Algorithm 1 to process their transactions, where leader shard S adds Ti to the transaction set 𝒯C of cluster C and updates the corresponding transaction conflict graph G𝒯C to incorporate the new transaction Ti. Then the leader shard uses an incremental greedy vertex coloring algorithm [8] to assign a color only to the newly received transaction without affecting already colored (scheduled) transactions. Once colored, the transaction is split into subtransactions Ti,j, and sent to the respective destination shard Sj.

Since multiple leader shards process their transactions concurrently by invoking the Algorithm 1, destination shards may receive the subtransactions from different clusters simultaneously. To handle this, we modify the parameters and processing technique of Algorithm 1 as follows: each destination shard Sj maintains a scheduled subtransactions queue schqd, which is ordered lexicographically by the tuple (ts,q,r,color). The additional parameters (ts,q,r) denote the timestamp ts, and hierarchical cluster heights (layers q and sublayers r) in the shard graph Gs. Moreover, each destination shard Sj processes its subtransactions from the head of schdq following the steps in Algorithm 1 with the modified ordering criteria.

Algorithm 2 Stateless Multi-Leader Scheduler.

Additionally, if the destination shard is busy and receives a new subtransaction Ti,j such that ts(Ti,j)<ts(Ti,j) in lexicographic order, this means Ti,j has a higher priority where Ti,j is the currently processed (but not committed) subtransaction, then the shard give priority to Ti,j by sending an ignore Ti,j message to its leader, indicating that a higher-priority transaction (subtransaction Ti,j) should proceed first. Then, when the leader receives an ignore Ti,j message for a subtransaction Ti,j and the decision for Ti has not yet been made (i.e., not all votes have been received), the leader discards the vote from Sj and replies with an ignored Ti,j message to the destination shard Sj. If the decision has already been made (i.e, confirm commit or confirm abort) by the leader shard, then no further action is taken for particular subtransaction Ti,j at the leader shard S. After the destination shard Sj receives an ignored message for Ti,j, then it reinserts Ti,j into the scheduled queue, reorders the queue lexicographically, and resumes processing from the head.

Finally, when the home shard Si receives the final outcome of its transaction Ti, it generates a new transaction and sends it to the corresponding cluster leader shard, and the process repeats. This multi-leader scheduling framework ensures conflict-free and consistent execution by leveraging lexicographic ordering over the tuple (ts,q,r,color), and maintains the fairness and parallelism across shards in the presence of partial synchrony.

Performance Analysis of Stateless Multi-Leader Scheduler.

The multi-leader scheduler is the extended version of the single-leader scheduler (Algorithm 3) while introducing an additional overhead cost due to its shard (hierarchical) clustering structure and comes from the layers and sublayers of the clusters.

Theorem 6.

In Multi-leader scheduler (Algorithm 2), where the transactions and their accessing objects are at most d distance away from each other, Algorithm 2 has O(dlog2smin{k,s}) competitive ratio.

The proof of Theorem 6 is deferred to Appendix A.6.

5 Stateful Scheduler

In this scheduler model, the leader shard gathers all of the transactions and the current states of the accessing accounts and pre-commits the transactions at the leader. After that, the leader creates the pre-committed subtransactions batch and sends that batch to the respective destination shard, where each destination shard reaches a consensus on the received subtransaction order and adds it to their local blockchain. We provide two stateful scheduling algorithms, one with a single leader and the other with multiple leaders.

5.1 Stateful Single-Leader Scheduler

We present and analyze the stateful single-leader scheduler, where one of the shards is considered as the leader S, which is responsible for scheduling and processing all the transactions.

When a new transaction Ti is generated at its home shard Si, Si sends Ti to the leader shard S. Upon receipt, S appends Ti to its local pending queue PQ. Scheduling event is triggered periodically, either every 4λ time units or upon processing transactions associated with λ distinct colors. Here, λ denotes the worst-case communication delay between any two shards, which is at most the diameter of the shard communication graph Gs. The 4λ bound accounts for the communication delays involved in acquiring state information from remote shards and completing the pre-commitment phase and sending the pre-committing batch to the destination shard.

Algorithm 3 Steteful Single Leader Scheduler.

When the scheduling event is triggered, the leader shard moves its pending transactions from PQ into the scheduling transaction set 𝒯 and identifies the set of accounts 𝒪v accessed by transactions which are in T. If any account state Oj𝒪v is not locally available at S, it determines the responsible destination shard Sj for each such account, and sends batched account state requests to the corresponding shards. If all required states are already available in S, an internal State-Ready (i.e. already available locally) event is triggered immediately.

Upon receiving a state request, each destination shard Sj responds with the current state of the requested accounts (e.g., balances). Then, once all necessary account states are collected at S, it extends the conflict graph G𝒯 by incorporating the new transactions in 𝒯 Then the leader shard S runs the incremental greedy vertex coloring algorithm [8] on G𝒯 and assigns at most ζ colors without altering the coloring of previously scheduled old transactions.

The leader then iteratively processes transactions color by color. For each color group ζc, S verifies transaction conditions (e.g., sufficient balance) using the up-to-date account state it gathers. Transactions that are valid and conditions are satisfied are pre-committed, while invalid ones are aborted. Then S splits each pre-committed transaction Ti into subtransactions Ti,j based on its accessed shards. These subtransactions are appended to a corresponding pre-commit batch PrecommitSubTxnBatch(Sj) for each destination shard Sj. After processing a transaction, it is removed from 𝒯 and G𝒯, and the outcome (committed or aborted) is reported back to the transaction’s home shard Si to initiate the next transaction.

The pre-commitment phase terminates once λ colors are processed, after which S dispatches all PrecommitSubTxnBatch(Sj) batches to their respective destination shards in parallel. Each destination shard Sj then reaches consensus on the order of subtransactions in the received batch and appends them to its local blockchain. The leader shard S, then waits and proceeds to the next scheduling batch.

Performance Analysis of Stateful Single-Leader Scheduler.

In the following, we analyze the time unit required to process transactions by Algorithm 3. We focus on the special case where the maximum distance between the transactions, their accessing objects, and the leader is at most d, and at least one transaction accesses objects at a distance Ω(d). This special case is useful for the analysis of the multi-leader case. We are focusing on the time period after the leader shard has determined the schedule for the transactions. This is because the scheduling and committing steps are executed in parallel.

Theorem 7.

[General Graph] In the General graph, where the transactions, their accessing objects, and the leader are at most d distance away from each other, and at least one transaction is Ω(d) distance from the accessing shards, Algorithm 3 has O(min{k,s}) competitive ratio.

The proof of Theorem 7 is deferred to Appendix A.7.

5.2 Stateful Multi-Leader Scheduler

We present a stateful multi-leader scheduler in which multiple leader shards are responsible for scheduling and processing transactions. In the single-leader algorithm, the value d includes the distance to the leader, but in the multi-leader, d does not include the relative distance to the leader. This allows the multi-leader algorithm to capture better the locality of transactions, allowing for shorter distance messages between the involved home and destination shards.

The system assumes a hierarchical cluster decomposition [10] of the shard graph Gs, which is globally known to all shards. Each cluster C(q,r) in the hierarchy is associated with a leader shard S, a pending transaction queue PQ, a scheduled transaction set 𝒯, and a transaction conflict graph G𝒯. The parameter λC denotes the worst-case communication delay between any two shards within the cluster C, which arises from the assumption of a partially synchronous communication model.

In multi-leader scheduling Algorithm 4 (see in Appendix A.9), when a transaction Ti is generated at its home shard Si, the shard identifies the lowest cluster C(q,r) that contains all the shards accessed by Ti, and then forwards Ti to the leader shard S of of cluster C. The leader shard S appends the received transaction to its pending queue PQ. Periodically, the leader checks if either 4λC time units have elapsed since the last scheduling event or if λC colors of scheduled transactions have been processed by S. If either condition is met and the leader holds the scheduleControl, it invokes the single-leader scheduler (Algorithm 3) on its local structures (PQ,𝒯,G𝒯,λC) to process transactions.

The scheduling control, denoted by the boolean flag scheduleControl, determines which cluster can perform scheduling operations at a given time unit. The control flows hierarchically between parent and child clusters. A parent cluster of C is any cluster at a higher level in the hierarchy (with height (q,r)>(q,r)) that shares at least one shard with C. Similarly, a child cluster of C is a lower-level cluster (with height (q′′,r′′)<(q,r)) that overlaps with C. Clusters may have multiple parents and children. If C is at the bottom-most level (height (0,0)), initially it has scheduleControl. Otherwise, it must request control from all its children. Once all children respond the scheduleControl, the leader S sets scheduleControl to true and proceeds with the scheduling.

After executing the single-leader scheduler, if the parent cluster C requests control, the leader transfers scheduleControl to the parent and sets it to false locally. If instead a child cluster C′′ has made a request, the control is passed down to the child. If there are no remaining transactions to process, the control is also passed downward to allow lower-level clusters to schedule pending transactions. If the leader does not have scheduleControl when scheduling should occur, it sends a control request to the current holder (parent or child). Additionally, if C receives a control request from a parent C while not holding control, it forwards the request to its children. Once all children respond positively, it passes control up to C. This hierarchical and event-driven mechanism ensures coordinated and conflict-free scheduling across multiple levels of the cluster hierarchy.

Theorem 8.

In Multi-leader scheduler (Algorithm 4), where the transactions and their accessing objects are at most d distance away from each other, Algorithm 4 has O(logsmin{k,s}+log2s) competitive ratio.

The proof of Theorem 8 is deferred to Appendix A.8.

6 Conclusion

We presented efficient scheduling algorithms for processing dynamic transactions in blockchain sharding systems. Our proposed framework operates under a partially synchronous communication model, which realistically captures the behavior of many real-world blockchain environments. We introduced both stateless and stateful scheduling models, each of which includes single-leader and multi-leader algorithms for transaction scheduling and processing. For these algorithms, we provided competitive ratios relative to an optimal scheduler and established both upper and lower bounds on the scheduling delay.

For future work, we plan to explore efficient inter-shard communication mechanisms, particularly under conditions of network congestion where communication links have bounded capacity. We also aim to conduct extensive simulations and real-world experiments to evaluate the practical performance of our proposed protocols.

References

  • [1] Ramesh Adhikari and Costas Busch. Lockless blockchain sharding with multiversion control. In International Colloquium on Structural Information and Communication Complexity, pages 112–131. Springer, 2023. doi:10.1007/978-3-031-32733-9_6.
  • [2] Ramesh Adhikari, Costas Busch, and Dariusz Kowalski. Stable blockchain sharding under adversarial transaction generation. In Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, 2024.
  • [3] Ramesh Adhikari, Costas Busch, and Miroslav Popovic. Fast transaction scheduling in blockchain sharding. arXiv preprint arXiv:2405.15015, 2024. doi:10.48550/arXiv.2405.15015.
  • [4] Ramesh Adhikari, Costas Busch, and Miroslav Popovic. On the efficiency of dynamic transaction scheduling in blockchain sharding. arXiv preprint arXiv:2508.07472, 2025.
  • [5] Hagit Attiya, Vincent Gramoli, and Alessia Milani. Directory protocols for distributed transactional memory. Transactional Memory. Foundations, Algorithms, Tools, and Applications: COST Action Euro-TM IC1001, pages 367–391, 2015. doi:10.1007/978-3-319-14720-8_17.
  • [6] Zeta Avarikioti and Dimitris Karakostas. Harmony technical report, 2022.
  • [7] Costas Busch, Maurice Herlihy, Miroslav Popovic, and Gokarna Sharma. Fast scheduling in distributed transactional memory. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 173–182, 2017. doi:10.1145/3087556.3087565.
  • [8] Costas Busch, Maurice Herlihy, Miroslav Popovic, and Gokarna Sharma. Dynamic scheduling in distributed transactional memory. Distributed Computing, 35(1):19–36, 2022. doi:10.1007/S00446-021-00410-W.
  • [9] Miguel Castro, Barbara Liskov, et al. Practical byzantine fault tolerance. In OsDI, volume 99, pages 173–186, 1999.
  • [10] Anupam Gupta, Mohammad T Hajiaghayi, and Harald Räcke. Oblivious network design. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 970–979, 2006.
  • [11] Jelle Hellings and Mohammad Sadoghi. Byshard: Sharding in a byzantine environment. Proceedings of the VLDB Endowment, 14(11):2230–2243, 2021. doi:10.14778/3476249.3476275.
  • [12] Jelle Hellings and Mohammad Sadoghi. The fault-tolerant cluster-sending problem. In Foundations of Information and Knowledge Systems: 12th International Symposium, FoIKS 2022, Helsinki, Finland, June 20–23, 2022, Proceedings, pages 168–186. Springer, 2022. doi:10.1007/978-3-031-11321-5_10.
  • [13] Eleftherios Kokoris-Kogias, Philipp Jovanovic, Linus Gasser, Nicolas Gailly, Ewa Syta, and Bryan Ford. Omniledger: A secure, scale-out, decentralized ledger via sharding. In 2018 IEEE Symposium on Security and Privacy (SP), pages 583–598. IEEE, 2018. doi:10.1109/SP.2018.000-5.
  • [14] Ao Liu, Jing Chen, Kun He, Ruiying Du, Jiahua Xu, Cong Wu, Yebo Feng, Teng Li, and Jianfeng Ma. Dynashard: Secure and adaptive blockchain sharding protocol with hybrid consensus and dynamic shard management. IEEE Internet of Things Journal, 2024.
  • [15] Loi Luu, Viswesh Narayanan, Chaodong Zheng, Kunal Baweja, Seth Gilbert, and Prateek Saxena. A secure sharding protocol for open blockchains. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 17–30, 2016. doi:10.1145/2976749.2978389.
  • [16] Ahmed Afif Monrat, Olov Schelén, and Karl Andersson. A survey of blockchain from the perspectives of applications, challenges, and opportunities. Ieee Access, 7:117134–117151, 2019. doi:10.1109/ACCESS.2019.2936094.
  • [17] A Secure. The zilliqa project: A secure, scalable blockchain platform, 2018.
  • [18] Gokarna Sharma and Costas Busch. Distributed transactional memory for general networks. Distributed computing, 27(5):329–362, 2014. doi:10.1007/S00446-014-0214-7.
  • [19] Gokarna Sharma and Costas Busch. A load balanced directory for distributed shared memory objects. Journal of Parallel and Distributed Computing, 78:6–24, 2015. doi:10.1016/J.JPDC.2015.02.002.
  • [20] Alex Skidanov and Illia Polosukhin. Nightshade: Near protocol sharding design. URL: https://nearprotocol. com/downloads/Nightshade. pdf, 39, 2019.
  • [21] Mahdi Zamani, Mahnush Movahedi, and Mariana Raykova. Rapidchain: Scaling blockchain via full sharding. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS ’18, pages 931–948, New York, NY, USA, 2018. Association for Computing Machinery. doi:10.1145/3243734.3243853.
  • [22] David Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, STOC ’06, pages 681–690, New York, NY, USA, 2006. Association for Computing Machinery. doi:10.1145/1132516.1132612.

Appendix A Appendix

A.1 Example of Transaction and Subtransactions

Suppose transaction T1 is: “Transfer 100 coins from account A to account B”. Let us assume that the accounts of A and B reside on different shards Sa and Sb, respectively. T1 splits into the following subtransactions:

  • T1,a in Sa: Condition: Check if account A has at least 100 coins.
          Action: Deduct 100 coins from account A.

  • T1,b in Sb: Action: Add 100 coins to account B.

A.2 Correctness Analysis of Stateless Single-Leader Scheduler

Our proposed scheduling algorithm works on a partial-synchronous communication model; for the sake of analysis only, we consider the synchronous communication mode.

Lemma 9 (Safety).

If two transactions conflict with each other in Algorithm 1, then they will commit in different time slots, and the local chain produced by Algorithm 1 ensures blockchain serialization.

Proof.

We prove this by induction (analyzing) the execution of Algorithm 1, where each home shard sends its transaction to the leader shard (Line 4), and the leader shard constructs the transaction conflict graph G𝒯 (Line 6). Then the leader used the incremental greedy vertex coloring algorithm [8] on the conflict graph G𝒯 (Line 8). As conflicting transactions share an edge in G𝒯, they are assigned different colors and are processed in different time slots, which provides the valid commit order. Moreover, each color corresponds to a unique serialization time slot. The leader shard splits the transaction into subtransactions and sends them to the destination shard after coloring (see Line 9), then each destination shard keeps that ordering in the schedule queue (schdq) and process subtransactions one by one according to the color they get (see Line 11-14), which guarantees the consistent schedule order in each shard. Moreover, the leader shard coordinates to commit the subtransactions in each destination shard, which ensures the consistent commitment (see Line 16-17). As the subtransactions are committed according to the color they receive, and each color corresponds to a globally consistent time slot, this provides global serialization.

Lemma 10 (Liveness).

Algorithm 1 guarantees that every generated transaction will eventually be either committed or aborted.

Proof.

We prove liveness by induction, showing that every transaction Ti is either committed or aborted in finite time. Each new transaction Ti is sent to a leader shard S (Line 4), which adds it to the set 𝒯 and the conflict graph G𝒯. If Ti is older than any already colored but not committed transaction Tx, the algorithm cancels the color of Tx and re-colors the graph (Line 7). Coloring is performed incrementally (Line 8) and preserves the colors of previously scheduled transactions. Thus, older transactions are always prioritized, and no transaction is indefinitely prevented from being scheduled due to newer ones. Note that a newer transaction might receive a lower color than an older one because the new one does not conflict with any other transaction (except one old transaction), while the old transaction conflicts with others as well. To prevent this and ensure a fair execution order, we assign each new transaction a color no lower than the smallest color among pending old transactions. This approach guarantees progress because at each time step, the lowest possible color will increase over time.

Moreover, once Ti is colored, its subtransactions are sent to the respective destination shards (Line 9), where they are placed into a queue schdq sorted by color (Line 11). Each shard processes one color group at a time, controlled by a busy flag. After finishing one subtransaction (commit or abort), the shard proceeds to the next one in the queue. Since every color is eventually dequeued, and subtransactions are processed in order, every scheduled subtransaction is eventually processed. Thus, every transaction is either committed or aborted in a finite time, and this proves the liveness.

Corollary 11.

From Lemma 9 and Lemma 10, Algorithm 1 ensures the safety and liveness of the transactions.

A.3 Proof of Theorem 5

Proof.

We will use a reduction from vertex coloring. For all ϵ>0, the problem of approximating the chromatic number of a graph with n nodes within a factor n1ϵ is NP-hard [22].

Consider an instance of vertex coloring on a graph H=(VH,EH) with n nodes. We can transform the vertex coloring instance H to a scheduling problem instance on a graph shard Gs with s=|EH| shards, such that Gs is a synchronous clique with unit distances between the shards. Furthermore, each edge of EH corresponds to a unique node of Gs.

Let 𝒯 be a set of n transactions, all generated concurrently at time t=0, such that each node viVH is mapped to transaction Ti𝒯. For each edge (vi,vj)EH we create a conflict between respective transactions Ti and Tj by making the transactions access a common object in the unique shard of Gs that corresponds to edge (vi,vj). Let G𝒯 be the respective conflict graph for the transactions 𝒯. The conflict graph G𝒯 is isomorphic to H.

A correct execution schedule for 𝒯 (which gives a valid serialization of the transactions in 𝒯) can be represented as a DAG where nodes are transactions and transaction Ti points to Tj if they conflict and Ti executes first in the respective common destination shard with Tj. Then, a layering of the DAG nodes starting from source nodes provides a unique time step for each transaction, so that conflicting transactions receive different time steps. Thus, an execution schedule of the transactions in 𝒯 gives a valid vertex coloring of the nodes in G𝒯 which provides a valid coloring for H. The best length of the transaction schedule given from the DAG, is equal to the number of colors that can be assigned to H.

Since |EH|n(n1)/2, we have that s=O(n2). Each transaction conflicts with at most kn1 other transactions. Therefore, given k and s, we can create the reduction from graph coloring for n=min(k,s). Consequently, the NP-hardness of the scheduling problem in Gs follows from the NP-hardness of the reduced graph coloring problem with n=min(k,s).

A.4 Example of Hierarchical Clustering

Figure 2 shows an example of hierarchical clustering, assuming shards are connected as if they are in a line, where edges in the line have low weights and edges not in the line have large weights. (We omit the sublayers to simplify the example.) Transaction T1 resides in shard S3 and has home cluster x at layer 1. The reason for the home cluster x selection is that T1 accesses an object in S3 and S4, and both of them are in cluster x, and x is the lowest layer cluster including S3 and S4. Similarly, suppose transaction T2, which resides in S5, has home cluster y at layer 2, because T2 accesses an object in S5 and S8, and y is the lowest layer cluster that includes both S5 and S8. Similarly, T3 has home cluster z at layer 3.

Refer to caption
Figure 2: Simple example of cluster decomposition of shard graph Gs.

A.5 Correctness Analysis of Stateless Multi-Leader Scheduler

Lemma 12 (Safety).

If two transactions conflict with each other in Algorithm 2, then they will commit in different time slots, and the local chain produced by Algorithm 2 ensures blockchain serialization.

The proof of Lemma 12 is available in arxiv [4]

Lemma 13 (Liveness).

Algorithm 2 guarantees that every generated transaction will eventually be committed or aborted.

The proof of Lemma 13 is available in arxiv [4].

Corollary 14.

From Lemma 12 and Lemma 13, Algorithm 2 ensures the safety and liveness of the transactions.

A.6 Proof of Theorem 6

Proof.

In the multi-layer scheduler, we need to consider the transactions from all layers and sublayers of the clusters. Suppose q is the topmost layer accessed by any transaction where the diameter of the cluster on that layer is at most dq.

Consider the destination shard Sj, and we have only subtransactions from one leader shard of cluster layer q where the distance between the transaction and its accessing shard is at most dq, and it has maximum competitive ratio denoted by τq=O(dqmin{k,s}) (from Theorem 3) than any other cluster. Therefore the destination shard Sj needs to process subtransactions from all layers 0,,q and from sublayers 0,,H21, and those transactions are processed according to their assigned order.

As discussed in Section 4.2.1, a cluster at layer q has a diameter at most O(2qlogs). Thus dq=O(2qlogs)=c2qlogs, for some positive constant c. This implies q=0qdq2dq. Thus, the competitive ratio of Algorithm 2 considering transactions from all layers and sublayers at destination shard Sj is at most:

τtotalq=0qr=0H21τqq=0qr=0H21O(dqmin{k,s})O(dqH2min{k,s}). (1)

We can replace H2=O(logs) and dq=O(dlogs) (see Section 4.2.1), then Equation 1 becomes:

O(dlog2smin{k,s}).

The correctness analysis of stateful single leader and multi-leader scheduler is available in arxiv [4].

A.7 Proof of Theorem 7

Proof.

This proof follows the same arguments discussed in the proof of Theorem 3. Consider a set of transactions 𝒯 generated at or before time t that are still pending (neither committed nor aborted) at time t. Let G𝒯 denote the conflict graph for 𝒯, where two transactions conflict if they have a common destination shard. Let li denote the number of transactions in 𝒯 that use objects in shard Si. Let l=maxli. Moreover, from the definition of d, at least one transaction is d distance away from the destination shard or leader. So we have that Ω(l+d) is a lower bound on the time that it takes to finalize (commit or abort) the transactions in 𝒯, since at least l subtransactions need to serialize in a destination shard, and at least one transaction is d distance away.

First, consider the case where ks. We have that each transaction conflicts with at most kl other transactions. Hence G𝒯 can be colored with at most kl+1 colors.

Algorithm 3 schedules and commits transactions in batches. For each batch, the leader shard performs the following steps: first, it gathers the state of accessed accounts, takes at most 2d time units (request and receive each takes at most d time units). After pre-committing, the leader sends the pre-commit batch to destination shards, which takes d time units. Additionally, destination shards reach consensus on the received batch within 1 time unit. Hence, the total delay per batch is at most 3d+1.

Since the algorithm uses at most kl+1 colors (batches), the total finalization time is at most: kl+1+3d+2=O(kl+d).

Next, consider the case k>s. Following the same reasoning above and from Theorem 3, we get O(ls+d) time to finalize the transactions 𝒯.

Overall, Algorithm 3 requires O(lmin{k,s}+d) time units to finalize the transactions. Since Ω(l+d) is a lower bound, we have that the approximation factor of the schedule for 𝒯 is O(min{k,s}).

Since t is chosen arbitrarily, we have that the competitive ratio of Algorithm 3 is O(min{k,s}).

A.8 Proof of Theorem 8

Proof.

Similar to Theorem 6, consider the destination shard Sj, as discussed in the proof of Theorem 7, if we have only subtransactions from one leader shard of cluster layer q where the distance between the transaction and its accessing shard is at most dq, then the time to process transactions is O(lmin{k,s}+dq) or equivalently at most c1(lmin{k,s}+dq) time for some positive constant c1. Suppose q is the maximum layer accessed by any transaction where the diameter of the cluster on that layer is at most dq. Then the destination shard Sj needs to process subtransactions from all layers 0,,q and from sublayers 0,,H21, and those transactions are processed according to their assigned order.

As discussed in Section 4.2.1, a cluster at layer q has a diameter at most O(2qlogs). Thus dq=O(2qlogs)=c2qlogs, for some positive constant c. This implies q=0qdq2dq. Thus, the total time unit required by Algorithm 4 to process all the transactions from all layer and sublayers at destination shard Sj is at most:

τtotalq=0qr=0H21c1(lmin{k,s}+dq)c1lH2min{k,s}+2c1dqH2. (2)

We can replace H2=c2logs as we have O(logs) sublayers (see Section 4.2.1) and dq=c3dlogs, where c2 and c3 are some positive constants, then Equation 2 becomes:

c1lc2logsmin{k,s}+2c1c3dlogsc2logs=>O(llogsmin{k,s}+dlog2s).

As discussed in Theorem 7, Ω(l+d) is a lower bound. Thus, we have that the competitive ratio of Algorithm 4 as O(logsmin{k,s}+log2s).

A.9 Pseudocode of Stateful Multi-Leader Scheduler

Algorithm 4 Stateful Multi-Leader Scheduler.