DROPS

Document

DOI: 10.4230/OASIcs.NINeS.2026.26

Running Distributed Systems like Clockwork

Authors: Karan Newatia, Robert Gifford, Qingjie Lu, Andreas Haeberlen, and Linh Thi Xuan Phan

Published in: OASIcs, Volume 139, 1st New Ideas in Networked Systems (NINeS 2026)

Abstract

Distributed Systems are commonly built using a set of standard assumptions: we assume that message delays are unbounded, that any packet can be lost in the network, and that clocks cannot be closely synchronized. On the one hand, these conservative assumptions result in robust systems that can operate reliably in a wide variety of conditions. On the other hand, they also force the system to do a lot of complex ad-hoc coordination and thus limit the performance it can achieve. In this paper, we take a look at what lies beyond this standard model. We observe that, on modern hardware in a single-tenant data center, distributed systems are able to closely coordinate and essentially "run like clockwork" with very little effort. If we are willing to additionally rule out some worst-case failure scenarios, this results in a large performance improvement, both in practice and even in theory. We demonstrate this effect using state-machine replication (SMR) as a case study: our SMR protocol, Watchmaker, exceeds the throughput of state-of-the-art algorithms by two orders of magnitude, and it requires only half as many replicas to tolerate the same number of faults.

Cite as

Karan Newatia, Robert Gifford, Qingjie Lu, Andreas Haeberlen, and Linh Thi Xuan Phan. Running Distributed Systems like Clockwork. In 1st New Ideas in Networked Systems (NINeS 2026). Open Access Series in Informatics (OASIcs), Volume 139, pp. 26:1-26:31, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)

Copy BibTex To Clipboard

@InProceedings{newatia_et_al:OASIcs.NINeS.2026.26,
  author =	{Newatia, Karan and Gifford, Robert and Lu, Qingjie and Haeberlen, Andreas and Phan, Linh Thi Xuan},
  title =	{{Running Distributed Systems like Clockwork}},
  booktitle =	{1st New Ideas in Networked Systems (NINeS 2026)},
  pages =	{26:1--26:31},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-414-7},
  ISSN =	{2190-6807},
  year =	{2026},
  volume =	{139},
  editor =	{Argyraki, Katerina and Panda, Aurojit},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.NINeS.2026.26},
  URN =		{urn:nbn:de:0030-drops-256115},
  doi =		{10.4230/OASIcs.NINeS.2026.26},
  annote =	{Keywords: State-machine replication, distributed systems, data centers, clock synchronization, fault tolerance, synchrony}
}

Document

DOI: 10.4230/DagSemProc.06371.3

Abstracting out Byzantine Behavior

Authors: Peter Druschel, Andreas Haeberlen, and Petr Kouznetsov

Published in: Dagstuhl Seminar Proceedings, Volume 6371, From Security to Dependability (2007)

Abstract

Many distributed systems are designed to tolerate the presence of emph{Byzantine} failures: an individual process may arbitrarily deviate from the algorithm assigned to it. Depending on the application requirements, systems enjoy various levels of fault-tolerance. Systems based on state machine replication are able to emph{mask} failures so that their effect is not visible by the application. In contrast, cooperative peer-to-peer systems can tolerate bounded deviant behavior to some extent and therefore do not require masking, as long as each faulty node is emph{exposed}eventually. Finding an abstract way to reason about the levels of fault-tolerance is thus of immanent importance. We discuss how the information of deviant behavior can be abstracted out in the form of a emph{Byzantine failure detector} (BFD). We formally define a BFD abstraction, and we discuss two ways of using the abstraction: (1) monitoring systems in order to retroactively detect Byzantine failures and (2) enforcing systems in order to boost their level of fault-tolerance. Interestingly, the BFD formalism allowed us to determine the relative hardness of implementing two popular abstractions in distributed computing: state machine replication and weak interactive consistency.

Cite as

Peter Druschel, Andreas Haeberlen, and Petr Kouznetsov. Abstracting out Byzantine Behavior. In From Security to Dependability. Dagstuhl Seminar Proceedings, Volume 6371, pp. 1-12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2007)

Copy BibTex To Clipboard

@InProceedings{druschel_et_al:DagSemProc.06371.3,
  author =	{Druschel, Peter and Haeberlen, Andreas and Kouznetsov, Petr},
  title =	{{Abstracting out Byzantine Behavior}},
  booktitle =	{From Security to Dependability},
  pages =	{1--12},
  series =	{Dagstuhl Seminar Proceedings (DagSemProc)},
  ISSN =	{1862-4405},
  year =	{2007},
  volume =	{6371},
  editor =	{Christian Cachin and Felix C. Freiling and Jaap-Henk Hoepman},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagSemProc.06371.3},
  URN =		{urn:nbn:de:0030-drops-8501},
  doi =		{10.4230/DagSemProc.06371.3},
  annote =	{Keywords: Fault-tolerance, Byzantine failures, masking, detection, total order broadcast, weak interactive consistency}
}

Search Results

Documents authored by Haeberlen, Andreas

Running Distributed Systems like Clockwork

Abstract

Cite as

Abstracting out Byzantine Behavior

Abstract

Cite as

Thanks for your feedback!

Could not send message