Challenges to Achieving High Availability at Scale (Invited Talk)

Author Wolfram Schulte



PDF
Thumbnail PDF

File

LIPIcs.ECOOP.2017.1.pdf
  • Filesize: 201 kB
  • 1 pages

Document Identifiers

Author Details

Wolfram Schulte

Cite AsGet BibTex

Wolfram Schulte. Challenges to Achieving High Availability at Scale (Invited Talk). In 31st European Conference on Object-Oriented Programming (ECOOP 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 74, p. 1:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)
https://doi.org/10.4230/LIPIcs.ECOOP.2017.1

Abstract

Facebook is a social network that connects more than 1.8 billion people. To serve these many users requires infrastructure which is composed of thousands of interdependent systems that span geographically distributed data centers. But what is the guiding principle for building and operating these systems? For Facebook’s infrastructure teams the answer is: Systems must always be available and never lose data. This talk will explore this quest. We will focus on three aspects. Availability and consistency. What form of consistency do Facebook’s systems guarantee? Strong consistency makes understanding easy but has latency penalties, weak consistency is fast but difficult to reason for developers and users. We describe our usage of eventual consistency and delve into how Facebook constructs its caching and replicated storage systems to minimize the duration for achieving consistency. We share empirical data that measures the effectiveness of our design. Availability and correctness. With network partitions, relaxed forms of consistency, and software bugs, how do we guarantee a consistent state? We present two systems to find and repair structural errors in Facebook’s social graph, one batch and one real-time. Availability and scale. Sharding is one of the standard answers to operate at scale. But how can we develop one system that can shard storage as well as compute? We will introduce a new Sharding-as-a-Service component. We will show and evaluate how its design and service policies control for latency, failure tolerance and operationally efficiency.
Keywords
  • Distributed Systems
  • Availability
  • Reliability
  • Fault Tolerance
  • Consistency
  • Scalability
  • Replication
  • Sharding
  • Caching

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail