Document

Keynote

DOI: 10.4230/LIPIcs.OPODIS.2018.3

How to Make Decisions (Optimally) (Keynote)

Authors: Siddhartha Sen

Published in: LIPIcs, Volume 125, 22nd International Conference on Principles of Distributed Systems (OPODIS 2018)

Abstract

Distributed systems are constantly faced with difficult decisions to make, such as in scheduling, caching, and traffic routing, to name a few. In most of these scenarios, the optimal decision is unknown and depends heavily on context. How can a system designer know if they have deployed the best decision-making policy, or if a different policy would perform better? As a community, we have developed a few methodologies for answering this question, some of them offline (e.g., simulation, trace-driven modeling) and some of them online (e.g., A/B testing). Neither approach is satisfactory: the offline methods suffer from bias and rely heavily on domain knowledge; the online methods are costly and difficult to deploy. What system designers ideally seek is the ability to ask "what if" questions about a policy without ever deploying it, which is called counterfactual evaluation. In this talk, I will show how reinforcement learning and causal inference can be synthesized to counterfactually evaluate a distributed system. We will apply this methodology to infrastructure systems in Azure, and face fundamental challenges and opportunities along the way. This talk will serve as an introduction to reinforcement learning and the counterfactual way of thinking, which I hope will interest and inspire the OPODIS community.

I will start by introducing reinforcement learning (RL) as the right framework for modeling decisions in a distributed system. In RL, an agent learns by interacting with its environment: i.e., making decisions and receiving feedback for them. This is a stark contrast to traditional (supervised) learning, where the correct answer, or "label", is known. Since an RL agent does not know the correct answer, it must constantly explore its world by randomizing some of its decisions. Now it turns out that this randomization, if used correctly, can give us a special superpower: the ability to evaluate policies that have never been deployed. As magical as this may sound, we can use statistics to show that this evaluation is indeed correct.

Unfortunately, applying this methodology to distributed systems is far from straightforward. Systems are complex, stateful amalgamations of components that navigate large decision spaces. We will need to wear both an RL hat and a systems hat to address these challenges. On the other hand, systems also present exciting opportunities. Many systems already use randomization in their decisions, e.g., to distribute data or work over replicas, or to manage resource contention. Sometimes, a conservative decision can implicitly yield feedback for other decisions: for example, when waiting for a timeout to expire, we automatically get feedback for what would have happened if we waited for any shorter amount of time. I will show how we can harvest this randomness and implicit feedback to achieve more effective counterfactual evaluation.

We will apply all of the above ideas to two production infrastructure systems in Azure: a machine health monitor that decides when to reboot unresponsive machines, and a geo-distributed edge proxy that chooses the TCP configuration of each proxy machine. In both cases, we are able to counterfactually evaluate arbitrary policies with estimates that match the ground truth. Production environments raise interesting constraints and challenges, some of which are preventing us from scaling up our methodology. I will describe a possible path forward, and invite others in the community to contemplate these problems as well.
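
The abstract does not name the estimator behind counterfactual evaluation. As an illustration only, the sketch below uses inverse propensity scoring (IPS), one standard way to turn logged randomized decisions into an estimate of a policy that was never deployed; the talk's actual method may differ. The reboot scenario is loosely modeled on the machine health monitor mentioned above, but the contexts, reward function, and the names `true_reward`, `collect_logs`, and `ips_estimate` are all hypothetical.

```python
import random

ACTIONS = ["wait", "reboot"]

def true_reward(context, action):
    # Hypothetical ground truth (an assumption, not from the talk):
    # rebooting a stuck machine or waiting on a healthy one earns 1, else 0.
    good = {("stuck", "reboot"), ("healthy", "wait")}
    return 1.0 if (context, action) in good else 0.0

def collect_logs(n, seed=0):
    """Simulate a deployed system that picks actions uniformly at random
    and logs (context, action, reward, propensity) for each decision."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        context = rng.choice(["healthy", "stuck"])
        action = rng.choice(ACTIONS)
        propensity = 1.0 / len(ACTIONS)  # probability of the logged action
        logs.append((context, action, true_reward(context, action), propensity))
    return logs

def ips_estimate(logs, target_policy):
    """Estimate the average reward the target policy would have earned,
    using only the logged data. A logged decision counts only when the
    target policy agrees with it, reweighted by 1 / propensity."""
    total = 0.0
    for context, action, reward, propensity in logs:
        if target_policy(context) == action:
            total += reward / propensity
    return total / len(logs)

if __name__ == "__main__":
    logs = collect_logs(20000, seed=1)
    # Two candidate policies that were never deployed.
    reboot_if_stuck = lambda c: "reboot" if c == "stuck" else "wait"
    always_reboot = lambda c: "reboot"
    print("reboot_if_stuck:", ips_estimate(logs, reboot_if_stuck))
    print("always_reboot:  ", ips_estimate(logs, always_reboot))
```

Under these assumptions, the true average rewards are 1.0 for `reboot_if_stuck` and 0.5 for `always_reboot`, and the estimates should land close to those values. The reweighting by `1 / propensity` is what makes the estimate unbiased, and it is why the logging policy must give every action a nonzero probability.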

Cite as

Siddhartha Sen. How to Make Decisions (Optimally) (Keynote). In 22nd International Conference on Principles of Distributed Systems (OPODIS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 125, p. 3:1, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@InProceedings{sen:LIPIcs.OPODIS.2018.3,
  author =	{Sen, Siddhartha},
  title =	{{How to Make Decisions (Optimally)}},
  booktitle =	{22nd International Conference on Principles of Distributed Systems (OPODIS 2018)},
  pages =	{3:1--3:1},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-098-9},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{125},
  editor =	{Cao, Jiannong and Ellen, Faith and Rodrigues, Luis and Ferreira, Bernardo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.OPODIS.2018.3},
  URN =		{urn:nbn:de:0030-drops-100638},
  doi =		{10.4230/LIPIcs.OPODIS.2018.3},
  annote =	{Keywords: reinforcement learning, distributed systems, counterfactual evaluation}
}

Document

Brief Announcement

DOI: 10.4230/LIPIcs.DISC.2017.45

Brief Announcement: Black-Box Concurrent Data Structures for NUMA Architectures

Authors: Irina Calciu, Siddhartha Sen, Mahesh Balakrishnan, and Marcos K. Aguilera

Published in: LIPIcs, Volume 91, 31st International Symposium on Distributed Computing (DISC 2017)

Abstract

Recent work introduced a method to automatically produce concurrent data structures for NUMA architectures. We present a summary of that work.

Cite as

Irina Calciu, Siddhartha Sen, Mahesh Balakrishnan, and Marcos K. Aguilera. Brief Announcement: Black-Box Concurrent Data Structures for NUMA Architectures. In 31st International Symposium on Distributed Computing (DISC 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 91, pp. 45:1-45:3, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

@InProceedings{calciu_et_al:LIPIcs.DISC.2017.45,
  author =	{Calciu, Irina and Sen, Siddhartha and Balakrishnan, Mahesh and Aguilera, Marcos K.},
  title =	{{Brief Announcement: Black-Box Concurrent Data Structures for NUMA Architectures}},
  booktitle =	{31st International Symposium on Distributed Computing (DISC 2017)},
  pages =	{45:1--45:3},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-053-8},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{91},
  editor =	{Richa, Andr\'{e}a},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.DISC.2017.45},
  URN =		{urn:nbn:de:0030-drops-80122},
  doi =		{10.4230/LIPIcs.DISC.2017.45},
  annote =	{Keywords: concurrent data structures, log, NUMA architecture, replication}
}
