License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.DISC.2018.2
URN: urn:nbn:de:0030-drops-97910

Goldstein, Tom

Challenges for Machine Learning on Distributed Platforms (Invited Talk)



Deep neural networks are trained by solving huge optimization problems with large datasets and millions of variables. On the surface, it seems that the size of these problems makes them a natural target for distributed computing. Despite this, most deep learning research still takes place on a single compute node with a small number of GPUs, and only recently have researchers succeeded in unlocking the power of HPC. In this talk, we'll give a brief overview of how deep networks are trained, and use HPC tools to explore and explain deep network behaviors. Then, we'll explain the problems and challenges that arise when scaling deep nets over large systems, and highlight reasons why naive distributed training methods fail. Finally, we'll discuss recent algorithmic innovations that have overcome these limitations, including "big batch" training for tightly coupled clusters and supercomputers, and "variance reduction" strategies to reduce communication in high-latency settings.

BibTeX - Entry

  author =	{Tom Goldstein},
  title =	{{Challenges for Machine Learning on Distributed Platforms (Invited Talk)}},
  booktitle =	{32nd International Symposium on Distributed Computing (DISC 2018)},
  pages =	{2:1--2:3},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-092-7},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{121},
  editor =	{Ulrich Schmid and Josef Widder},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{},
  URN =		{urn:nbn:de:0030-drops-97910},
  doi =		{10.4230/LIPIcs.DISC.2018.2},
  annote =	{Keywords: Machine learning, distributed optimization}

Keywords: Machine learning, distributed optimization
Collection: 32nd International Symposium on Distributed Computing (DISC 2018)
Issue Date: 2018
Date of publication: 04.10.2018
