AKSEL: Fast Byzantine SGD

Boussetta, Amine; El-Mhamdi, El-Mahdi; Guerraoui, Rachid; Maurer, Alexandre; Rouault, Sébastien

doi:10.4230/LIPIcs.OPODIS.2020.8

Abstract

Modern machine learning architectures distinguish servers and workers. Typically, a d-dimensional model is hosted by a server and trained by n workers, using a distributed stochastic gradient descent (SGD) optimization scheme. At each SGD step, the goal is to estimate the gradient of a cost function. The simplest way to do this is to average the gradients estimated by the workers. However, averaging is not resilient to even one single Byzantine failure of a worker. Many alternative gradient aggregation rules (GARs) have recently been proposed to tolerate a maximum number f of Byzantine workers. These GARs differ according to (1) the complexity of their computation time, (2) the maximal number of Byzantine workers despite which convergence can still be ensured (breakdown point), and (3) their accuracy, which can be captured by (3.1) their angular error, namely the angle with the true gradient, as well as (3.2) their ability to aggregate full gradients. In particular, many are not full gradients for they operate on each dimension separately, which results in a coordinate-wise blended gradient, leading to low accuracy in practical situations where the number (s) of workers that are actually Byzantine in an execution is small (s < < f). 
We propose Aksel, a new scalable median-based GAR with optimal time complexity (𝒪(nd)), optimal breakdown point (n > 2f) and the lowest upper bound on the expected angular error (𝒪(√d)) among full gradient approaches. We also study the actual angular error of Aksel when the gradient distribution is normal and show that it only grows in 𝒪(√dlog{n}), which is the first logarithmic upper bound ever proven on the number of workers n assuming an optimal breakdown point. We also report on an empirical evaluation of Aksel on various classification tasks, which we compare to alternative GARs against state-of-the-art attacks. Aksel is the only GAR reaching top accuracy when there is actually none or few Byzantine workers while maintaining a good defense even under the extreme case (s = f). For simplicity of presentation, we consider a scheme with a single server. However, as we explain in the paper, Aksel can also easily be adapted to multi-server architectures that tolerate the Byzantine behavior of a fraction of the servers.

Martin Abadi et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265-283, 2016.
Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 4613-4623, 2018.
Moran Baruch, Gilad Baruch, and Yoav Goldberg. A little is enough: Circumventing defenses for distributed learning, 2019. URL: http://arxiv.org/abs/1902.06156.
Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems 30, pages 119-129. Curran Associates, Inc., 2017.
Léon Bottou. On-Line Learning and Stochastic Approximations, page 9–42. Cambridge University Press, USA, 1999.
Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018.
Lingjiao Chen, Hongyi Wang, Zachary Charles, and Dimitris Papailiopoulos. DRACO: Byzantine-resilient distributed training via redundant gradients. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 903-912. PMLR, 2018.
Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.
Georgios Damaskinos, El Mahdi El Mhamdi, Rachid Guerraoui, Arsany Guirguis, and Sébastien Rouault. Aggregathor: Byzantine machine learning via robust gradient aggregation. In SysML, 2019.
Georgios Damaskinos, El Mahdi El Mhamdi, Rachid Guerraoui, Rhicheek Patra, Mahsa Taziki, et al. Asynchronous byzantine machine learning (the case of sgd). In ICML, pages 1153-1162, 2018.
El-Mahdi El-Mhamdi, Rachid Guerraoui, Arsany Guirguis, and Lê Nguyên Hoang. Geniunely distributed byzantine machine learning. In PODC, 2020.
El Mahdi El Mhamdi, Rachid Guerraoui, and Sébastien Rouault. The hidden vulnerability of distributed learning in Byzantium. In Proceedings of the 35th International Conference on Machine Learning, pages 3521-3530. PMLR, 2018.
Yuri Fateev, Vladimir Shaydurov, Evgeny Garin, Dmitry Dmitriev, and Valeriy Tyapkin. Probability distribution functions of the sum of squares of random variables in the non-zero mathematical expectations. Journal of Siberian Federal University. Mathematics & Physics, 9:173-179, 2016.
Matthias Függer and Thomas Nowak. Fast Multidimensional Asymptotic and Approximate Consensus. In 32nd International Symposium on Distributed Computing (DISC 2018), volume 121 of Leibniz International Proceedings in Informatics (LIPIcs), pages 27:1-27:16. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
E.J. Gumbel. Statistics of Extremes. Dover books on mathematics. Dover Publications, 2004.
Robert Hecht-Nielsen. Theory of the backpropagation neural network. In Neural networks for perception, pages 65-93. Elsevier, 1992.
C. A. R. Hoare. Algorithm 65: Find. Commun. ACM, 4(7):321–322, 1961.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571-8580, 2018.
S. Kotz and S. Nadarajah. Extreme Value Distributions. World Scientific Publishing Company, 2000.
Hammurabi Mendes, Maurice Herlihy, Nitin Vaidya, and Vijay Garg. Multidimensional agreement in byzantine systems. Distributed Computing, 28:1-19, 2015.
F. Mosteller and J.W. Tukey. Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley Series in Behavioral Science, 1977.
Lam M. Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtárik, Katya Scheinberg, and Martin Takáč. Sgd and hogwild! convergence without the bounded gradients assumption, 2018.
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
Peter J Rousseeuw. Multivariate estimation with high breakdown point. Mathematical statistics and applications, 8:283-297, 1985.
David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv., 22(4):299–319, 1990.
V. Vapnik. Principles of risk minimization for learning theory. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 831-838. Morgan-Kaufmann, 1992.
Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Generalized byzantine-tolerant sgd, 2018.
Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Phocas: dimensional byzantine-resilient stochastic gradient descent, 2018.
Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Fall of empires: Breaking byzantine-tolerant sgd by inner product manipulation. In UAI, volume 115 of Proceedings of Machine Learning Research, pages 261-270. PMLR, 2020.
Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5650-5659. PMLR, 2018.

AKSEL: Fast Byzantine SGD

Authors Amine Boussetta, El-Mahdi El-Mhamdi, Rachid Guerraoui, Alexandre Maurer, Sébastien Rouault

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message