AKSEL: Fast Byzantine SGD

Authors: Amine Boussetta, El-Mahdi El-Mhamdi, Rachid Guerraoui, Alexandre Maurer, Sébastien Rouault



Author Details

Amine Boussetta
  • Mohammed VI Polytechnic University, Ben Guerir, Morocco
El-Mahdi El-Mhamdi
  • EPFL, Lausanne, Switzerland
Rachid Guerraoui
  • EPFL, Lausanne, Switzerland
Alexandre Maurer
  • Mohammed VI Polytechnic University, Ben Guerir, Morocco
Sébastien Rouault
  • EPFL, Lausanne, Switzerland

Cite As

Amine Boussetta, El-Mahdi El-Mhamdi, Rachid Guerraoui, Alexandre Maurer, and Sébastien Rouault. AKSEL: Fast Byzantine SGD. In 24th International Conference on Principles of Distributed Systems (OPODIS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 184, pp. 8:1-8:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


Modern machine learning architectures distinguish servers from workers. Typically, a d-dimensional model is hosted by a server and trained by n workers, using a distributed stochastic gradient descent (SGD) optimization scheme. At each SGD step, the goal is to estimate the gradient of a cost function. The simplest way to do so is to average the gradients estimated by the workers. Averaging, however, is not resilient to even a single Byzantine worker. Many alternative gradient aggregation rules (GARs) have recently been proposed to tolerate a maximum number f of Byzantine workers. These GARs differ according to (1) their time complexity, (2) the maximum number of Byzantine workers despite which convergence can still be ensured (breakdown point), and (3) their accuracy, which can be captured by (3.1) their angular error, namely the angle with the true gradient, as well as (3.2) their ability to aggregate full gradients. In particular, many GARs do not output full gradients, for they operate on each dimension separately; the result is a coordinate-wise blended gradient, which yields low accuracy in practical situations where the number s of workers that are actually Byzantine in an execution is small (s ≪ f). We propose Aksel, a new scalable median-based GAR with optimal time complexity (𝒪(nd)), optimal breakdown point (n > 2f) and the lowest upper bound on the expected angular error (𝒪(√d)) among full-gradient approaches. We also study the actual angular error of Aksel when the gradient distribution is normal and show that it only grows as 𝒪(√(d log n)), which is the first upper bound logarithmic in the number of workers n ever proven under an optimal breakdown point. We also report on an empirical evaluation of Aksel on various classification tasks, comparing it to alternative GARs against state-of-the-art attacks.
Aksel is the only GAR reaching top accuracy when there are actually no or few Byzantine workers, while maintaining a good defense even in the extreme case (s = f). For simplicity of presentation, we consider a scheme with a single server. However, as we explain in the paper, Aksel can easily be adapted to multi-server architectures that tolerate Byzantine behavior from a fraction of the servers.
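To make the contrast between these aggregation rules concrete, the following is a minimal, self-contained sketch (plain Python, standard library only). The selection criterion in `median_filtered_average` is an illustrative assumption in the spirit of a median-based full-gradient rule, not the exact rule defined in the paper: it keeps the workers whose gradients lie closest to the coordinate-wise median and averages only those full gradients, so each pass remains linear in n·d.

```python
import statistics

def average(grads):
    """Plain averaging: optimal without faults, but a single Byzantine
    worker can drag the result arbitrarily far."""
    n, d = len(grads), len(grads[0])
    return [sum(g[k] for g in grads) / n for k in range(d)]

def coordinate_wise_median(grads):
    """Robust per coordinate, but blends coordinates coming from
    different workers, so the output is generally not any worker's
    full gradient."""
    d = len(grads[0])
    return [statistics.median(g[k] for g in grads) for k in range(d)]

def median_filtered_average(grads):
    """Illustrative median-based full-gradient rule (the selection
    criterion is an assumption, not the paper's exact definition):
    compute the coordinate-wise median, keep the workers whose
    gradients are closest to it, and average those full gradients."""
    m = coordinate_wise_median(grads)
    dists = [sum((gk - mk) ** 2 for gk, mk in zip(g, m)) for g in grads]
    cutoff = statistics.median(dists)  # keep the closer half
    kept = [g for g, dist in zip(grads, dists) if dist <= cutoff]
    return average(kept)

# Toy run: 4 honest workers near the true gradient, 1 Byzantine outlier.
honest = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.05]]
byzantine = [[100.0, -100.0]]
grads = honest + byzantine
print(average(grads))                  # pulled far off by the outlier
print(median_filtered_average(grads))  # stays close to [1.0, 2.0]
```

Because the rule averages whole gradients from the selected workers rather than blending coordinates, it behaves like plain averaging when few workers are actually Byzantine (s ≪ f), which is the regime the abstract highlights.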

Subject Classification

ACM Subject Classification
  • Computing methodologies → Batch learning
  • Security and privacy → Distributed systems security
  • Theory of computation → Nonconvex optimization

Keywords and phrases
  • Machine learning
  • Stochastic gradient descent
  • Byzantine failures


