Randomized Local Fast Rerouting for Datacenter Networks with Almost Optimal Congestion

Authors Gregor Bankhamer, Robert Elsässer, Stefan Schmid




File

LIPIcs.DISC.2021.9.pdf
  • Filesize: 0.81 MB
  • 19 pages

Author Details

Gregor Bankhamer
  • Department of Computer Sciences, Universität Salzburg, Austria
Robert Elsässer
  • Department of Computer Sciences, Universität Salzburg, Austria
Stefan Schmid
  • TU Berlin, Germany
  • Faculty of Computer Science, Universität Wien, Austria

Cite As

Gregor Bankhamer, Robert Elsässer, and Stefan Schmid. Randomized Local Fast Rerouting for Datacenter Networks with Almost Optimal Congestion. In 35th International Symposium on Distributed Computing (DISC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 209, pp. 9:1-9:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.DISC.2021.9

Abstract

To ensure high availability, datacenter networks must rely on local fast rerouting mechanisms that allow routers to quickly react to link failures in a fully decentralized manner. However, configuring these mechanisms to provide high resilience against multiple failures while avoiding congestion along failover routes is algorithmically challenging, as the rerouting rules can only depend on local failure information and must be defined ahead of time. This paper presents a randomized local fast rerouting algorithm for Clos networks, the predominant datacenter topologies. Given a graph G = (V,E) describing a Clos topology, our algorithm defines local routing rules for each node v ∈ V, which only depend on the packet’s destination and are conditioned on the incident link failures. We prove that as long as the number of failures at each node does not exceed a certain bound, our algorithm achieves an asymptotically minimal congestion up to polyloglog factors along failover paths. Our lower bounds are developed under some natural routing assumptions.
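
As a rough illustration of what a randomized local failover rule can look like, the toy sketch below lets a node forward each packet to a next hop chosen uniformly at random among its candidate links that have not failed. This is not the algorithm from the paper, only a minimal example of a rule that depends solely on the destination's candidate next hops and on locally visible link failures. The names `choose_next_hop`, `simulate_load`, and the `core0`..`core3` uplinks are hypothetical.

```python
import random
from collections import Counter


def choose_next_hop(candidates, failed_links, rng):
    """Pick a next hop uniformly at random among candidates whose link is up.

    Only local information is used: the candidate next hops for the
    packet's destination and the failed links incident to this node.
    Returns None if every candidate link has failed.
    """
    alive = [hop for hop in candidates if hop not in failed_links]
    if not alive:
        return None
    return rng.choice(alive)


def simulate_load(candidates, failed_links, num_packets, seed=0):
    """Count how many packets each surviving next hop receives."""
    rng = random.Random(seed)
    load = Counter()
    for _ in range(num_packets):
        hop = choose_next_hop(candidates, failed_links, rng)
        if hop is not None:
            load[hop] += 1
    return load


# Hypothetical example: four uplinks, one of which has failed locally.
uplinks = ["core0", "core1", "core2", "core3"]
failed = {"core1"}
load = simulate_load(uplinks, failed, num_packets=3000)
print(dict(load))
```

In this toy setting the packets that would have used the failed uplink are spread over the surviving ones instead of all being shifted to a single deterministic backup link, which is the intuition behind using randomization to limit congestion on failover routes.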

Subject Classification

ACM Subject Classification
  • Theory of computation → Approximation algorithms analysis
  • Theory of computation → Distributed algorithms
  • Networks → Data path algorithms
Keywords
  • local failover routing
  • congestion
  • randomized algorithms
  • datacenter networks

