Randomized Local Fast Rerouting for Datacenter Networks with Almost Optimal Congestion

Bankhamer, Gregor; Elsässer, Robert; Schmid, Stefan

doi:10.4230/LIPIcs.DISC.2021.9

Author Details

Gregor Bankhamer

Department of Computer Sciences, Universität Salzburg, Austria

Robert Elsässer

Department of Computer Sciences, Universität Salzburg, Austria

Stefan Schmid

TU Berlin, Germany
Faculty of Computer Science, Universität Wien, Austria

Cite As

Gregor Bankhamer, Robert Elsässer, and Stefan Schmid. Randomized Local Fast Rerouting for Datacenter Networks with Almost Optimal Congestion. In 35th International Symposium on Distributed Computing (DISC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 209, pp. 9:1-9:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.DISC.2021.9

Abstract

To ensure high availability, datacenter networks must rely on local fast rerouting mechanisms that allow routers to react quickly to link failures in a fully decentralized manner. However, configuring these mechanisms to provide high resilience against multiple failures while avoiding congestion along failover routes is algorithmically challenging, as the rerouting rules can only depend on local failure information and must be defined ahead of time. This paper presents a randomized local fast rerouting algorithm for Clos networks, the predominant datacenter topologies. Given a graph G = (V,E) describing a Clos topology, our algorithm defines local routing rules for each node v ∈ V, which depend only on the packet's destination and are conditioned on the incident link failures. We prove that, as long as the number of failures at each node does not exceed a certain bound, our algorithm achieves asymptotically minimal congestion along failover paths, up to polyloglog factors. Our lower bounds are developed under some natural routing assumptions.
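To make the routing model in the abstract concrete (rules fixed ahead of time, each consulting only the packet's destination and the node's own failed links), here is a small Python sketch on a hypothetical two-level leaf-spine Clos network. It is an illustration of the model only, not the algorithm from the paper. The per-destination random failover orders, the spine "detour leaf" rule, the hop limit, the undirected link-load measure of congestion, and all names (`build_rules`, `next_hop`, `route`, `congestion`) are assumptions introduced here.

```python
import random
from collections import Counter

# Hypothetical two-level Clos (leaf-spine) model: every leaf ("L", i) is wired
# to every spine ("S", j); a failed link is stored as frozenset({a, b}).
# Illustrative sketch only, not the rules defined in the paper.

def build_rules(num_leaves, num_spines, seed=0):
    """Precompute, before any failure occurs, a pseudo-random failover order
    for every (node, destination) pair."""
    rng = random.Random(seed)
    leaves = [("L", i) for i in range(num_leaves)]
    spines = [("S", j) for j in range(num_spines)]
    rules = {}
    for d in leaves:
        for u in leaves:
            if u != d:
                order = spines[:]
                rng.shuffle(order)          # leaf: random order of uplinks
                rules[(u, d)] = order
        for s in spines:
            detours = [v for v in leaves if v != d]
            rng.shuffle(detours)            # spine: random detour leaves
            rules[(s, d)] = [d] + detours   # try direct delivery first
    return rules

def next_hop(rules, failed, node, dest):
    """Local decision: uses only the node, the destination and which of the
    node's own links have failed. Returns None if every candidate link is down."""
    for nbr in rules[(node, dest)]:
        if frozenset((node, nbr)) not in failed:
            return nbr
    return None

def route(rules, failed, src, dest, max_hops=32):
    """Follow the local rules hop by hop; None means dropped or looping."""
    path = [src]
    while path[-1] != dest:
        if len(path) > max_hops:
            return None                     # forwarding loop under these failures
        nxt = next_hop(rules, failed, path[-1], dest)
        if nxt is None:
            return None                     # node is cut off from all candidates
        path.append(nxt)
    return path

def congestion(paths):
    """Maximum number of delivered paths that share a single (undirected) link."""
    load = Counter()
    for p in paths:
        if p is not None:
            for a, b in zip(p, p[1:]):
                load[frozenset((a, b))] += 1
    return max(load.values(), default=0)

# Usage: all-to-one traffic towards leaf 0 while two of its links are down.
rules = build_rules(num_leaves=8, num_spines=4, seed=1)
dest = ("L", 0)
failed = {frozenset((dest, ("S", 0))), frozenset((dest, ("S", 1)))}
paths = [route(rules, failed, ("L", i), dest) for i in range(1, 8)]
delivered = [p for p in paths if p is not None]
print(f"delivered {len(delivered)}/7, congestion {congestion(paths)}")
```

Because these hypothetical rules ignore the incoming port, some failure patterns make a packet bounce between a spine and a detour leaf; the hop limit in `route` reports such cases as undelivered rather than hiding them. Choosing rules that stay resilient under many failures while keeping congestion low is exactly the problem the paper addresses.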

ACM Subject Classification

Theory of computation → Approximation algorithms analysis
Theory of computation → Distributed algorithms
Networks → Data path algorithms

Keywords

local failover routing
congestion
randomized algorithms
datacenter networks
