Search Results

Documents authored by Busch, Costas


Document
Fault-Tolerant Distributed Directories

Authors: Judith Beestermöller, Costas Busch, and Roger Wattenhofer

Published in: LIPIcs, Volume 292, 3rd Symposium on Algorithmic Foundations of Dynamic Networks (SAND 2024)


Abstract
Many fundamental distributed computing problems require coordinated access to a shared resource. A distributed directory is an overlay data structure on an asynchronous graph G that helps to access a shared token t. The directory supports three basic operations: publish, to initialize the directory; lookup, to read the contents of the token; and move, to get exclusive update access to the token. There are known directory schemes that achieve message complexity within polylogarithmic factors of the optimal cost with respect to the number of nodes n and the diameter D of G. Motivated by fault-tolerant distributed computing implementations, we consider the impact of edge failures on distributed directories. We give a distributed directory overlay data structure that can tolerate edge failures without disrupting the directory operations. The directory can be repaired concurrently while it processes directory operations. We analyze the impact of the faults on the amortized cost of the three directory operations compared to the optimal cost. We show that f edge failures increase the amortized competitive ratio of the operations by at most a factor of f. We also analyze the message complexity of repairing the overlay structure, in terms of the number of messages sent and the maximum distance a message traverses. For an edge failure, the repair mechanism uses messages of size 𝒪(log n) that traverse distance at most D', the graph diameter after the fault. To our knowledge, this is the first asymptotic analysis of a fault-tolerant distributed directory.
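
To make the three operations concrete, below is a minimal, single-process Python sketch of a classic Arrow-style directory over a spanning tree: every node keeps a pointer toward the current token owner, publish initializes the pointers, lookup follows them, and move reverses them along the traversed path. The tree input, class name, and full-path pointer reversal are assumptions for exposition; this illustrates only the directory interface, not the paper's fault-tolerant scheme.

# Illustrative sketch (not the paper's algorithm): an Arrow-style
# directory on a spanning tree, modeled in a single process.
class ArrowDirectory:
    def __init__(self, tree_adj):
        self.tree = tree_adj   # node -> list of tree neighbors
        self.link = {}         # per-node pointer toward the token owner

    def publish(self, root):
        """Initialize the directory: every pointer faces the publisher."""
        self.link[root] = root
        visited, stack = {root}, [root]
        while stack:
            u = stack.pop()
            for v in self.tree[u]:
                if v not in visited:
                    visited.add(v)
                    self.link[v] = u   # v's next hop toward the owner
                    stack.append(v)

    def _path_to_owner(self, u):
        path = [u]
        while self.link[path[-1]] != path[-1]:
            path.append(self.link[path[-1]])
        return path

    def lookup(self, u):
        """Follow pointers from u and return the current token owner."""
        return self._path_to_owner(u)[-1]

    def move(self, u):
        """Give u exclusive access: reverse pointers along the path."""
        path = self._path_to_owner(u)
        for prev, nxt in zip(path, path[1:]):
            self.link[nxt] = prev      # now points toward the new owner
        self.link[u] = u
        return len(path) - 1           # hops traversed (the operation cost)

For example, on the path graph a-b-c (tree_adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}), publish('a') points b and c toward a; move('c') costs two hops and redirects both pointers, after which lookup('b') reaches c in a single hop.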

Cite as

Judith Beestermöller, Costas Busch, and Roger Wattenhofer. Fault-Tolerant Distributed Directories. In 3rd Symposium on Algorithmic Foundations of Dynamic Networks (SAND 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 292, pp. 5:1-5:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


BibTeX

@InProceedings{beestermoller_et_al:LIPIcs.SAND.2024.5,
  author =	{Beesterm\"{o}ller, Judith and Busch, Costas and Wattenhofer, Roger},
  title =	{{Fault-Tolerant Distributed Directories}},
  booktitle =	{3rd Symposium on Algorithmic Foundations of Dynamic Networks (SAND 2024)},
  pages =	{5:1--5:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-315-7},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{292},
  editor =	{Casteigts, Arnaud and Kuhn, Fabian},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SAND.2024.5},
  URN =		{urn:nbn:de:0030-drops-198833},
  doi =		{10.4230/LIPIcs.SAND.2024.5},
  annote =	{Keywords: distributed directory, sparse partition, fault tolerance, message complexity, path dilation}
}
Document
BifurKTM: Approximately Consistent Distributed Transactional Memory for GPUs

Authors: Samuel Irving, Lu Peng, Costas Busch, and Jih-Kwon Peir

Published in: OASIcs, Volume 88, 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2021)


Abstract
We present BifurKTM (BkTM), the first read-optimized Distributed Transactional Memory system for GPU clusters. The BifurKTM design includes: GPU KoSTM, a new software transactional memory conflict detection scheme that exploits relaxed consistency to increase throughput; and KoDTM, a Distributed Transactional Memory model that combines the Data- and Control-flow models to greatly reduce communication overheads. Despite the allure of huge speedups, GPUs see limited use due to their programmability challenges and extreme sensitivity to workload characteristics. These become daunting concerns when considering a distributed GPU cluster, in which a programmer must design algorithms that hide communication latency by exploiting data regularity, high compute intensity, etc. The BifurKTM design allows GPU programmers to exploit a new workload characteristic: the percentage of the workload that is Read-Only (i.e., reads but does not modify shared memory), even when this percentage is not known in advance. Programmers designate transactions that are suitable for Approximate Consistency, in which transactions "appear" to execute at the most convenient time for preventing conflicts. By leveraging Approximate Consistency for Read-Only transactions, the BifurKTM runtime system offers improved performance, application flexibility, and programmability without introducing any errors into shared memory. Our experiments show that Approximate Consistency can improve BkTM performance by up to 34x in applications with moderate network communication utilization and a read-intensive workload. Using Approximate Consistency, BkTM can reduce GPU-to-GPU network communication by 99%, reduce the number of aborts by up to 100%, and achieve an average speedup of 18x over a similarly sized CPU cluster while requiring minimal effort from the programmer.
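
The core idea, read-only transactions that commit against a slightly stale but committed snapshot while update transactions validate, can be sketched in a few lines of Python. This is a hypothetical single-node model of the general technique, not BifurKTM's GPU runtime; the class name, versioning rule, and API are assumptions for exposition.

# Illustrative sketch (not BifurKTM's runtime): read-only transactions
# never validate and never abort; update transactions validate versions
# and abort on conflict, so shared memory stays error-free.
import threading

class ApproxConsistencyTM:
    def __init__(self):
        self.store = {}                 # key -> (value, commit version)
        self.version = 0                # global commit counter
        self.lock = threading.Lock()

    def read_only(self, keys):
        """Approximately consistent read: returns committed values
        without validation, so it can never conflict or abort."""
        with self.lock:
            return {k: self.store.get(k, (None, 0))[0] for k in keys}

    def update(self, keys, compute):
        """Optimistic update: snapshot, compute, validate, commit.
        Returns False on conflict; the caller is expected to retry."""
        with self.lock:
            snapshot = {k: self.store.get(k, (None, 0)) for k in keys}
        writes = compute({k: val for k, (val, _) in snapshot.items()})
        with self.lock:
            for k, (_, ver) in snapshot.items():
                if self.store.get(k, (None, 0))[1] != ver:
                    return False        # a read key changed: abort
            self.version += 1
            for k, val in writes.items():
                self.store[k] = (val, self.version)
            return True

In this model a purely read-only workload incurs zero aborts and no validation traffic, which mirrors, in spirit, the abstract's point that Approximate Consistency eliminates aborts and slashes inter-GPU communication for read-heavy workloads.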

Cite as

Samuel Irving, Lu Peng, Costas Busch, and Jih-Kwon Peir. BifurKTM: Approximately Consistent Distributed Transactional Memory for GPUs. In 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2021). Open Access Series in Informatics (OASIcs), Volume 88, pp. 2:1-2:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


BibTeX

@InProceedings{irving_et_al:OASIcs.PARMA-DITAM.2021.2,
  author =	{Irving, Samuel and Peng, Lu and Busch, Costas and Peir, Jih-Kwon},
  title =	{{BifurKTM: Approximately Consistent Distributed Transactional Memory for GPUs}},
  booktitle =	{12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2021)},
  pages =	{2:1--2:15},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-181-8},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{88},
  editor =	{Bispo, Jo\~{a}o and Cherubin, Stefano and Flich, Jos\'{e}},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.PARMA-DITAM.2021.2},
  URN =		{urn:nbn:de:0030-drops-136386},
  doi =		{10.4230/OASIcs.PARMA-DITAM.2021.2},
  annote =	{Keywords: GPU, Distributed Transactional Memory, Approximate Consistency}
}