Driving HPC Operations With Holistic Monitoring and Operational Data Analytics (Dagstuhl Seminar 23171)

Authors Jim Brandt, Florina Ciorba, Ann Gentile, Michael Ott, Torsten Wilde and all authors of the abstracts in this report




File

DagRep.13.4.98.pdf
  • Filesize: 2.01 MB
  • 23 pages


Author Details

Jim Brandt
  • Sandia National Laboratories, US
Florina Ciorba
  • University of Basel, CH
Ann Gentile
  • Sandia National Laboratories, US
Michael Ott
  • Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities, DE
Torsten Wilde
  • Hewlett Packard Enterprise - Böblingen, DE
and all authors of the abstracts in this report

Cite As

Jim Brandt, Florina Ciorba, Ann Gentile, Michael Ott, and Torsten Wilde. Driving HPC Operations With Holistic Monitoring and Operational Data Analytics (Dagstuhl Seminar 23171). In Dagstuhl Reports, Volume 13, Issue 4, pp. 98-120, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/DagRep.13.4.98

Abstract

Advances in analytic approaches have brought within reach the vision of efficient High Performance Computing (HPC) operations, enabled by dynamic analysis that drives automated feedback and adaptation. Many HPC centers have started the development and deployment of frameworks to enable continuous and holistic monitoring, archiving, and analysis of performance data from their production machines and related infrastructures. The impact of such frameworks rests upon the ability to effectively analyze such data and to take action based on analysis results. Analytic techniques have been successfully developed and applied in other domains, but their features may not apply directly to HPC operations data and situations. Response options are limited in HPC implementations. Leveraging, adapting, and extending analysis techniques and response options would open up new avenues for research and development of actionable analytics that can drive more intelligent operations through both manual and automated response to conditions of interest. Dagstuhl Seminar 23171 brought together practitioners and researchers in the areas of HPC system management and monitoring, analytics, and computer science to collaboratively work on developing community solutions for revolutionizing HPC system operations. The topics discussed in this seminar spanned use cases, data and analytic approaches required to address the use cases, use of analysis results to improve performance and operations, and research in the development and use of autonomous feedback loops.

Subject Classification

ACM Subject Classification
  • Information systems → Data analytics
Keywords
  • Monitoring
  • Operational Data Analytics
  • Dagstuhl Seminar
  • WAFVR
