Brussels / 1 & 2 February 2025


Job-specific performance monitoring on HPC clusters: Challenges and Solutions


Traditional monitoring in high-performance computing (HPC) clusters primarily supports administrators in maintaining and managing their systems. However, performance data in particular is also a valuable resource for users who want to optimize their applications, and for support personnel who need to identify jobs that abuse the HPC system or users and projects that need help.

To address the needs of all roles in an HPC environment (administrators, support personnel, project managers, and users), we developed ClusterCockpit, an open-source monitoring framework that is easy to deploy and maintain and offers a powerful web interface with specific views for the different HPC roles. ClusterCockpit can cover multiple HPC clusters, from small to large scale. By providing an intuitive user interface, ClusterCockpit simplifies performance analysis and enhances usability for diverse stakeholders. While analyzing job profiles can provide critical insights, manual analysis often requires significant time and expertise. To streamline this process, we developed Patho-Jobs, an automated tool that leverages data from ClusterCockpit to detect underperforming jobs or those requiring intervention. In this presentation, we will introduce the core concepts of ClusterCockpit and Patho-Jobs, along with insights gained from their deployment at German National High-Performance Computing (NHR) sites.

ClusterCockpit: https://clustercockpit.org/
Patho-Jobs: https://git-ce.rwth-aachen.de/pathojobs/

Speakers

Christian Iwainsky