Brussels / 31 January & 1 February 2026

Track Energy & Emissions of User Jobs on HPC/AI Platforms using CEEMS


With the rapid acceleration of ML/AI research over the last couple of years, already energy-hungry HPC platforms have become even more demanding. A major part of this energy consumption comes from users’ workloads, so the overall energy consumption of a platform can only be reduced with the participation of end users. However, most HPC platforms provide neither energy consumption metrics nor performance metrics out of the box, which gives end users little incentive to optimize their workloads.

The Compute Energy & Emissions Monitoring Stack (CEEMS) has been designed to address this issue. CEEMS reports the energy consumption and equivalent emissions of user workloads in real time on SLURM (HPC), OpenStack (cloud) and Kubernetes platforms alike. It leverages the Linux perf subsystem and eBPF to monitor application performance metrics, which helps end users quickly identify bottlenecks in their workflows and optimize them to reduce their energy and carbon footprint. CEEMS also supports eBPF-based continuous profiling, and it is the first monitoring stack to do so on HPC platforms. Another advantage is that CEEMS systematically monitors all jobs on the platform without end users having to modify their workflows or code.
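To make "equivalent emissions" concrete, here is a minimal sketch of the usual conversion from a job's energy to emissions via a grid emission factor. It is not taken from CEEMS; the function name, the parameter names and the factor of 50 gCO2e/kWh are assumptions chosen purely for illustration.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def equivalent_emissions_g(energy_joules: float, factor_g_per_kwh: float) -> float:
    """Convert a job's energy (joules) to emissions (gCO2e).

    Hypothetical helper: `factor_g_per_kwh` is the emission factor of the
    electricity mix in gCO2e per kWh, supplied by the caller.
    """
    if energy_joules < 0 or factor_g_per_kwh < 0:
        raise ValueError("energy and emission factor must be non-negative")
    return energy_joules / JOULES_PER_KWH * factor_g_per_kwh


# Illustrative values only: a job that consumed 7.2 MJ (2 kWh)
# on a grid with an assumed factor of 50 gCO2e/kWh.
job_emissions = equivalent_emissions_g(7.2e6, 50.0)
print(f"{job_emissions:.1f} gCO2e")
```

In practice the emission factor varies with time and location, so a real-time report would need to apply the factor that was valid while the energy was consumed.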

Besides CPU energy usage, CEEMS reports the energy usage and performance metrics of workloads on NVIDIA and AMD GPU accelerators. It is built around prominent open-source tools in the observability ecosystem, such as Prometheus and Grafana. It is designed to be extensible and lets HPC center operators easily define energy estimation rules for user workloads based on the underlying hardware. CEEMS monitors I/O and network metrics in a file-system-agnostic manner, so it works with any parallel file system used by HPC platforms. The talk will conclude by showing how CEEMS monitoring is used on the Jean-Zay HPC platform, which has more than 2000 nodes and a daily job churn of around 20k jobs.
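As an illustration of what an energy estimation rule might look like, the sketch below splits a node's measured energy among the jobs running on it in proportion to their CPU time. This is one simple, assumed rule, not CEEMS's actual rule format or default; the function name and job identifiers are hypothetical.

```python
def attribute_node_energy(
    node_energy_joules: float,
    cpu_seconds_by_job: dict[str, float],
) -> dict[str, float]:
    """Split node-level energy among jobs by their share of CPU time.

    Assumed rule for illustration: each job is charged
    node_energy * (job CPU seconds / total CPU seconds).
    """
    total_cpu = sum(cpu_seconds_by_job.values())
    if total_cpu <= 0:
        # No CPU activity recorded: attribute nothing to any job.
        return {job: 0.0 for job in cpu_seconds_by_job}
    return {
        job: node_energy_joules * cpu / total_cpu
        for job, cpu in cpu_seconds_by_job.items()
    }


# Illustrative values only: a node that consumed 1000 J while two jobs ran.
shares = attribute_node_energy(1000.0, {"job_a": 30.0, "job_b": 10.0})
for job, joules in shares.items():
    print(job, joules)
```

A rule tuned to specific hardware could weight other signals as well, such as memory or GPU usage, which is the kind of per-platform choice the abstract leaves to operators.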

Speakers

Mahendra Paipuri

Links