Online / 5 & 6 February 2022


Exascale PMI on a heterogeneous sub-exascale Slurm cluster

PMIx (Process Management Interface for Exascale) is a de facto standard that provides an efficient interface to launch and control distributed tasks. It was created for exascale HPC systems, where launching a computational job can involve tens of thousands of nodes and bootstrapping MPI (Message Passing Interface) becomes cumbersome. PMIx reduces launch times on such systems from minutes to a few seconds.

Although shorter launch times are less critical on smaller clusters (if always welcome), the efficiency of PMIx is also desirable at sub-exascale. Its low data footprint and reduced data exchange, together with its ability to leverage fast interconnects, are useful on systems with lower-end network fabrics. Moreover, its tight integration with the resource manager helps minimize idling on clusters with limited resources.

At VUB (Vrije Universiteit Brussel), we have recently transitioned our tier-2 HPC cluster to Slurm and enabled PMIx across a mixture of TCP and InfiniBand networks. We will share the lessons learned in the process, along with practical tips for deploying a reliable setup with open-source software.


Alex Domingo