Brussels / 3 & 4 February 2024


How the Kubernetes Community is Improving Kubernetes for HPC/AI/ML Workloads

Kubernetes has become the platform of choice for container orchestration, yet High Performance Computing (HPC) workloads remain underserved by the Kubernetes community. To address this, a working group was established to identify the gaps and implement the missing features in Kubernetes. In this talk, I will give a brief overview of the different components of Kubernetes and explain what work is being done to improve the experience of an HPC user on a Kubernetes cluster. The main focus of the talk is the set of enhancements made to the Job API in Kubernetes: Indexed Jobs, Suspended Jobs, Pod Failure Policy, and Pod Replacement Policy. Even with these features in the Job API, there is still a need for an API that can represent a collection of Jobs. I will also present the JobSet project and give examples of how it can be used to run some common HPC patterns.


Kevin Hannon