Brussels / 3 & 4 February 2024


Kubernetes and HPC: Bare-metal bros

The high performance computing (HPC) community has an often tumultuous relationship with cloud, with advocates from both sides facing off in a West Side Story-style fight. On the cloud side, the costs, the historical lack of focus on performance, and the lack of transparency about the supply of resources are problematic. On the HPC side, the lack of consistent, automated orchestration poses challenges for running complex emerging workflows. Cloud is growing rapidly and is projected to be the largest sector of computing by revenue by 2025. Cloud's market-favored position makes it a primary source and beneficiary of innovations, allowing it to address limitations of HPC. HPC needs a strategy for integrating cloud technologies and collaborating with the cloud community. If our HPC community cannot successfully integrate cloud technologies and techniques, we (and our science) will be left behind.

To make this integration possible, the community must adopt a new strategy: converged computing, or creating environments and technologies that combine the best of both worlds. In this talk I will discuss our innovative work toward this goal: implementing HPC technologies inside of Kubernetes, the de facto standard workload orchestration framework, and Kubernetes inside of HPC. I will provide a brief history of our early work, including improving on the default Kubernetes scheduler by testing more sophisticated graph-based scheduling (a project called Fluence), and implementing the entirety of an HPC workload manager inside of Kubernetes (the Flux Operator). Finally, I will present the ultimate turducken: running machine learning and HPC applications (including the Flux Operator itself) inside of Kubernetes, inside of Flux. With this approach, we imagine a collaborative future where HPC users can deploy workloads seamlessly across environments and encounter equality of automation and performance.



Vanessa Sochat