Zero‑Touch HPC Nodes: NetBox, Tofu and Packer for a Self‑Configuring SLURM Cluster
- Track: HPC, Big Data & Data Science
- Room: H.1308 (Rolin)
- Day: Sunday
- Start: 12:30
- End: 12:55
- Video only: h1308
Over the last five years, we ran an HPC system for life sciences on top of OpenStack, with a deployment pipeline built from Ansible and manual steps (see FOSDEM 2020 talk). It worked, but it wasn’t something we could easily rebuild from scratch or apply consistently to other parts of our infrastructure.
As we designed our new HPC system (coming online in early 2026), we set ourselves a goal: treat the cluster as something we can declare and then recreate, not pet and nurture. The result is a “zero‑touch” style pipeline where a new node can go from “just racked” to “in SLURM and running jobs” with no manual intervention.
In this talk, we walk through the end‑to‑end workflow:
- NetBox as DCIM and source of truth: racking a server and adding it to NetBox is the trigger; MACs, serials and IPs are automatically imported from vendor tools and IPAM/DNS into our automation.
- Using Tofu/Terragrunt (instead of OpenStack's Heat orchestration service) to provision OpenStack/Ironic, SLURM infrastructure and network fabric across three environments (dev plus two interchangeable prod clusters for blue/green rollouts).
- Image‑based deployment with Packer and Ansible: we split roles into “install” and “configure”. Packages and heavy setup are baked into images, while an ansible-init service runs locally on first boot to apply configuration and join the cluster.
- Making nodes self‑sufficient, including fetching the secrets they need via short‑lived credentials and a minimal external dependency chain.
- The pitfalls: cloud‑init bugs in non‑standard setups, weirdness with multiple datasources and host types, and how we worked around them.
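The “install” versus “configure” split described above can be pictured with a small Python sketch. It is an illustration only, not the pipeline from the talk: the `Task` type, the role and task names, and the `tasks_for` helper are all hypothetical, and the real roles are Ansible roles rather than Python objects. The idea it shows is that each task is tagged with one phase, so the image build and the first-boot run each select a disjoint subset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    role: str
    name: str
    phase: str  # "install" = baked into the image, "configure" = first boot


# Hypothetical task list; role and task names are illustrative assumptions.
TASKS = [
    Task("slurm", "install slurmd packages", "install"),
    Task("slurm", "write slurm.conf and start slurmd", "configure"),
    Task("storage", "install filesystem client", "install"),
    Task("storage", "mount shared filesystems", "configure"),
]

VALID_PHASES = ("install", "configure")


def tasks_for(phase, tasks=TASKS):
    """Return the tasks that belong to one phase, preserving their order."""
    if phase not in VALID_PHASES:
        raise ValueError(f"unknown phase: {phase!r}")
    return [t for t in tasks if t.phase == phase]


# The image build would run only the "install" tasks ...
image_build = tasks_for("install")
# ... and the first-boot service would run only the "configure" tasks.
first_boot = tasks_for("configure")

if __name__ == "__main__":
    for label, selected in (("image build", image_build), ("first boot", first_boot)):
        print(label)
        for t in selected:
            print(f"  {t.role}: {t.name}")
```

Keeping the phase on the task itself means the two stages cannot drift apart: a task is either in the image or applied at boot, never both.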
Come and see how we built a reproducible HPC/Big-Data cluster on open‑source tooling, reusing as much of the stack as possible for the rest of our infrastructure.
About the speakers: Ümit Seren and Leon Schwarzäugl are HPC systems engineers at the Vienna BioCenter, home to three life science institutes. Over the past years, they helped design, deploy and operate an OpenStack‑based HPC cluster and are now leading the automation and deployment architecture of the new HPC system coming online in 2026. Their interests include bare‑metal automation, reproducible infrastructure, high‑throughput computing and making complex systems easier to operate and debug.
Speakers
| Erich B | |
| Ümit Seren | |
| Leon Schwarzäugl | |