Brussels / 1 & 2 February 2025


ZML: A High-Performance AI Inference Stack Built for Production and Multi-Accelerator Deployment


In this talk, we’re introducing ZML, our open-source high-performance AI inference stack built for production on the foundations of the Zig programming language, OpenXLA, MLIR, and Bazel.

We’ll discuss the framework, how to get started, and how we use it to build high-performance inference solutions, with a strong focus on production readiness.

We’ll also highlight some of the “plumbing tools” built into ZML that make production-ready, cross-platform, multi-accelerator deployments seamless.

For example, we’ll demonstrate how passing just a few simple, easy-to-remember command-line options to the build command can produce a Docker image with a Linux binary capable of running on a wide range of runtimes: NVIDIA CUDA, AMD ROCm, Google TPU, AWS Neuron, or even plain CPU. The same invocation can also push the image to your container registry, all in one command: identical executable, single image.
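As a rough sketch of that workflow (the Bazel target name below is hypothetical, and the --@zml//runtimes options are shown in the style of ZML’s documented runtime flags):

    # Build a Linux binary plus the CUDA runtime pieces it needs, package
    # everything as an OCI image, and push it to a registry in one command.
    # (Hypothetical target name; flags follow the pattern of ZML's runtime options.)
    bazel run -c opt //examples/mymodel:push --@zml//runtimes:cuda=true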

We’ll also look at the lengths ZML goes to at build time to download and auto-package only the absolute essentials of the chosen runtimes (CUDA, ROCm, etc.), significantly reducing the size of the built artifact, whether it’s an OCI image or a .tar file.
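A minimal sketch of how the artifact tracks the selected runtimes (again with a hypothetical target name; only the runtimes you enable get fetched and packaged):

    # CPU-only archive: no GPU runtime is downloaded or bundled.
    bazel build //examples/mymodel:tar --@zml//runtimes:cpu=true
    # ROCm archive: only the required ROCm libraries are added to the .tar.
    bazel build //examples/mymodel:tar --@zml//runtimes:rocm=true --@zml//runtimes:cpu=false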

ZML is also developer-friendly, automatically providing the Zig compiler, a language server, and pre-configured setups for VS Code and NeoVim. Its use of Zig, a systems programming language that pairs low-level control with robust abstractions, supports efficient memory management, explicit error handling, and software correctness, all key factors for production-grade solutions.
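To give a flavor of what those language properties look like in practice, here is a minimal, deliberately generic Zig sketch (not code from the ZML codebase) of explicit allocation with errdefer-based cleanup:

    const std = @import("std");

    // Allocate a weight buffer; if any later step fails, errdefer frees the
    // buffer automatically, so the failure path can never leak memory.
    fn loadWeights(allocator: std.mem.Allocator, n: usize) ![]f32 {
        const buf = try allocator.alloc(f32, n);
        errdefer allocator.free(buf);
        // ... read weights into buf, propagating I/O failures with `try` ...
        return buf;
    }

    test "allocates and hands ownership to the caller" {
        const w = try loadWeights(std.testing.allocator, 4);
        defer std.testing.allocator.free(w);
        try std.testing.expectEqual(@as(usize, 4), w.len);
    }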

With this talk, we’d like to share our excitement about ZML, its unique approach to simplifying the complexities of delivering production-ready AI systems, and how it enables developers to efficiently transition from development to deployment.

Speakers

Rene Schallner
Guillaume Wenzek