Accelerating complex Bioinformatics AI pipelines with Kubernetes
- Track: HPC, Big Data & Data Science
- Room: H.1308 (Rolin)
- Day: Sunday
- Start: 13:00
- End: 13:10
- Video only: h1308
- Chat: Join the conversation!
Bioinformatics is an interdisciplinary scientific field that deals with large amounts of biological data. The advent of transformer models applied to this field brought very interesting scientific innovations, including the introduction of Protein Language Models (PLMs) and Antibody Language Models (AbLMs). The complexity of training or fine tuning PLMs/AbLMs along with inference tasks requires a non-trivial amount of GPU resources and a disciplined approach, where DevOps and MLOps methodologies fit very well. In this session we will present a series of tasks related to fine tuning PLMs/AbLMs for classification of SARS-CoV-2's spike proteins. We will highlight how Kubernetes can be used to execute large numbers of computationally intensive tasks on GPU hosts, including best practices for sharing Nvidia GPUs (MIG, Time Slicing, MPS) as part of an open source stack orchestrated with Apache Airflow. While these methodologies can be applied to any Kubernetes cluster, including on hyperscalers, this talk is meant to facilitate the (re)use of on-prem hardware infrastructure, presenting a fully open-source stack that can be easily deployed and maintained on bare metal.
Source code: https://github.com/alexpilotti/bbk-mres https://github.com/alexpilotti/bbk-mres-airflow
Overview of the scientific research made possible by this pipeline: https://cloudba.se/NeBzX
Speakers
| Alessandro Pilotti |