Putting Your Jobs Under the Microscope Using OGRT
With the advent of modern package managers for scientific applications (EasyBuild, Spack, etc.), automated building of large numbers of software packages is becoming easier, which quickly gives rise to application life cycle management issues. This makes tracking the applications and libraries that actually get used considerably more important. Existing solutions (module load hooks, launch wrappers) do not account for user-built software, are hard to deploy, or produce inconclusive results.
OGRT enables the tracking of jobs on a cluster with process-level granularity and without a discernible performance penalty. At the moment of process execution, it records the shared libraries in use, the environment variables, and the loaded modules. It also supports watermarking executables and shared objects and reading those watermarks out of memory at runtime. The gathered information is shipped to various backends.
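To make the kind of per-process record described above more concrete, here is a minimal sketch in Python. It is not OGRT's implementation: it assumes a Linux-style `/proc/<pid>/maps` listing as the source of mapped shared objects and the `LOADEDMODULES` variable (set by Environment Modules and Lmod) as the source of loaded modules. The function names and the record layout are hypothetical and chosen only for illustration.

```python
import json
import os


def shared_objects_from_maps(maps_text):
    """Return sorted, de-duplicated shared object paths named in a maps listing."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        name = os.path.basename(path)
        # Keep file-backed mappings that look like shared objects.
        if path.startswith("/") and (name.endswith(".so") or ".so." in name):
            paths.add(path)
    return sorted(paths)


def process_record(pid, maps_text, environ):
    """Assemble a hypothetical per-process record (layout is an assumption)."""
    loaded = [m for m in environ.get("LOADEDMODULES", "").split(":") if m]
    return {
        "pid": pid,
        "shared_objects": shared_objects_from_maps(maps_text),
        "loaded_modules": loaded,
        "environment": dict(environ),
    }


if __name__ == "__main__" and os.path.exists("/proc/self/maps"):
    # Inspect the current Python process on Linux.
    with open("/proc/self/maps") as fh:
        record = process_record(os.getpid(), fh.read(), os.environ)
    print(json.dumps(record, indent=2))
```

A record of this shape serializes directly to JSON, which is convenient for document-oriented backends. Capturing the same data from inside every process at launch, as the abstract describes, requires a different mechanism than reading `/proc` after the fact.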
OGRT aims to be a versatile tool, which can be used to:
- provide a census of used software (including user-built)
- troubleshoot problems with programs picking up unexpected shared libraries
- retroactively inform users about buggy libraries
- overlay process-level data onto existing job monitoring tools
- contribute to reproducibility of application runs
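Several of the uses listed above reduce to simple queries over per-process records. The following sketch assumes each record is a dictionary with hypothetical `job_id`, `user`, and `shared_objects` fields; these names are illustrative and do not reflect OGRT's actual schema or its backend queries.

```python
from collections import Counter


def library_census(records):
    """Count how many processes mapped each shared object."""
    counts = Counter()
    for record in records:
        counts.update(set(record["shared_objects"]))
    return counts


def unexpected_libraries(record, trusted_prefixes):
    """List shared objects loaded from outside the trusted install prefixes."""
    prefixes = tuple(trusted_prefixes)
    return [p for p in record["shared_objects"] if not p.startswith(prefixes)]


def users_affected_by(records, library_path):
    """Map each user to the sorted job IDs that loaded a given (buggy) library."""
    affected = {}
    for record in records:
        if library_path in record["shared_objects"]:
            affected.setdefault(record["user"], set()).add(record["job_id"])
    return {user: sorted(jobs) for user, jobs in affected.items()}


# Illustrative data only; paths, users, and job IDs are made up.
records = [
    {"job_id": 101, "user": "alice",
     "shared_objects": ["/opt/apps/lib/libfftw3.so.3", "/home/alice/lib/libfoo.so"]},
    {"job_id": 102, "user": "bob",
     "shared_objects": ["/opt/apps/lib/libfftw3.so.3"]},
]

census = library_census(records)
stray = unexpected_libraries(records[0], ["/opt/apps/", "/usr/lib/"])
affected = users_affected_by(records, "/opt/apps/lib/libfftw3.so.3")
```

Trusted prefixes are written with a trailing slash so that `/usr/lib/` does not also match `/usr/lib64/`. In a real deployment these queries would more likely run in the backend than in client-side Python.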
This presentation will give an overview of the design and implementation of OGRT and demonstrate some of its capabilities when plugged into an Elasticsearch backend.