Brussels / 1 & 2 February 2025

Making Data Fun Again: Extending EESSI to improve Research Data Management


While digitalisation leads to ever-growing data collections, exploiting their full potential remains a challenge. Relevant data is difficult to find and not easy to use either. At the same time, making data available so that others can find and (re)use it is cumbersome. For more than a decade, various approaches - technical, organisational, cultural - have been pursued to improve the situation, yet at best one sees isolated solutions, and for many, research data management remains a hassle.

Solving the challenges of Research Data Management (RDM) requires coordinated efforts by the many parties involved. Often, research projects start by writing a Data Management Plan (DMP), then do their research, and at the end struggle to make their data publicly available as required by publishers and research funders.

This talk proposes extensions to EESSI - the European Environment for Scientific Software Installations (pronounced as "easy", https://eessi.io) - to help improve research data management. EESSI provides a large software stack of hundreds of software installations that are streamed on demand to client systems. It combines several FOSS tools including CernVM-FS (https://cernvm.cern.ch/fs/), Gentoo Prefix (https://wiki.gentoo.org/wiki/Project:Prefix), EasyBuild (https://easybuild.io) and Lmod (https://lmod.readthedocs.io/) to build a service that can be used by anyone on any Linux machine anywhere in the world. A user just needs to install and configure a small client on their system and is ready to do science - analysing or generating data - within a few minutes.

A key building block of EESSI is its compatibility layer, which provides system-level tools and libraries for basic functions such as file management, process creation, and so on. The compatibility layer uses Gentoo installed under a prefix. For each of the CPU families supported by EESSI - x86_64 and aarch64 (and soon riscv64) - EESSI needs just a single common installation of Gentoo. Because Gentoo is built from source, extending a few functions in that layer is particularly easy, and everyone who uses EESSI would be able to use these extensions.

Extending core functions of EESSI's compatibility layer would facilitate the logging of data accesses (which process accesses which data, and when) and simplify access to remote data (data that is not already present where the processing should happen).
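To make the idea of an access log concrete, here is a minimal Python sketch of the kind of record such logging might produce. It is purely illustrative: the talk proposes hooking into the compatibility layer itself, not into Python, and the names `AccessRecord`, `AccessLog` and `logged_open` are hypothetical.

```python
import os
import time
from dataclasses import dataclass, field


@dataclass
class AccessRecord:
    """One logged data access: which process touched which data, and when."""
    pid: int
    path: str
    mode: str         # "read" or "write"
    timestamp: float  # seconds since the epoch


@dataclass
class AccessLog:
    """Hypothetical in-memory log; a real extension would persist records."""
    records: list = field(default_factory=list)

    def logged_open(self, path, mode="r", **kwargs):
        # Treat any mode that can modify the file as a write access.
        kind = "write" if any(c in mode for c in "wax+") else "read"
        self.records.append(
            AccessRecord(os.getpid(), os.path.abspath(path), kind, time.time())
        )
        # Delegate to the built-in open() so behaviour is otherwise unchanged.
        return open(path, mode, **kwargs)
```

Code that uses `log.logged_open(...)` instead of `open(...)` behaves as before, while `log.records` accumulates the process, path, access type and time of every access.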

Logging information about data accesses facilitates the creation of data flow graphs, which can be used for optimisation, reproducibility, and in particular to automate the description of how results such as diagrams, tables, and other data products were obtained. Detailed descriptions may help others to understand how the data was processed, to apply the processing to new data, or to adapt the processing to their needs.
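As a rough illustration of how logged accesses could be turned into a data flow graph, the following Python sketch builds a graph from simple `(process, path, mode)` tuples and traces everything a result depends on. The record layout, the function names and the `proc:` prefix are assumptions made for this example; the sketch also ignores the order of accesses, which a real implementation would need to take into account.

```python
from collections import defaultdict


def build_flow_graph(records):
    """Map each node to the set of nodes that data flows in from.

    records: iterable of (process, path, mode), mode being "read" or "write".
    A read is an edge file -> process; a write is an edge process -> file.
    """
    preds = defaultdict(set)
    for process, path, mode in records:
        if mode == "read":
            preds[process].add(path)
        else:
            preds[path].add(process)
    return preds


def upstream(preds, node):
    """Return every file and process that `node` transitively depends on."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for pred in preds.get(current, ()):
            if pred not in seen:
                seen.add(pred)
                stack.append(pred)
    return seen


# Hypothetical log: a cleaning step followed by a plotting step.
records = [
    ("proc:clean", "raw.csv", "read"),
    ("proc:clean", "clean.csv", "write"),
    ("proc:plot", "clean.csv", "read"),
    ("proc:plot", "fig.png", "write"),
]
graph = build_flow_graph(records)
lineage = upstream(graph, "fig.png")
```

Here `lineage` contains both processes and both CSV files, which is the information needed to describe how `fig.png` was obtained.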

Today, more and more data sets are published and associated with a persistent identifier. Instead of downloading data sets in a separate step and then processing them, the EESSI compatibility layer can be extended such that data sets can be accessed directly via persistent identifiers. While it may seem like a small improvement, this would enrich the logging information. Without any further steps by the researcher, the data sources used would be identifiable and could be cited automatically. Results would be easier to reproduce and, last but not least, those who published the data would be acknowledged more easily.
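The following Python sketch illustrates how references that are persistent identifiers might be told apart from ordinary paths, and how a list of citable sources could be derived from logged accesses. It assumes DOIs as the identifier type (the talk does not name one), uses a simplified pattern rather than a full DOI validator, performs no network access, and all function names and example identifiers are hypothetical.

```python
import re

# Simplified DOI pattern: optional "doi:" or doi.org URL prefix, then "10.<registrant>/<suffix>".
DOI_PATTERN = re.compile(
    r"^(?:doi:|https?://(?:dx\.)?doi\.org/)?(10\.\d{4,9}/\S+)$",
    re.IGNORECASE,
)


def parse_doi(reference):
    """Return the bare DOI if `reference` looks like one, else None."""
    match = DOI_PATTERN.match(reference.strip())
    return match.group(1) if match else None


def resolver_url(reference):
    """Return the https://doi.org/ URL for a DOI reference, else None."""
    doi = parse_doi(reference)
    return f"https://doi.org/{doi}" if doi else None


def cited_sources(accessed):
    """Collect the distinct DOIs among accessed references, in first-seen order."""
    dois = []
    for reference in accessed:
        doi = parse_doi(reference)
        if doi and doi not in dois:
            dois.append(doi)
    return dois


# Made-up example accesses: two spellings of one DOI and a local file.
accessed = [
    "doi:10.1234/example.dataset",
    "/home/user/notes.txt",
    "https://doi.org/10.1234/example.dataset",
]
sources = cited_sources(accessed)
```

Because the identifier appears in the access itself, the list of sources to cite falls out of the log without any extra bookkeeping by the researcher.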

Speakers

Thomas Röblitz