Brussels / 3 & 4 February 2024


Automating Spark (and Pipeline) Upgrades While "Testing" in Production

With Spark 4 in the pipeline for this year, many of us are looking at what will be involved in upgrading to the latest and greatest Spark. This talk will look at the open-source tooling we use to automate upgrading thousands of our Spark pipelines (from Spark 2.X -> 3.4) and how to we used a variation of the write-audit-publish technique to validate the new pipelines in production for pipelines that might have less testing than ideal.

Seeing is a pre-requisite to believing, so the talk will include a short demo showing how the spark-upgrade tool works on a demo pipeline complete with "live" validation.

In this talk, you will learn how to: Upgrade your Spark pipelines without crying* Validating Spark (and other similar) pipelines even when you don't trust the tests (by extending the write-audit-publish pattern) on top of Iceberg, Hudi, or Delta Lake.

Time permitting, I will end with some exciting new (but non-backward compatible) changes coming in Spark 4 (tentatively scheduled for June, but it's software).

*Not a guarantee, some upgrades may still cause tears

Related links: - the upgrade tool being discussed - the code mod tool we used for scala - the python code refactoring tool we extended - the sql linting / cleanup tool we extended


Holden Karau