Brussels / 2 & 3 February 2019


Validating Big Data Jobs

An exploration with Spark & Airflow (+ friends)


If you, like close to half the industry, are deploying the results of your big data jobs into production automatically, then existing unit and integration tests may not be enough to prevent serious failures. Even if you aren't deploying results to production automatically, a more reliable deploy-to-production pipeline with automatic validation is well worth the time.

Validating big data jobs sounds expensive and hard, but with a variety of techniques it can be done relatively easily, with only minimal additional instrumentation overhead. We'll explore the kinds of instrumentation to add to your pipeline to make it easier to validate. For jobs with hard-to-meet SLAs, we'll also explore what can be done with existing metrics and parallel data validation jobs.
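One common low-overhead technique in this space is to record simple counters (records read, records written, null fields, etc.) during each run and compare them against the history of previous successful runs. A minimal sketch of such a check, in plain Python with hypothetical counter names and a hypothetical tolerance threshold:

```python
def validate_counts(current, history, tolerance=0.3):
    """Flag counters that drift too far from their historical mean.

    current: dict of counter name -> value for the run under validation
    history: list of counter dicts from previous successful runs
    Returns a list of failure messages; an empty list means the run passes.
    """
    failures = []
    for name, value in current.items():
        past = [run[name] for run in history if name in run]
        if not past:
            continue  # no baseline yet: skip rather than fail the run
        mean = sum(past) / len(past)
        if mean == 0:
            continue
        drift = abs(value - mean) / mean
        if drift > tolerance:
            failures.append(
                f"{name}: {value} deviates {drift:.0%} "
                f"from historical mean {mean:.0f}"
            )
    return failures


history = [{"records_out": 1000}, {"records_out": 1100}]
print(validate_counts({"records_out": 1050}, history))  # within tolerance
print(validate_counts({"records_out": 10}, history))    # flags the drop
```

In a Spark job the `current` counters would typically come from accumulators or from the job's output metrics; the relative-error check itself is deliberately cheap so it adds almost nothing to the pipeline's runtime.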

After exploring common industry practices for data validation, we'll look at how to integrate these checks into an Airflow pipeline while keeping it recoverable when manual validation overrules the automatic safeguards.
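One way to wire this up in Airflow (sketched here against the Airflow 1.x API that was current at the time of the talk; all task and Variable names are hypothetical) is a branch task that gates the production deploy on the validation result, with an Airflow Variable as the manual override so an operator can overrule a failed check from the UI without editing the DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator, PythonOperator


def run_validation(**context):
    # Placeholder: run the counter/size checks against the job's output
    # and push the result for the branch task to read.
    ok = True  # replace with the real validation result
    context["ti"].xcom_push(key="validation_ok", value=ok)


def choose_path(**context):
    # A human can set the "force_deploy" Variable in the Airflow UI to
    # overrule a failed automatic validation and let the pipeline recover.
    if Variable.get("force_deploy", default_var="false") == "true":
        return "deploy_to_production"
    ok = context["ti"].xcom_pull(task_ids="validate_output",
                                 key="validation_ok")
    return "deploy_to_production" if ok else "alert_and_hold"


dag = DAG("validated_deploy",
          start_date=datetime(2019, 2, 1),
          schedule_interval="@daily")

validate = PythonOperator(task_id="validate_output",
                          python_callable=run_validation,
                          provide_context=True, dag=dag)
branch = BranchPythonOperator(task_id="gate_deploy",
                              python_callable=choose_path,
                              provide_context=True, dag=dag)
deploy = DummyOperator(task_id="deploy_to_production", dag=dag)
hold = DummyOperator(task_id="alert_and_hold", dag=dag)

validate >> branch >> [deploy, hold]
```

Keeping the override in a Variable rather than in code means the "recover from a false positive" path is a UI action, not a redeploy.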

Speakers

Holden Karau