Brussels / 2 & 3 February 2019


Validating Big Data Jobs

An exploration with Spark & Airflow (+ friends)


If you, like close to half the industry, are deploying the results of your big data jobs into production automatically, then existing unit and integration tests may not be enough to prevent serious failures. Even if you aren't deploying results to production automatically, a more reliable deploy-to-production pipeline with automatic validation is well worth the time.

Validating big data jobs sounds expensive and hard, but with a variety of techniques it can be done relatively easily, with only minimal additional instrumentation overhead. We'll explore the kinds of instrumentation to add to your pipeline to make it easier to validate. For jobs with hard-to-meet SLAs, we'll also explore what can be done with existing metrics and parallel data validation jobs.
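One common low-overhead technique in this space is to record simple counters (records read, records written, null fields, etc.) during each run and compare them against the history of previous successful runs. A minimal sketch of such a check, in plain Python with hypothetical counter names and a hypothetical tolerance threshold:

```python
def validate_counts(current, history, tolerance=0.3):
    """Flag counters that drift too far from their historical mean.

    current: dict of counter name -> value for the run under validation
    history: list of counter dicts from previous successful runs
    Returns a list of failure messages; an empty list means the run passes.
    """
    failures = []
    for name, value in current.items():
        past = [run[name] for run in history if name in run]
        if not past:
            continue  # no baseline yet: skip rather than fail the run
        mean = sum(past) / len(past)
        if mean == 0:
            continue
        drift = abs(value - mean) / mean
        if drift > tolerance:
            failures.append(
                f"{name}: {value} deviates {drift:.0%} "
                f"from historical mean {mean:.0f}"
            )
    return failures


history = [{"records_out": 1000}, {"records_out": 1100}]
print(validate_counts({"records_out": 1050}, history))  # within tolerance
print(validate_counts({"records_out": 10}, history))    # flags the drop
```

In a Spark job the `current` counters would typically come from accumulators or from the job's output metrics; the relative-error check itself is deliberately cheap so it adds almost nothing to the pipeline's runtime.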

After exploring common industry practices for data validation, we'll look at how to integrate these checks into an Airflow pipeline while keeping it recoverable when manual validation overrules the automatic safeguards.
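One way to wire this up in Airflow (sketched here against the Airflow 1.x API that was current at the time of the talk; all task and Variable names are hypothetical) is a branch task that gates the production deploy on the validation result, with an Airflow Variable as the manual override so an operator can overrule a failed check from the UI without editing the DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator, PythonOperator


def run_validation(**context):
    # Placeholder: run the counter/size checks against the job's output
    # and push the result for the branch task to read.
    ok = True  # replace with the real validation result
    context["ti"].xcom_push(key="validation_ok", value=ok)


def choose_path(**context):
    # A human can set the "force_deploy" Variable in the Airflow UI to
    # overrule a failed automatic validation and let the pipeline recover.
    if Variable.get("force_deploy", default_var="false") == "true":
        return "deploy_to_production"
    ok = context["ti"].xcom_pull(task_ids="validate_output",
                                 key="validation_ok")
    return "deploy_to_production" if ok else "alert_and_hold"


dag = DAG("validated_deploy",
          start_date=datetime(2019, 2, 1),
          schedule_interval="@daily")

validate = PythonOperator(task_id="validate_output",
                          python_callable=run_validation,
                          provide_context=True, dag=dag)
branch = BranchPythonOperator(task_id="gate_deploy",
                              python_callable=choose_path,
                              provide_context=True, dag=dag)
deploy = DummyOperator(task_id="deploy_to_production", dag=dag)
hold = DummyOperator(task_id="alert_and_hold", dag=dag)

validate >> branch >> [deploy, hold]
```

Keeping the override in a Variable rather than in code means the "recover from a false positive" path is a UI action, not a redeploy.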

Speakers

Holden Karau