FOSDEM 2019

From Zero to Portability

Apache Beam's Journey to Cross-Language Data Processing

Track: HPC, Big Data and Data Science devroom
Room: UA2.118 (Henriot)
Day: Sunday
Start: 14:30
End: 14:55

Apache Beam is a programming model for composing parallel and distributed data processing jobs. Once composed, these jobs run on various execution engines like Apache Flink, Apache Spark, or Google Cloud Dataflow. But Apache Beam's vision goes beyond just running on multiple execution engines.

Like many other Apache projects, Beam first used Java as its API language. Unsatisfied with the status quo, Beam developers launched the portability project to enable other languages to run with Beam. Beam currently offers Java, Python, and Go APIs. That means users are not restricted to the Java ecosystem but can use their favorite Python libraries, such as NumPy or TensorFlow, with Apache Beam.

Ultimately, these languages won't just coexist in Apache Beam; they will complement each other in cross-language data processing jobs. For example, a job can read from Kafka with the Java connector and then process the data in Python.

In this talk we will learn how Beam supports multiple languages and why it can be a good idea to combine them in a single data processing job.