Not Less, Not More, Exactly Once: Large-Scale Stream Processing in Action
Large-scale data stream processing has come a long way to where it is today. It combines all the essential requirements of modern data analytics: sub-second latency, high throughput and, impressively, strong consistency. Apache Flink is a system that embodies these characteristics and is best known for its lightweight fault tolerance. Data engineers and analysts can now let the system handle terabytes of computational state without worrying about failures that can potentially occur.
In this talk, I am going to explain the fundamental challenges behind exactly-once processing guarantees in large-scale streaming in a simple and intuitive way. I will then guide you through the tricks and pitfalls we faced in our effort to make state management easy to use, transparent and yet extraordinarily powerful. Finally, I will demonstrate how you can declare state, and how the in-flight protocol running underneath your processing pipeline guarantees that your computation runs consistently and uninterrupted, indefinitely.
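To make the exactly-once idea concrete, here is a minimal, hypothetical sketch in plain Java (no Flink dependency; all names are my own): an operator holds keyed state, a checkpoint captures a copy of that state, and after a simulated failure the operator restores the checkpoint so that replayed events count exactly once in the final result.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of checkpoint-and-restore for keyed operator state.
// Not Flink's actual API: just the underlying idea of exactly-once recovery.
public class KeyedCountOperator {
    private Map<String, Long> counts = new HashMap<>();

    // Normal processing: count events per key.
    public void process(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    // Checkpoint: a consistent copy of the state, as a snapshot would persist it.
    public Map<String, Long> snapshot() {
        return new HashMap<>(counts);
    }

    // Recovery: discard partial state and reload the last completed checkpoint.
    public void restore(Map<String, Long> checkpoint) {
        counts = new HashMap<>(checkpoint);
    }

    public long get(String key) {
        return counts.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        KeyedCountOperator op = new KeyedCountOperator();
        op.process("a");
        op.process("b");
        Map<String, Long> ckpt = op.snapshot(); // checkpoint after two events

        op.process("a");      // this event is lost when the failure hits
        op.restore(ckpt);     // recover from the checkpoint ...
        op.process("a");      // ... and the event is replayed exactly once

        System.out.println(op.get("a")); // prints 2, not 3
    }
}
```

The point of the sketch is that state and input replay are coordinated: because the checkpoint and the replay position agree, the recovered result is identical to a failure-free run.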
In more detail, I am going to present the basic and extended versions of ABS (Asynchronous Barrier Snapshotting), our state-of-the-art in-flight snapshotting algorithm tailored to the needs of a dataflow graph. ABS needs no global synchronisation and works in a purely decentralised manner. It is considered one of the most lightweight mechanisms for checkpointing application state today, and it has been developed further together with the rest of the Flink community through additional optimisations. In fact, other open-source systems such as Apache Storm have recently incorporated the same mechanism due to its simplicity and effectiveness. One such optimisation, currently being merged into the system, is the adaptation of the algorithm to graphs that contain cycles, which I will also cover in this talk in more detail. Furthermore, I will briefly explain our current efforts to make state snapshots incremental, in order to better balance the trade-off between checkpointing load and recovery time.
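The core of ABS is barrier alignment at operators with multiple inputs. The following self-contained sketch (plain Java, no Flink dependency; the record and barrier types are mine, chosen for illustration) models a two-input operator: once a barrier arrives on one channel, that channel is blocked and its records buffered; when barriers have arrived on all channels, the operator snapshots its state, forwards the barrier, and replays the buffered records.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of ABS barrier alignment for a two-input operator.
// Records are Integers; a barrier is a distinguished marker object.
public class BarrierAlignment {
    static final Object BARRIER = new Object();

    long sum = 0;                            // operator state
    Long snapshot = null;                    // state captured at alignment
    Set<Integer> blocked = new HashSet<>();  // channels whose barrier arrived
    List<Integer> buffered = new ArrayList<>();

    void receive(int channel, Object event, int numChannels) {
        if (event == BARRIER) {
            blocked.add(channel);
            if (blocked.size() == numChannels) {
                // Barriers aligned: checkpoint state, then (conceptually)
                // forward the barrier downstream before resuming.
                snapshot = sum;
                blocked.clear();
                for (int r : buffered) sum += r;  // replay buffered records
                buffered.clear();
            }
        } else if (blocked.contains(channel)) {
            buffered.add((Integer) event);        // channel blocked: buffer
        } else {
            sum += (Integer) event;               // normal processing
        }
    }

    public static void main(String[] args) {
        BarrierAlignment op = new BarrierAlignment();
        int n = 2;                    // two input channels
        op.receive(0, 1, n);          // record on channel 0 -> sum = 1
        op.receive(0, BARRIER, n);    // barrier on channel 0: block it
        op.receive(1, 2, n);          // channel 1 still open -> sum = 3
        op.receive(0, 3, n);          // channel 0 blocked -> buffered
        op.receive(1, BARRIER, n);    // aligned: snapshot = 3, then sum = 6
        System.out.println(op.snapshot + " " + op.sum); // prints "3 6"
    }
}
```

Note how the snapshot (3) excludes the record that arrived after channel 0's barrier, while the running state (6) includes it: this separation is what lets the checkpoint cut the stream consistently without pausing the whole dataflow.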