Network Traffic Analysis of Hadoop Clusters
Understand the common usage patterns and identify typical / atypical workloads.
Cybersecurity is a broad topic, and many commercial products address it. We demonstrate a fundamental concept in network analysis: the reconstruction and visualization of temporal networks. We then apply the method to describe the operational conditions of a Hadoop cluster. Our experiments provide first results and allow the cluster state to be classified according to the current workload. The temporal networks differ significantly between operation modes. In reality, we would expect mixed workloads. If the workload parameters are known, atypical events can be handled accordingly: alerts can be raised based on context information rather than on packet content alone. We show an end-to-end example: (1) network traffic is collected with a Python sniffer script; (2) the traffic data is analyzed with Apache Hive and Apache Spark to create the temporal network; (3) the results are visualized with Gephi. As a next step, we plan to contribute to the Apache Spot project.
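As a rough illustration of step (2), the sketch below groups packet records into fixed time windows and sums the bytes exchanged per (source, destination) pair, giving one weighted edge list per window. It is not the speakers' Hive/Spark implementation; the record layout (timestamp, source, destination, size), the 60-second window, and the function name build_temporal_network are assumptions made for this example only.

```python
from collections import defaultdict


def build_temporal_network(packets, window_seconds=60):
    """Group packet records into per-window weighted edge lists.

    packets: iterable of (timestamp, src, dst, size_bytes) tuples.
    This record layout is an assumption for illustration.
    Returns {window_start: {(src, dst): total_bytes}}.
    """
    snapshots = defaultdict(lambda: defaultdict(int))
    for ts, src, dst, size in packets:
        # Align each timestamp to the start of its window.
        window_start = int(ts // window_seconds) * window_seconds
        # Edge weight = total bytes sent from src to dst in this window.
        snapshots[window_start][(src, dst)] += size
    # Convert to plain dicts, ordered by window start.
    return {w: dict(edges) for w, edges in sorted(snapshots.items())}


# Hypothetical sample records: (seconds, source IP, destination IP, bytes)
packets = [
    (0.5, "10.0.0.1", "10.0.0.2", 1500),
    (12.0, "10.0.0.1", "10.0.0.2", 500),
    (61.0, "10.0.0.2", "10.0.0.3", 800),
]
network = build_temporal_network(packets)
```

Each window's edge list is one snapshot of the temporal network. Comparing snapshots over time is what would let different workloads show up as different network structures.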
Expected prior knowledge / intended audience:
No special skills required, but minimal exposure to the Hadoop ecosystem is helpful.
Márton Balassi is a Solution Architect at Cloudera and a PMC member of Apache Flink. He focuses on Big Data application development, especially in the streaming space. Márton is a regular contributor to open source and has recently spoken at a number of open source Big Data conferences, including Hadoop Summit and Apache Big Data, as well as at meetups.
Mirko Kämpf is a Solution Architect at Cloudera and the initiator of the Etosha project. He holds a Diploma in Physics and has worked on several projects related to complex systems analysis. His focus is on time-dependent network analysis and time series analysis using tools from the Hadoop ecosystem, and especially on the related metadata management. Mirko actively uses open source tools, has written several articles for the Cloudera engineering blog, and speaks at Big Data conferences and meetups.
Links to previous talks by the speakers:
Hadoop Summit, Dublin, 2016 https://www.youtube.com/watch?v=mRhCpp-p11E
Flink Meetup, Berlin, 2016 https://www.youtube.com/watch?v=Rk8mVtGumPc&t=462s
Flink Forward, Berlin, 2016 https://www.youtube.com/watch?v=FtzXOLhZ-2c
Cloudera Technical Summit, Las Vegas, 2016 http://www.slideshare.net/mirkokaempf/from-events-to-networks-time-series-analysis-on-scale?qid=a3a3f939-19e4-4127-81a7-e963114d4110&v=&b=&from_search=1
GridKA, Karlsruhe, 2015 http://www.slideshare.net/mirkokaempf/apache-spark-in-scientific-applications?qid=b82c1d59-2098-409c-8b84-5570504c5546&v=&b=&from_search=4
GridKA, Karlsruhe, 2014 http://www.slideshare.net/mirkokaempf/hadoop-complex-systems-research?qid=a0eebdd3-b042-453d-9b65-a2e2301d09f8&v=&b=&from_search=6
Hadoop meetup, Munich, 2013 http://www.slideshare.net/mirkokaempf/munich-hug-20130522v2?qid=72841b48-efbf-442a-8b7f-0ea0db3b3ad4&v=&b=&from_search=5