Graph Analytics on Massively Parallel Processing Databases
As graph processing moves into the mainstream, a large number of specialized graph engines have emerged. However, for many enterprises, much of their important data resides in relational databases, and SQL is the most common workload. So is it reasonable to suggest that relational data processing engines can be used to solve graph problems in a productive and performant manner?
The answer to this question is: “Yes!”
In this talk, we will address the use of massively parallel processing (MPP) databases for graph analytics workloads. We will share some recent findings from the Apache MADlib (incubating) project, including the design of graph data structures, the implementation of common graph algorithms, and performance results.
Graph analytics is becoming an important part of enterprise computing. Although its roots in academia go back many decades, the last 10-15 years have seen a huge surge of interest in the topic, driven by a wide range of modern use cases, from cybersecurity to social networks to supply and distribution chains.
Enterprises have made significant investments in infrastructure, software, and employee training, all centered around SQL. So how can an enterprise add graph analytics to its business without the cost and complexity of moving to specialized graph processing engines? And what are the tradeoffs?
Graph analytics is a new area of innovation in Apache MADlib, which is a SQL-based open source library for scalable in-database analytics. It provides parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data.
Many existing analytics products do not scale in a way that makes it convenient and economical to operate on large data sets. The graph methods in Apache MADlib have been designed to take advantage of the shared-nothing, scale-out parallelism offered by modern parallel database engines.
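As a rough illustration of how a graph problem can be expressed in relational terms, the sketch below stores a graph as an edge table and computes single-source shortest paths by repeatedly joining known distances against outgoing edges. This is not the Apache MADlib API or its implementation: the table names (`edge`, `dist`), column names, and the `shortest_paths` helper are all hypothetical, and SQLite stands in for an MPP database, so no parallelism is shown. It assumes non-negative edge weights.

```python
import sqlite3


def shortest_paths(edges, source):
    """Single-source shortest paths via repeated relational joins.

    Hypothetical helper for illustration only (not a MADlib function).
    edges: iterable of (src, dest, weight) tuples, weights assumed >= 0.
    Returns a dict mapping each reachable vertex to its distance from source.
    """
    conn = sqlite3.connect(":memory:")
    # The graph is just an edge table; distances live in a second table.
    conn.execute("CREATE TABLE edge (src INTEGER, dest INTEGER, weight REAL)")
    conn.execute("CREATE TABLE dist (vertex INTEGER PRIMARY KEY, distance REAL)")
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", list(edges))
    conn.execute("INSERT INTO dist VALUES (?, 0.0)", (source,))

    while True:
        # One relaxation round: join known distances with outgoing edges,
        # keep the best candidate per destination, and retain only the
        # candidates that are new or strictly better than the current value.
        improved = conn.execute(
            """
            SELECT c.dest, c.cand
            FROM (SELECT e.dest AS dest,
                         MIN(known.distance + e.weight) AS cand
                  FROM dist AS known
                  JOIN edge AS e ON e.src = known.vertex
                  GROUP BY e.dest) AS c
            LEFT JOIN dist AS cur ON cur.vertex = c.dest
            WHERE cur.vertex IS NULL OR c.cand < cur.distance
            """
        ).fetchall()
        if not improved:
            break  # no distance changed, so the result has converged
        conn.executemany(
            "INSERT OR REPLACE INTO dist (vertex, distance) VALUES (?, ?)",
            improved,
        )

    result = dict(conn.execute("SELECT vertex, distance FROM dist"))
    conn.close()
    return result
```

Each round is an ordinary join plus aggregation, which is the kind of operation a shared-nothing engine can distribute across segments; the loop around it is the only non-relational part of the sketch.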
I look forward to presenting this topic at FOSDEM 17!