FOSDEM 2019

Feature store: A Data Management Layer for Machine Learning

Data Management for ML

Track: HPC, Big Data and Data Science devroom
Room: UA2.118 (Henriot)
Day: Sunday
Start: 12:50
End: 13:00

Data may be the new oil, but refined data is the fuel for AI. Machine learning (ML) systems are only as good as the data they are trained on, and getting the data in the right format at the right time is a challenge. ML systems are trained using sets of features. A feature can be as simple as the value of a column in a database entry, or it can be a complex value that is computed from diverse sources.

A feature store is a central vault for storing documented and curated features, ideally with support for access control. A feature store enables automatic feature analysis and monitoring, feature sharing across models and teams, feature discovery, feature backfilling, and feature versioning. The feature store is a data management layer that fills an important gap in modern machine learning infrastructure. It empowers enterprises to scale their machine learning workflows and make full use of their investment in machine learning.
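As a rough illustration of the ideas above (versioning, discovery, and sharing features across models), here is a minimal in-memory sketch in Python. It is not the HopsML or Hops feature store API. The class `ToyFeatureStore`, its methods, and the feature names are all hypothetical and exist only to make the concepts concrete.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FeatureGroup:
    """One documented, versioned feature (hypothetical structure)."""
    name: str
    version: int
    description: str
    values: Dict[str, float] = field(default_factory=dict)  # entity id -> value


class ToyFeatureStore:
    """Illustrative in-memory store; a real one would persist data and enforce access control."""

    def __init__(self) -> None:
        self._groups: Dict[Tuple[str, int], FeatureGroup] = {}

    def latest_version(self, name: str) -> int:
        # Returns 0 when the feature has never been registered.
        return max((v for (n, v) in self._groups if n == name), default=0)

    def register(self, name: str, description: str, values: Dict[str, float]) -> FeatureGroup:
        # Versioning: every registration creates a new version, old ones stay readable.
        group = FeatureGroup(name, self.latest_version(name) + 1, description, dict(values))
        self._groups[(name, group.version)] = group
        return group

    def get(self, name: str, version: Optional[int] = None) -> FeatureGroup:
        # Defaults to the latest version; raises KeyError for unknown features.
        if version is None:
            version = self.latest_version(name)
        return self._groups[(name, version)]

    def search(self, keyword: str) -> List[str]:
        # Discovery: match the keyword against names and descriptions.
        kw = keyword.lower()
        return sorted({
            n for (n, _), g in self._groups.items()
            if kw in n.lower() or kw in g.description.lower()
        })

    def training_rows(self, names: List[str]) -> Dict[str, List[float]]:
        # Sharing: join the latest version of each feature on the entity id.
        groups = [self.get(n) for n in names]
        if not groups:
            return {}
        shared = set.intersection(*(set(g.values) for g in groups))
        return {e: [g.values[e] for g in groups] for e in sorted(shared)}


store = ToyFeatureStore()
store.register("avg_basket_value", "Mean purchase amount per customer",
               {"c1": 20.0, "c2": 35.5})
store.register("avg_basket_value", "Mean purchase amount per customer, outliers removed",
               {"c1": 19.0, "c2": 30.0, "c3": 12.0})
store.register("days_since_signup", "Customer account age in days",
               {"c1": 400.0, "c3": 12.0})

# Two models could request the same curated features instead of recomputing them.
rows = store.training_rows(["avg_basket_value", "days_since_signup"])
```

In this sketch, registering `avg_basket_value` a second time creates version 2 while version 1 remains retrievable, which is the property that makes experiments reproducible. Backfilling, monitoring, and access control are deliberately left out.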

In this talk, we will present key points on how to take your machine learning workflow to the next level using a feature store, and demonstrate how the feature store fits into the larger machine learning pipeline. We will introduce HopsML, an open-source, end-to-end machine learning pipeline built on the world's fastest and most scalable Hadoop distribution, Hops Hadoop. With HopsML you can build production-ready machine learning pipelines using open-source software, where features are stored in a shared feature store that is automatically backfilled as new data arrive, where machine learning models can be trained on datasets on the order of billions of examples using distributed deep learning, where data scientists can follow engineering principles by using versioned and reproducible experiments, and where models can be automatically deployed in an elastic manner using auto-scaling.