PyData Global 2022

Data Validation for Feature pipelines: Using Great Expectations and Hopsworks
12-01, 09:00–09:30 (UTC), Talk Track II

Have you ever trained an awesome model, only to have it break in production because of a null value? At its core, a feature store needs to provide reliable features that data scientists can use to build and productionize models. So how can we avoid garbage-in, garbage-out situations? Great Expectations is the most popular Python library for data validation, which makes it a natural fit for feature stores. In this talk we will briefly touch upon other Python data validation libraries, such as Pydantic and Pandera, and then dive deeper into Great Expectations' concepts and how you can leverage them in the feature pipelines powering a feature store.


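As a tiny illustration of the failure mode the abstract opens with, here is a hand-rolled, stdlib-only sketch of the kind of check that data validation libraries automate. This is not the Great Expectations API; the function name merely mirrors its naming style:

```python
# Minimal, hand-rolled stand-in for a data validation check.
# NOT the Great Expectations API -- it only illustrates the idea:
# declare an expectation about a column, get a pass/fail result back.

def expect_column_values_to_be_not_null(rows, column):
    """Return a small validation result for `column` over `rows`."""
    bad = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"success": not bad, "unexpected_index_list": bad}

# A feature batch with a null that would silently break a model downstream.
batch = [
    {"customer_id": 1, "avg_spend": 42.0},
    {"customer_id": 2, "avg_spend": None},
]

result = expect_column_values_to_be_not_null(batch, "avg_spend")
# result["success"] is False; index 1 holds the offending null.
```

Running such checks on every batch before it reaches training or serving is exactly the "garbage in, garbage out" guard the talk is about.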

After this talk you will …
1. Understand the trade-offs and different uses of three data validation libraries: Pydantic, Pandera, and Great Expectations.
2. Understand the core concepts of Great Expectations and what each is for.
3. Understand the core principles of a feature store.
4. Understand how and why data validation fits into the workflow with a feature store.
5. Learn how we leverage Great Expectations in Hopsworks Feature Store to enhance data quality.
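To make points 4 and 5 concrete before the talk itself, here is a library-agnostic sketch of where validation sits in a feature pipeline: a batch of features is validated at ingestion, so bad data is rejected before it ever reaches the store. All names here (`FeatureStore`, `insert_validated`, the expectation callables) are hypothetical illustrations, not the Hopsworks or Great Expectations APIs:

```python
# Hypothetical sketch: data validation as a gate at feature-store
# ingestion. None of these names come from Hopsworks; they only
# illustrate the flow "feature pipeline -> validate -> insert (or reject)".

class ValidationError(Exception):
    pass

class FeatureStore:
    def __init__(self):
        self.tables = {}

    def insert_validated(self, name, rows, expectations):
        # Run every expectation against the incoming batch first.
        failures = [exp.__name__ for exp in expectations if not exp(rows)]
        if failures:
            # Reject the whole batch: garbage never reaches the store.
            raise ValidationError(f"batch for '{name}' failed: {failures}")
        self.tables.setdefault(name, []).extend(rows)

# Two example expectations, in the spirit of Great Expectations.
def no_null_spend(rows):
    return all(r.get("avg_spend") is not None for r in rows)

def positive_spend(rows):
    return all(r["avg_spend"] >= 0 for r in rows if r.get("avg_spend") is not None)

store = FeatureStore()
store.insert_validated(
    "customer_features",
    [{"customer_id": 1, "avg_spend": 42.0}],
    [no_null_spend, positive_spend],
)  # accepted: the batch passes both expectations

try:
    store.insert_validated(
        "customer_features",
        [{"customer_id": 2, "avg_spend": None}],
        [no_null_spend, positive_spend],
    )
except ValidationError:
    pass  # the bad batch is rejected before it can pollute the store
```

The design choice this sketches is the one the talk motivates: validating once, at the point where features are written, protects every downstream model instead of each training job re-checking the data itself.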


Prior Knowledge Expected: Previous knowledge expected

Moritz Meister is a Software Engineer at Hopsworks, leading the development of the Hopsworks Feature Store. He has a background in Econometrics and holds MSc degrees in Computer Science from Politecnico di Milano and Universidad Politecnica de Madrid. He has previously worked as a Data Scientist on projects for Deutsche Telekom and Deutsche Lufthansa in Germany, helping them to productionize machine learning models to improve customer relationship management.