12-03, 14:00–14:30 (UTC), Talk Track I
In this talk we present Hamilton, a novel open-source framework for developing and maintaining scalable feature engineering dataflows. Hamilton was initially built to solve the problem of managing a codebase of transforms on pandas dataframes, enabling a data science team to scale the capabilities they offer with the complexity of their business. Since then, it has grown into a general-purpose tool for writing and maintaining dataflows in Python. We introduce the framework, discuss its motivations and initial successes at Stitch Fix, and share recent extensions that seamlessly integrate it with distributed compute offerings such as Dask, Ray, and Spark.
At Stitch Fix, a data science team’s feature generation process was causing iteration and operational frustrations in delivering time-series forecasts for the business. In this talk I’ll present Hamilton, a novel open-source Python framework that solved their pain points by changing their working paradigm.
Specifically, Hamilton enables a simpler approach for data science & data engineering teams to create, maintain, execute, and scale both the human and computational sides of feature/data transforms.
At a high level, we will cover:
- What Hamilton is and why it was created
- How to use it for feature engineering
- The software engineering best practices Hamilton prescribes that make pipelines more sustainable
- How Hamilton enables out-of-the-box scaling with common distributed compute frameworks
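To make the paradigm in the bullets above concrete, here is a minimal, stdlib-only sketch of the core idea (this is an illustration, not Hamilton's actual API): each transform is a plain Python function whose name declares the output it produces and whose parameter names declare the inputs it needs, so a driver can wire the dependency graph automatically from function signatures.

```python
import inspect

# Illustrative sketch of a declarative dataflow (not Hamilton's real API):
# a function's name is its output; its parameter names are its inputs.

def spend_mean(spend: list) -> float:
    """Average marketing spend."""
    return sum(spend) / len(spend)

def spend_zero_mean(spend: list, spend_mean: float) -> list:
    """Spend centered around zero."""
    return [s - spend_mean for s in spend]

def spend_per_signup(spend: list, signups: list) -> list:
    """Marketing spend divided by signups, element-wise."""
    return [s / n for s, n in zip(spend, signups)]

def execute(funcs, requested, inputs):
    """Tiny driver: resolve each requested output by recursively
    computing its dependencies, discovered via parameter names."""
    available = {f.__name__: f for f in funcs}
    results = dict(inputs)

    def resolve(name):
        if name in results:
            return results[name]
        fn = available[name]
        kwargs = {p: resolve(p) for p in inspect.signature(fn).parameters}
        results[name] = fn(**kwargs)
        return results[name]

    return {name: resolve(name) for name in requested}

out = execute(
    [spend_mean, spend_zero_mean, spend_per_signup],
    ["spend_per_signup", "spend_zero_mean"],
    inputs={"spend": [10.0, 20.0, 30.0], "signups": [1.0, 2.0, 3.0]},
)
print(out["spend_per_signup"])  # [10.0, 10.0, 10.0]
print(out["spend_zero_mean"])   # [-10.0, 0.0, 10.0]
```

Because each transform is just a named, typed function, it is individually unit-testable and self-documenting, which is what makes the pipelines sustainable as the codebase grows.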
At a low level, you will learn:
- How a data science team at Stitch Fix scaled their team and code base with Hamilton to enable documentation-friendly, unit-testable code
- What Hamilton is and how the declarative paradigm it prescribes offers advantages over more traditional approaches
- How you can easily add runtime data quality checks to ensure the robustness of your pipeline
- How Hamilton’s Ray/Dask/Spark integrations work and how they can help you scale your dataflows
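The runtime data quality checks mentioned above can be sketched as a decorator that validates a transform’s output when it runs. The following is a stdlib-only illustration of the idea; Hamilton’s actual decorator and its arguments may differ:

```python
import functools

def check_output(min_value=None, max_value=None, allow_none=False):
    """Hypothetical decorator: validate each element of a transform's
    output at runtime, raising ValueError if a check fails."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            for value in result:
                if value is None:
                    if not allow_none:
                        raise ValueError(f"{fn.__name__}: None not allowed")
                    continue
                if min_value is not None and value < min_value:
                    raise ValueError(f"{fn.__name__}: {value} < {min_value}")
                if max_value is not None and value > max_value:
                    raise ValueError(f"{fn.__name__}: {value} > {max_value}")
            return result
        return wrapper
    return decorator

@check_output(min_value=0.0, max_value=1.0)
def conversion_rate(signups: list, visits: list) -> list:
    """Signups per visit; must always fall in [0, 1]."""
    return [s / v for s, v in zip(signups, visits)]

print(conversion_rate([5.0, 10.0], [100.0, 200.0]))  # [0.05, 0.05]
```

Attaching checks at the function level means every run of the pipeline validates its intermediate outputs, catching bad data at the transform that produced it rather than downstream.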
No previous knowledge expected
Elijah has always enjoyed working at the intersection of math and engineering. More recently, he has focused his career on building tools to make data scientists more productive. At Two Sigma, he built infrastructure to help quantitative researchers efficiently turn ideas into production trading models. At Stitch Fix he leads the Model Lifecycle team — a team that focuses on streamlining the experience for data scientists to create and ship machine learning models. In his spare time, he enjoys geeking out about fractals, poring over antique maps, and playing jazz piano.
A hands-on leader and Silicon Valley veteran, Stefan has spent the last 15 years working on data and machine learning systems at companies like Stitch Fix, Nextdoor and LinkedIn.
Most recently, Stefan led the Model Lifecycle team at Stitch Fix. Its mission was to streamline the model productionization process for 100+ data scientists and machine learning engineers. The infrastructure they built created and tracked tens of thousands of models, and provided automated deployment adhering to MLOps best practices.
A regular conference speaker, Stefan has guest lectured at Stanford’s Machine Learning Systems Design course and is an author of a popular open source framework called Hamilton.