Sandy Ryza
Sandy works at Elementl as the lead engineer for the Dagster project. Previously, he led machine learning and data science teams at KeepTruckin and Clover Health. He's a committer on Spark and Hadoop, and co-authored O'Reilly's Advanced Analytics with Spark.
Sessions
Data pipelines consist of graphs of computations that produce and consume data assets like tables and ML models.
Data practitioners often use workflow engines like Airflow to define and manage their data pipelines. But these tools are an odd fit: they schedule tasks, but miss that those tasks exist to produce and maintain data assets. They struggle to represent dependencies more complex than “run X after Y finishes”, and they lose track of data lineage.
Dagster is an open-source framework and orchestrator built to help data practitioners develop, test, and run data pipelines. It takes a declarative approach to data orchestration that starts with defining data assets that are supposed to exist and the upstream data assets that they’re derived from.
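To make the declarative idea concrete, here is a minimal plain-Python sketch. It does not use Dagster and is not Dagster's API; the `asset` decorator, the `ASSETS` registry, the `upstream` and `materialize` helpers, and the example asset names are all hypothetical, chosen only to illustrate the idea. Each function declares a data asset that should exist, and its parameter names declare the upstream assets it is derived from.

```python
import inspect

# Hypothetical registry: asset name -> function that computes the asset.
ASSETS = {}


def asset(fn):
    """Register fn as the definition of the data asset named after it."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    # A source asset with no upstream dependencies (made-up sample rows).
    return [{"id": 1, "amount": 30}, {"id": 2, "amount": 70}]


@asset
def order_total(raw_orders):
    # Derived asset: depends on raw_orders because it names it as a parameter.
    return sum(row["amount"] for row in raw_orders)


def upstream(name):
    """Return the names of the assets that the given asset is derived from."""
    return list(inspect.signature(ASSETS[name]).parameters)


def materialize(name, cache=None):
    """Compute an asset, computing each of its upstream assets first."""
    cache = {} if cache is None else cache
    if name not in cache:
        inputs = [materialize(dep, cache) for dep in upstream(name)]
        cache[name] = ASSETS[name](*inputs)
    return cache[name]
```

Calling `materialize("order_total")` computes `raw_orders` first and then the total, and `upstream("order_total")` reports the lineage. The execution order is never spelled out as "run X after Y"; it falls out of the declared asset dependencies.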
Attendees of this session will learn how to develop and maintain data pipelines in a way that makes their datasets and ML models dramatically easier to trust and evolve.