PyData Global 2022

Data pipelines != workflows: orchestrating data with Dagster
12-02, 19:00–19:30 (UTC), Talk Track II

Data pipelines consist of graphs of computations that produce and consume data assets like tables and ML models.

Data practitioners often use workflow engines like Airflow to define and manage their data pipelines. But these tools are an odd fit: they schedule tasks without understanding that those tasks exist to produce and maintain data assets. They struggle to represent dependencies more complex than “run X after Y finishes” and lose the trail on data lineage. And while they manage production workflows, they make it hard to work with pipelines in local development, unit tests, CI, code review, and debugging.
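
For contrast, here is a minimal sketch of that task-ordering model in a typical Airflow DAG. The DAG and task names are hypothetical and the task bodies are stubbed out; the point is that Airflow sees an ordering of opaque tasks, not the tables they produce or consume.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Hypothetical two-task pipeline: Airflow knows "run transform_users
    # after extract_users finishes", but nothing about the data involved.
    with DAG(
        dag_id="users_pipeline",
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,
    ) as dag:
        extract = PythonOperator(task_id="extract_users", python_callable=lambda: None)
        transform = PythonOperator(task_id="transform_users", python_callable=lambda: None)

        extract >> transform  # ordering is the only dependency Airflow records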

Dagster is an open-source framework and orchestrator built to help data practitioners develop, test, and run data pipelines. It takes a declarative approach to data orchestration that starts with defining the data assets that are supposed to exist and the upstream assets they’re derived from. It lets the git repo become the source of truth on data, so pushing data changes feels as safe as pushing code changes. It supports an organization-wide data asset lineage graph that can be subsetted for scheduled or ad-hoc execution. And it’s built to support working with pipelines in local development, unit testing, CI, code review, staging environments, and debugging.
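
By way of illustration, here is a minimal sketch of the same two-step pipeline written against Dagster’s asset-based API; the asset names and bodies are hypothetical:

    from dagster import asset, materialize

    @asset
    def raw_users():
        # In a real pipeline this might pull rows from an API or a warehouse.
        return [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

    @asset
    def cleaned_users(raw_users):
        # The raw_users parameter declares the upstream asset this one is
        # derived from, so lineage is part of the definition itself.
        return [u for u in raw_users if u["name"]]

    # Assets can be materialized ad hoc, e.g. in a unit test or local session.
    result = materialize([raw_users, cleaned_users])
    assert result.success

Because the graph is defined in terms of assets rather than task IDs, subsets of it can be selected for scheduled or ad-hoc materialization.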

Attendees of this session will learn how to develop and maintain data pipelines in a way that makes their datasets and ML models dramatically easier to trust and evolve.


Prior Knowledge Expected

No previous knowledge expected

Sandy works at Elementl as the lead engineer for the Dagster project. Previously, he led machine learning and data science teams at KeepTruckin and Clover Health. He's a committer on Spark and Hadoop, and co-authored O'Reilly's Advanced Analytics with Spark.