PyData Global 2022

Building Data Products in a Lakehouse using Trino, dbt, and Dagster
12-01, 11:30–13:00 (UTC), Workshop/Tutorial II

Build data pipelines using Trino and dbt, combining heterogeneous data sources without having to copy everything into a single system. Manage access to your data products using modern and flexible security principles, from authentication methods to fine-grained access control. Run and monitor your data pipelines using Dagster.
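
As a taste of the security piece, here is a minimal sketch of authenticating to a Trino cluster from Python with the trino client package. The hostname, user, and credentials are placeholders, and any fine-grained access rules (catalog, schema, or table level) are enforced server-side by Trino once the user is authenticated.

```python
# A minimal sketch: password (basic) authentication with the trino client.
# Hostname, port, user, and password below are hypothetical placeholders.
from trino.auth import BasicAuthentication
from trino.dbapi import connect

conn = connect(
    host="trino.example.com",   # hypothetical cluster hostname
    port=443,
    user="analyst",
    http_scheme="https",        # password auth requires TLS
    auth=BasicAuthentication("analyst", "secret"),
)

# Whatever this user may see from here on is governed by the cluster's
# access control rules, not by the client.
cur = conn.cursor()
cur.execute("SELECT current_user")
print(cur.fetchall())
```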


Data engineers struggle with enormous amounts of data and a lack of management, governance, and transparency. The variety of data sources and formats can be overwhelming and requires working with many different tools. This tutorial shows how to use open source technologies to resolve common data problems in organizations. Using Trino/Starburst as a Data Mesh platform can help businesses standardize processes and provides a single platform with a holistic view over different data sources, such as NoSQL databases, SQL databases, and data lakes.
It also lets you build data pipelines over different data sources using standard SQL, without having to move the data into one single system.
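
As an illustration of that idea, here is a hedged sketch of a federated query issued through the trino Python client, joining a table in a PostgreSQL catalog with one in a data lake catalog in a single SQL statement. The catalog, schema, and table names are hypothetical and assume both catalogs are configured on the cluster.

```python
# A sketch of a federated query: PostgreSQL data joined with data lake data
# in one SQL statement. The `postgresql` and `datalake` catalogs, schemas,
# and tables are hypothetical examples.
from trino.dbapi import connect

conn = connect(host="localhost", port=8080, user="admin")
cur = conn.cursor()

cur.execute("""
    SELECT c.customer_id,
           c.country,
           sum(o.total_price) AS lifetime_value
    FROM postgresql.public.customers AS c
    JOIN datalake.sales.orders AS o
      ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.country
""")

for row in cur.fetchmany(10):
    print(row)
```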

Breakdown:
- Intro to Trino
  - Architecture
  - Data federation
  - Demonstration using Docker
- Business case
  - E-commerce
- Building a data pipeline spanning multiple data sources using dbt and Trino
  - Build a simple data product using dbt
  - Incremental loads (snapshot and merge)
- Make your data pipeline observable using Dagster (see the sketch after this list)
  - Model your pipeline in Dagster
  - Run your pipeline
  - Monitor your pipeline
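
To make the dbt and Dagster parts concrete, below is a hedged sketch of wiring a dbt-on-Trino project into Dagster with the dagster-dbt integration, so every dbt model, incremental ones included, becomes an observable Dagster asset. The project path, job name, and schedule are assumptions rather than the workshop's actual code.

```python
# A sketch, assuming a dbt project at ./ecommerce_dbt configured with the
# dbt-trino adapter. Paths, names, and the schedule are hypothetical.
from dagster import Definitions, ScheduleDefinition, define_asset_job
from dagster_dbt import dbt_cli_resource, load_assets_from_dbt_project

DBT_PROJECT_DIR = "ecommerce_dbt"  # hypothetical project directory

# Every dbt model becomes a Dagster asset, so lineage, run history,
# and failures show up in the Dagster UI.
dbt_assets = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR)

defs = Definitions(
    assets=dbt_assets,
    resources={
        # The assets invoke dbt through this resource.
        "dbt": dbt_cli_resource.configured({"project_dir": DBT_PROJECT_DIR}),
    },
    schedules=[
        # Rebuild all assets every morning; incremental models only
        # process new or changed rows on each run.
        ScheduleDefinition(
            job=define_asset_job("refresh_ecommerce", selection="*"),
            cron_schedule="0 6 * * *",
        )
    ],
)
```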


Prior Knowledge Expected

No previous knowledge expected

Software Engineer at Starburst. Passionate about Data Engineering, Big Data, and all things data-related.

Michiel is a long-time Software & Data Engineer specializing in implementing Data Platforms.

At Starburst Data, Michiel is hyper-focused on improving the Starburst Python and Lakehouse ecosystem.

In his spare time, Michiel also maintains the popular VS Code dbt extension, dbt Power User.