PyData Global 2022

Machine Learning in the Warehouse with Python
12-02, 17:00–17:30 (UTC), Talk Track II

Moving data in and out of a warehouse is both tedious and time-consuming. In this talk, we will demonstrate a new approach using the Snowpark Python library. Snowpark for Python is a new Pythonic interface to Snowflake warehouses: it lets you query data through DataFrames instead of SQL strings, use open-source packages, and run your model without moving your data out of the warehouse. We will discuss the framework and showcase how data scientists can design and train a model end-to-end, upload it to a warehouse, and append new predictions, all from a notebook.


Objective: If you are a data scientist who already stores data in a warehouse, this talk will demonstrate how to run ML models with the new Snowpark Python library. If you are new to warehouse data storage, the demonstration walks through integrating a Snowflake database with a Python notebook.

(10-15 mins) Snowpark Overview: We will run through the process of transforming data, training a model, and running the model while keeping all the data in one place. The Snowpark library provides an intuitive API for querying and processing data in a data pipeline.
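As a rough illustration of the overview above, the sketch below shows how a notebook might open a Snowpark session and build a query with DataFrame methods rather than a SQL string. It is not the talk's notebook code. The `SNOWFLAKE_*` environment variable names, the `ORDERS` table, and its columns are assumptions made up for this example, and the Snowpark calls need the `snowflake-snowpark-python` package plus a reachable Snowflake account.

```python
import os


def connection_parameters(env=os.environ):
    """Collect Snowflake connection settings from environment variables.

    The SNOWFLAKE_* variable names are an assumption made for this sketch.
    """
    keys = ("account", "user", "password", "role", "warehouse", "database", "schema")
    params = {}
    for key in keys:
        value = env.get("SNOWFLAKE_" + key.upper())
        if value:  # skip settings that are missing or empty
            params[key] = value
    return params


def open_session(params):
    """Open a Snowpark session (requires snowflake-snowpark-python)."""
    # Imported lazily so the helper above works without the package installed.
    from snowflake.snowpark import Session

    return Session.builder.configs(params).create()


def shipped_totals(session, min_total=100.0):
    """Build a lazy query with DataFrame methods instead of a SQL string.

    ORDERS, STATUS, CUSTOMER_ID and AMOUNT are hypothetical names.
    """
    from snowflake.snowpark.functions import col, sum as sum_

    return (
        session.table("ORDERS")
        .filter(col("STATUS") == "SHIPPED")
        .group_by("CUSTOMER_ID")
        .agg(sum_(col("AMOUNT")).alias("TOTAL"))
        .filter(col("TOTAL") >= min_total)
        .sort(col("TOTAL").desc())
    )
```

In a notebook cell, `shipped_totals(open_session(connection_parameters())).show()` would then run the query. The DataFrame is evaluated lazily, so the filtering and aggregation execute inside the warehouse only when an action such as `show()` is called.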

(15 mins) ML Model Demonstration: Attendees will be able to open the notebook, run the code themselves, and leave with a more seamless ML workflow built on a Python pipeline.

Thesis: Snowpark speeds up Python-based workflows by providing seamless access to open-source packages and a package manager through its Anaconda integration, without having to move data.

This talk is for data scientists who are familiar with data warehouses. A background in writing ML models in Python is recommended but not necessary, as we will go over the process from start to finish and provide all the code.


Prior Knowledge Expected

No previous knowledge expected

As a data scientist at Deepnote, I have the privilege of partnering with developers all over the world in order to help them promote their tools to the broader scientific community. By demonstrating the leading data science tools in Deepnote, scientists and developers can easily onboard to new concepts and techniques.

My degree in cognitive and behavioural neuroscience helped me realize my dual passion for (1) developing scientific software and (2) communicating technical concepts in a straightforward manner. My main goal is to find creative ways to lower the barrier to entry for scientists who are learning new tools.

To this end, I've published two peer-reviewed statistical software libraries. The most notable is Hypothesize, a Python library for robust statistics based on Rand Wilcox's R package. I continue to deliver workshops on robust statistics, data visualization, and data science tooling in general.
