PyData Global 2022

Visually Inspecting Data Profiles for Data Distribution Shifts
12-03, 13:00–14:30 (UTC), Workshop/Tutorial I

The real world is a constant source of ever-changing, non-stationary data, which ultimately means that even the best ML models will eventually go stale. Data distribution shifts, in all of their forms, are among the major post-deployment concerns for any ML/data practitioner. As organizations increasingly rely on ML performing as intended outside of the lab, the need for efficient debugging and troubleshooting tools in the ML operations world grows accordingly. That becomes especially challenging once common production requirements, such as scalability, privacy, security, and real-time constraints, are taken into account.

In this tutorial, Data Scientist Felipe Adachi will cover different types of data distribution shift and how these issues can affect your ML application. He will then discuss the challenges of detecting distribution shift in a lightweight and scalable manner by calculating approximate statistics for drift measurements. Finally, he will walk through practical steps that data scientists and ML engineers can take to surface data distribution shift issues, such as visually inspecting histograms, applying statistical tests, and enforcing data validation checks.

Requirements: Access to Google Colab Environment

Additional Material: https://colab.research.google.com/drive/1xOcAq8NwPazmQFhXVEvzRxXw5LiFqvfj?usp=sharing


Tutorial Outline

Session 1 - Data Distribution Shift (25min + 5min Q&A)

In this session, we’ll introduce the concept of data distribution shift and explain exactly why it is a problem for ML practitioners. We will cover different types of distribution shift and how to measure them with popular statistical packages.

This is a theoretical session with hands-on examples.
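As a taste of the measurement side, a two-sample Kolmogorov-Smirnov test from SciPy is one popular way to quantify how far a feature has drifted from a reference distribution. A minimal sketch, using synthetic placeholder data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    # Placeholder data: a "training" sample and a slightly shifted "production" sample
    reference = rng.normal(loc=0.0, scale=1.0, size=1_000)
    current = rng.normal(loc=0.5, scale=1.0, size=1_000)

    # Two-sample KS test: a small p-value suggests the two distributions differ
    statistic, p_value = stats.ks_2samp(reference, current)
    print(f"KS statistic={statistic:.3f}, p-value={p_value:.3g}")
    if p_value < 0.05:
        print("Possible distribution shift detected")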

Session 2 - Facing the Real World (10min + 5min Q&A)

In the real world, data is not always readily available in the form we would like. In this session, we’ll cover several challenges the real world presents, and how we can leverage data logging with the whylogs package to help us overcome them.

This is a theoretical session with hands-on examples.
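To preview the approach, whylogs condenses a batch of data into a compact statistical profile that can be stored and compared later. A minimal sketch, using a toy dataframe with placeholder columns (API as of whylogs v1):

    import pandas as pd
    import whylogs as why

    # Toy batch; in practice this would be a real batch of production data
    df = pd.DataFrame({"alcohol": [9.4, 10.2, 11.1], "pH": [3.51, 3.20, 3.26]})

    # Log the dataframe into a lightweight statistical profile
    results = why.log(df)
    profile_view = results.view()

    # Inspect the summarized statistics (counts, means, quantile sketches, ...)
    print(profile_view.to_pandas())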

Session 3 - Inspecting and Comparing Distributions with whylogs (15min + 10min Q&A)

In this session, we will explore whylogs’ visualization module and its capabilities, using the Wine Quality dataset as a use case to demonstrate distribution shift. We will first generate statistical summaries with whylogs and then visualize and compare the resulting profiles with the visualization module.

This is a hands-on notebook session.
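The notebook uses whylogs’ NotebookProfileVisualizer roughly as follows. A sketch, assuming two profile views (reference_view and target_view, hypothetical names) have already been logged from a baseline batch and a current batch of the Wine Quality data:

    from whylogs.viz import NotebookProfileVisualizer

    # reference_view / target_view: profile views logged beforehand with why.log(...)
    visualization = NotebookProfileVisualizer()
    visualization.set_profiles(
        target_profile_view=target_view,
        reference_profile_view=reference_view,
    )

    # Per-feature drift summary comparing the two profiles
    visualization.summary_drift_report()

    # Overlaid histograms for a single feature
    visualization.double_histogram(feature_name="alcohol")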

Session 4 - Data Validation (15min + 5min Q&A)

As discussed in previous sessions, data validation plays a critical role in detecting changes in your data. In this session, we will introduce the concept of constraints - ways to express expectations about your data - and how to apply them to ensure its quality.

This is a hands-on notebook session.
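As a preview, whylogs builds such constraints on top of a profile. A sketch using whylogs v1 factory helpers; the column name is a placeholder, and the exact reporting helpers may vary across versions:

    import pandas as pd
    import whylogs as why
    from whylogs.core.constraints import ConstraintsBuilder
    from whylogs.core.constraints.factories import greater_than_number

    # Profile a toy batch (placeholder column)
    df = pd.DataFrame({"alcohol": [9.4, 10.2, 11.1]})
    profile_view = why.log(df).view()

    # Express an expectation: alcohol content should always be above zero
    builder = ConstraintsBuilder(dataset_profile_view=profile_view)
    builder.add_constraint(greater_than_number(column_name="alcohol", number=0))
    constraints = builder.build()

    # True only if every constraint passes against the profiled statistics
    print(constraints.validate())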


Prior Knowledge Expected

No previous knowledge expected

Felipe is a Data Scientist at WhyLabs. He is a core contributor to whylogs, an open-source data logging library, and focuses on writing technical content and expanding the whylogs library in order to make AI more accessible, robust, and responsible. Previously, Felipe was an AI Researcher at WEG, where he researched and deployed Natural Language Processing approaches to extract knowledge from textual information about electric machinery. He holds a Master’s degree in Electronic Systems Engineering from UFSC (Universidade Federal de Santa Catarina), with research focused on developing and deploying machine-learning-based fault detection strategies for unmanned underwater vehicles. Felipe has published a series of blog articles about MLOps, monitoring, and Natural Language Processing in publications such as Towards Data Science, Analytics Vidhya, and Google Cloud Community.