PyData Global 2022

Felipe de Pontes Adachi

Felipe is a Data Scientist at WhyLabs. He is a core contributor to whylogs, an open-source data logging library, and focuses on writing technical content and expanding the whylogs library to make AI more accessible, robust, and responsible. Previously, Felipe was an AI Researcher at WEG, where he researched and deployed Natural Language Processing approaches to extract knowledge from textual information about electric machinery. He also holds a Master's degree in Electronic Systems Engineering from UFSC (Universidade Federal de Santa Catarina), where his research focused on developing and deploying machine-learning-based fault detection strategies for unmanned underwater vehicles. Felipe has published a series of blog articles about MLOps, monitoring, and Natural Language Processing in publications such as Towards Data Science, Analytics Vidhya, and Google Cloud Community.


Sessions

12-03
13:00
90min
Visually Inspecting Data Profiles for Data Distribution Shifts
Felipe de Pontes Adachi

The real world is a constant source of ever-changing, non-stationary data, which means that even the best ML models will eventually go stale. Data distribution shifts, in all of their forms, are one of the major post-production concerns for any ML/data practitioner. As organizations increasingly rely on ML to improve performance, and expect it to work as intended outside of the lab, the need for efficient debugging and troubleshooting tools in ML operations also grows. Meeting that need is especially challenging given common production requirements such as scalability, privacy, security, and real-time processing.

In this talk, Data Scientist Felipe Adachi will cover different types of data distribution shifts and how they can affect your ML application. He will then discuss the challenges of detecting distribution shifts in a lightweight and scalable manner by calculating approximate statistics for drift measurements. Finally, he will walk through practical steps that data scientists and ML engineers can take to surface data distribution shift issues, such as visually inspecting histograms, applying statistical tests, and ensuring quality with data validation checks.
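To make the ideas in the abstract concrete, here is a minimal sketch of two of the techniques it names: comparing histograms and applying a statistical test. It is not the workshop's material and does not use whylogs. It computes an exact two-sample Kolmogorov-Smirnov statistic on raw synthetic samples using only the Python standard library, whereas the talk describes working from approximate statistics. The function names, the synthetic data, and the `DRIFT_THRESHOLD` value are all assumptions chosen for illustration.

```python
import random
from bisect import bisect_right


def ks_statistic(reference, target):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, tgt = sorted(reference), sorted(target)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # occurs at one of the observed data points.
    for x in ref + tgt:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_tgt = bisect_right(tgt, x) / len(tgt)
        gap = max(gap, abs(cdf_ref - cdf_tgt))
    return gap


def histogram(values, low, high, bins=10):
    """Fraction of values in each of `bins` equal-width buckets over [low, high]."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        # Clamp out-of-range values into the first or last bucket.
        idx = min(max(int((v - low) / width), 0), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


# Synthetic data (an assumption for illustration): a reference batch,
# a batch from the same distribution, and a batch whose mean has shifted.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]
same = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(2000)]

ks_same = ks_statistic(reference, same)
ks_shifted = ks_statistic(reference, shifted)

# Hypothetical validation check: flag drift when the KS statistic
# exceeds a threshold picked by hand for this example.
DRIFT_THRESHOLD = 0.1
drift_same = ks_same > DRIFT_THRESHOLD
drift_shifted = ks_shifted > DRIFT_THRESHOLD

# Histograms on a shared range, ready for side-by-side plotting.
hist_reference = histogram(reference, -4.0, 5.0)
hist_shifted = histogram(shifted, -4.0, 5.0)

if __name__ == "__main__":
    print(f"KS (same distribution): {ks_same:.3f}, drift flagged: {drift_same}")
    print(f"KS (shifted mean):      {ks_shifted:.3f}, drift flagged: {drift_shifted}")
```

Plotting `hist_reference` against `hist_shifted` gives the kind of visual check the abstract mentions, while the KS statistic turns the same comparison into a single number that a validation check can act on. In practice the threshold would be tuned to the sample size and the tolerance of the application, or replaced with a p-value from a statistics library.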

Requirements: Access to Google Colab Environment

Additional Material: https://colab.research.google.com/drive/1xOcAq8NwPazmQFhXVEvzRxXw5LiFqvfj?usp=sharing

Workshop/Tutorial I