PyData Global 2022

Allan Campopiano

As a data scientist at Deepnote, I have the privilege of partnering with developers all over the world in order to help them promote their tools to the broader scientific community. By demonstrating the leading data science tools in Deepnote, scientists and developers can easily onboard to new concepts and techniques.

My degree in cognitive and behavioural neuroscience helped me realize my dual passion for (1) developing scientific software and (2) communicating technical concepts in a straightforward manner. My main goal is to find creative ways to lower the barrier-to-entry for scientists who are learning new tools.

To this end, I've published two peer-reviewed statistical software libraries. The most notable is Hypothesize, a Python library for robust statistics based on Rand Wilcox's R package. I continue to deliver workshops on robust statistics, data visualization, and data science tooling in general.


Sessions

12-01
20:30
90min
Lightning Talks
Brian Skinn, Kacper Łukawski, Kurt Schelfthout, Richard Lee, Allan Campopiano, Eyal Kazin, Ziheng Wang, Caroline Arnold

Lightning Talks are short 5-10 minute sessions presented by community members on a variety of interesting topics.

Lightning Talks
Talk Track II
12-02
17:00
30min
Machine Learning in the Warehouse with Python
Allan Campopiano

Moving data in and out of a warehouse is both tedious and time-consuming. In this talk, we will demonstrate a new approach using the Snowpark Python library. Snowpark for Python is a new, Pythonic interface to Snowflake warehouses: it lets you query data through DataFrames instead of SQL strings, use open-source packages, and run your model without moving your data out of the warehouse. We will discuss the framework and show how data scientists can design and train a model end to end, upload it to a warehouse, and append new predictions, all from notebooks.

Talk Track II
10min
It might look normal but this distribution will ruin your stats
Allan Campopiano

Some refer to the normal distribution as "God's curve" because of its supposed presence in nature when enough observations are collected. But what if I told you that there is a non-normal distribution that looks so normal that even experts can't see the difference? And beyond looks, it's a curve that is both prevalent in nature and likely to cause false negatives when testing hypotheses.

May I introduce you to the "contaminated normal distribution"? It's bell-shaped. It's symmetrical. And other than slightly heavier tails, it is virtually indistinguishable from a normal distribution to the naked eye. (I'll be testing the audience on this!)
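To make the idea concrete, here is a minimal sketch of one common way to build a contaminated normal: a mixture that draws from a standard normal most of the time and from a much wider normal occasionally. The mixing proportion (`eps=0.1`), the wide component's standard deviation (`scale=10.0`), and the helper name `contaminated_normal` are illustrative assumptions, not values taken from the talk.

```python
import random
import statistics


def contaminated_normal(n, eps=0.1, scale=10.0, rng=random):
    """Draw n values: N(0, 1) with probability 1 - eps, else N(0, scale**2).

    eps and scale are assumed, textbook-style defaults for illustration.
    """
    return [
        rng.gauss(0.0, scale if rng.random() < eps else 1.0)
        for _ in range(n)
    ]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
contaminated_sample = contaminated_normal(100_000, rng=rng)

normal_var = statistics.pvariance(normal_sample)
contaminated_var = statistics.pvariance(contaminated_sample)

# Under these assumed settings the mixture's theoretical variance is
# (1 - eps) * 1 + eps * scale**2 = 0.9 + 10 = 10.9, versus 1 for the
# standard normal, even though 90% of the draws come from N(0, 1).
print(f"normal variance:       {normal_var:.2f}")
print(f"contaminated variance: {contaminated_var:.2f}")
```

Under these assumptions the bulk of the curve still looks like a standard normal, but the rare wide draws inflate the variance roughly tenfold, which is what makes the distribution troublesome for variance-based statistics.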

As you may remember, many inferential statistical tests and ML algorithms assume that distributions are normal. Violating that assumption can lead to uninterpretable results. Over the past half century, many important discoveries have been made that call into question the usefulness of models based on the normal curve (see Campopiano & Wilcox, 2020; Wilcox, 2013). In fact, distributions that produce outliers, such as the contaminated normal, are likely to be encountered in practice.

In this beginner-friendly stats talk, I will show you a normal curve and a contaminated normal curve to see if you can tell which is which! More importantly, you will learn (1) how contamination affects the probability of finding an effect in our experiments (statistical power) and (2) which Python tools you can use to protect against contamination so that your results are squeaky clean.
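The effect of contamination on statistical power can be illustrated with a small simulation. The sketch below is an assumption-laden illustration rather than the talk's own material: it uses two groups of 25, a true mean difference of 1, a 10% contamination rate with a wide component of standard deviation 10, and a fixed critical value of 2.01 as an approximation to the two-sided 5% cutoff for Welch's t-test. All function names are hypothetical.

```python
import random
import statistics


def draw(n, shift, eps, rng, scale=10.0):
    """n values from a (possibly contaminated) normal centred on `shift`."""
    return [
        shift + rng.gauss(0.0, scale if rng.random() < eps else 1.0)
        for _ in range(n)
    ]


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = (statistics.variance(a) / len(a)
          + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def estimate_power(eps, n=25, shift=1.0, reps=2000, crit=2.01, seed=0):
    """Share of simulated experiments in which |t| exceeds `crit`.

    crit=2.01 approximates the two-sided 5% critical value for about
    48 degrees of freedom; it is a simplification, not an exact test.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        a = draw(n, shift, eps, rng)  # group with a true effect
        b = draw(n, 0.0, eps, rng)    # comparison group
        if abs(welch_t(a, b)) > crit:
            hits += 1
    return hits / reps


power_normal = estimate_power(eps=0.0)        # no contamination
power_contaminated = estimate_power(eps=0.1)  # 10% contamination

print(f"power, normal data:       {power_normal:.2f}")
print(f"power, contaminated data: {power_contaminated:.2f}")
```

With these settings the same true effect should be detected far less often once the data are contaminated, because the inflated variance swamps the difference in means. Robust alternatives such as trimmed means are designed to limit that loss.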