PyData Global 2022

Knowing what you don’t know matters: Uncertainty-aware model rating
12-01, 11:30–12:00 (UTC), Talk Track I

Meaningful probabilistic models not only produce a “best guess” for the target, but also convey their uncertainty, i.e., a belief about how the target is distributed around the predicted estimate. Business evaluation metrics such as the mean absolute error neglect, a priori, that unavoidable uncertainty. This talk discusses why and how to account for uncertainty when evaluating models with traditional business metrics, using standard python tooling. The resulting uncertainty-aware model rating satisfies the requirements of statisticians because it accounts for the probabilistic process that generates the target. It pleases practitioners because it is based on established business metrics. And it appeases executives because it allows for concrete quantitative goals and non-defensive judgements.


This talk will equip you - a Data Scientist or a person working with Data Scientists - with the background and tooling necessary to rate models in an uncertainty-aware fashion. You’ll learn to establish “best-case” and “worst-case” benchmarks and judge your models against them. This will help you answer the question “how good is the model, and how good could it become?” in a non-defensive way, beyond merely computing standard evaluation metrics.
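
To make the idea of a “best-case” and a “worst-case” benchmark concrete, here is a minimal sketch of how such benchmarks could be simulated with numpy and scipy. The Poisson demand model, the gamma-distributed item means, and the global-mean baseline are illustrative assumptions, not the setup used in the talk.

```python
# Hedged sketch: uncertainty-aware "best-case" and naive-baseline MAE benchmarks.
# The Poisson assumption and all numbers below are illustrative.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(42)

# Pretend these are the (in practice unknown) true mean demands of 1000 items.
true_means = rng.gamma(shape=2.0, scale=3.0, size=1000)

# One period of realized sales, generated by the assumed Poisson process.
realized = rng.poisson(true_means)

def mae(prediction, actual):
    return np.mean(np.abs(prediction - actual))

# Best case: an oracle that predicts the MAE-optimal point forecast
# (the median of each item's Poisson distribution).
oracle_prediction = poisson.median(true_means)
best_case_mae = mae(oracle_prediction, realized)

# Naive baseline ("worst case" any useful model should beat): the global mean.
baseline_prediction = np.full_like(realized, realized.mean(), dtype=float)
baseline_mae = mae(baseline_prediction, realized)

print(f"best-case MAE: {best_case_mae:.2f}")
print(f"baseline MAE:  {baseline_mae:.2f}")
# A candidate model's MAE is judged relative to this range, not against zero:
# even the oracle cannot reach an MAE of 0 because the target is random.
```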

Why is uncertainty-awareness important? Well, would you, as a Data Scientist, agree to be evaluated on reducing the mean absolute error of some model by 50%? Probably not - but would you feel comfortable explaining why? This talk will establish why such generic, ad hoc goal setting is not meaningful, why model judgement is harder than expected, and how it can still be done reliably and without too many technicalities. We will exemplify uncertainty-aware model rating using the M5 competition data (Walmart sales numbers). Some immediate interpretations (“model A is clearly better than model B”) will turn out to be flawed upon closer inspection, and we’ll see how to correct them using standard python libraries (numpy, scipy, pandas).
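
As one possible way to check whether “model A is clearly better than model B” survives closer inspection, the sketch below runs a paired bootstrap over per-item absolute errors. The synthetic error data and the resampling scheme are assumptions for illustration; the talk’s own analysis of the M5 data may proceed differently.

```python
# Hedged sketch: how noisy is the MAE comparison between two models?
# Per-item errors are made up; the paired bootstrap keeps items aligned
# so that the comparison reflects the same test set for both models.
import numpy as np

rng = np.random.default_rng(0)
n_items = 500

# Per-item absolute errors of two hypothetical models on the same test set.
errors_a = rng.gamma(shape=2.0, scale=1.0, size=n_items)
errors_b = np.clip(errors_a + rng.normal(loc=0.05, scale=0.8, size=n_items), 0.0, None)

print(f"MAE model A: {errors_a.mean():.3f}")
print(f"MAE model B: {errors_b.mean():.3f}")

# Resample items with replacement and look at the distribution of the MAE gap.
n_boot = 10_000
idx = rng.integers(0, n_items, size=(n_boot, n_items))
gap = errors_b[idx].mean(axis=1) - errors_a[idx].mean(axis=1)

low, high = np.percentile(gap, [2.5, 97.5])
print(f"95% bootstrap interval for MAE(B) - MAE(A): [{low:.3f}, {high:.3f}]")
# If this interval contains 0, the apparent advantage of model A may be noise.
```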

Takeaways:
- We are typically too self-confident in our skills when it comes to judging models. Instead of jumping to immediate conclusions (“that 80%-error model is bad!”), we should take a step back, build a “reasonable best case”, and benchmark the candidate against that.
- Accepting and dealing with uncertainty is a strength; neglecting it is madness. Uncertainty-aware model rating allows us to make reliable statements about “how good the model really is”, without “it depends” and “buts”.
- Requirements from statisticians and business stakeholders can be reconciled by taking both seriously. Standard python tooling suffices to improve our modelling of what we know that we don't know (a short sketch follows below).
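
As a small illustration of that last point, the sketch below uses a hypothetical negative binomial predictive belief from scipy.stats and shows that the point forecast minimizing the expected absolute error is the distribution’s median rather than its mean - one concrete way the probabilistic view and the business metric meet.

```python
# Hedged sketch: reconciling a probabilistic belief with a business metric.
# The negative binomial belief and its parameters are assumptions for illustration.
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(1)

# One item's predictive belief: overdispersed, mean = n * (1 - p) / p = 8.
n, p = 2, 0.2
belief = nbinom(n, p)

samples = belief.rvs(size=100_000, random_state=rng)

# Search over integer point forecasts for the one with lowest expected MAE.
candidates = np.arange(0, 40)
expected_mae = [np.mean(np.abs(c - samples)) for c in candidates]
best_point = candidates[np.argmin(expected_mae)]

print(f"mean of belief:   {belief.mean():.1f}")
print(f"median of belief: {belief.median():.1f}")
print(f"MAE-optimal point forecast: {best_point}")
# The MAE-optimal "best guess" lands at (or right next to) the median, not the
# mean - so which single number to report depends on the metric one is judged by.
```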


Prior Knowledge Expected: Previous knowledge expected

After pursuing his PhD and postdoc research in theoretical quantum physics, Malte joined Blue Yonder as a Data Scientist in 2015. Since then, he has led numerous external and internal projects, all of which involved programming in python, creating, working with, and evaluating probabilistic predictions, and communicating the results achieved.