PyData Global 2022

Interpretable and realistic generative models in data science? Likelihood-free Bayes’ says yes!
12-01, 10:00–10:30 (UTC), Talk Track II

Are you fascinated by the real-life images or text produced by deep generative models but cannot interpret their underlying data generation process or see how they can be applied to other problems? I will talk about generative simulations built using knowledge of the problem domain that can produce realistic data in a variety of scenarios. This talk will be a Bayesian thinking exercise cum data science case study of product star rating timeseries from an online marketplace (like Amazon.com) – I will show how we use recent advances in likelihood-free Bayesian inference together with a detailed simulation of an online marketplace to directly infer factors involved in how customers purchase and rate products.


In recent years, generative models have attracted a lot of attention in the search for ML solutions that are both interpretable/explainable and robust. The hope is that, on the one hand, a model that generates realistic-looking datasets can also give insights into the real-world data generating process, while on the other hand, such a model can be robust to outliers because it is aware of the kinds of data it can or cannot generate. While deep generative models of images and text have been incredibly successful at the data generation task itself, their underlying neural network structure makes it challenging to interpret how they generate data, or to reuse them inside a discriminative ML system that is also robust.

In this talk, I will discuss how realistic and detailed simulations of the problem domain might actually be the generative models we are looking for in interpretable and robust ML systems. Simulation models are common in the sciences (e.g. simulations of planetary systems in astrophysics, agent-based simulations in the social sciences, neural circuit simulations in neuroscience) and in engineering (e.g. aerodynamic simulations in mechanical engineering), and their parameters correspond directly to important causal factors in the data generating process. However, the “inverse problem” of inferring model parameters from observed data typically follows a computationally intensive, grid-search-like process, and it is unclear how simulation models can be used in the discriminative ML tasks that dominate the world of real-life data science problems.
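To make the "inverse problem" concrete, here is a minimal sketch of a grid-search-like inversion of a toy simulator. Everything in it is an illustrative assumption and not the model from the talk: the simulator has a single hypothetical parameter `p_satisfied` (the probability that a customer is satisfied), and the functions `simulate_ratings`, `mean_rating` and `grid_search` are made-up names.

```python
import random


def simulate_ratings(p_satisfied, n_customers=200, seed=None):
    """Toy marketplace (an assumption, not the talk's model): each customer
    is satisfied with probability p_satisfied; satisfied customers leave
    4 or 5 stars, the others leave 1 to 3 stars."""
    rng = random.Random(seed)
    ratings = []
    for _ in range(n_customers):
        if rng.random() < p_satisfied:
            ratings.append(rng.choice([4, 5]))
        else:
            ratings.append(rng.choice([1, 2, 3]))
    return ratings


def mean_rating(ratings):
    """Summary statistic used to compare simulated and observed data."""
    return sum(ratings) / len(ratings)


def grid_search(observed, grid, n_repeats=20):
    """Brute-force inversion: simulate at every grid point and keep the
    parameter whose average simulated mean rating is closest to the data."""
    target = mean_rating(observed)
    best_p, best_err = None, float("inf")
    for i, p in enumerate(grid):
        sims = [
            mean_rating(simulate_ratings(p, len(observed), seed=1000 * i + r))
            for r in range(n_repeats)
        ]
        err = abs(sum(sims) / n_repeats - target)
        if err < best_err:
            best_p, best_err = p, err
    return best_p


# "Observed" data generated with a known parameter so the estimate can be checked.
observed = simulate_ratings(0.7, n_customers=500, seed=0)
grid = [k / 20 for k in range(21)]  # 0.00, 0.05, ..., 1.00
p_hat = grid_search(observed, grid)
print("grid-search estimate of p_satisfied:", p_hat)
```

Even in this one-parameter toy, the cost is (grid points) x (repeats) simulator runs for a single product, and the result is a point estimate with no measure of uncertainty. The cost grows quickly with the number of parameters, which is the bottleneck described above.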

I will go through our work on Bayesian inference in agent-based simulations of customers purchasing products on an online marketplace like Amazon.com to:
1.) Motivate the need for generative models in modern ML systems and show how generative simulations have interpretability built in (~10 mins).
2.) Show how recent advances in likelihood-free Bayesian inference with deep learning can infer posterior distributions over interpretable parameters at scale across the entire catalog of an online marketplace (~10 mins).
3.) Use the inferred posteriors for posterior predictive simulations in rating timeseries prediction and abnormal (or fake) rating classification tasks (~10 mins).
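
The three steps above can be illustrated with a small, self-contained sketch. Note the assumptions: it uses plain rejection ABC (approximate Bayesian computation), the simplest likelihood-free method, and not the deep-learning-based inference the talk covers. The toy simulator, its single parameter `p_satisfied`, the uniform prior, the tolerance and all function names are hypothetical choices made only for illustration.

```python
import random


def simulate_ratings(p_satisfied, n_customers, rng):
    """Toy simulator (an assumption): satisfied customers leave 4 or 5 stars,
    the others leave 1 to 3 stars."""
    return [
        rng.choice([4, 5]) if rng.random() < p_satisfied else rng.choice([1, 2, 3])
        for _ in range(n_customers)
    ]


def mean_rating(ratings):
    """Summary statistic used to compare simulated and observed data."""
    return sum(ratings) / len(ratings)


def abc_rejection(observed, n_draws=5000, tolerance=0.05, seed=0):
    """Rejection ABC: draw a parameter from the prior, simulate, and keep the
    draw if the simulated summary is within `tolerance` of the observed one.
    No likelihood function is ever evaluated."""
    rng = random.Random(seed)
    target = mean_rating(observed)
    accepted = []
    for _ in range(n_draws):
        p = rng.random()  # Uniform(0, 1) prior on p_satisfied
        sim = simulate_ratings(p, len(observed), rng)
        if abs(mean_rating(sim) - target) <= tolerance:
            accepted.append(p)
    return accepted


def posterior_predictive_means(posterior, n_customers, n_sims=500, seed=1):
    """Posterior predictive: re-simulate with parameters drawn from the
    approximate posterior and record the resulting mean ratings."""
    rng = random.Random(seed)
    return [
        mean_rating(simulate_ratings(rng.choice(posterior), n_customers, rng))
        for _ in range(n_sims)
    ]


# Step 1: the simulator parameter is directly interpretable.
observed = simulate_ratings(0.7, 300, random.Random(42))

# Step 2: approximate posterior over the parameter.
posterior = abc_rejection(observed)
posterior_mean = sum(posterior) / len(posterior)

# Step 3: posterior predictive interval, used to flag an unusual batch.
predicted = sorted(posterior_predictive_means(posterior, 300))
low = predicted[int(0.025 * len(predicted))]
high = predicted[int(0.975 * len(predicted))]
suspicious_batch_mean = 4.9  # hypothetical batch of almost all 5-star ratings
is_suspicious = not (low <= suspicious_batch_mean <= high)

print("accepted draws:", len(posterior))
print("posterior mean of p_satisfied:", round(posterior_mean, 3))
print("95% predictive interval for the mean rating:", (low, high))
print("batch flagged as abnormal:", is_suspicious)
```

Rejection ABC discards most simulations and has to be rerun for every product, so it does not scale to a full catalog. That limitation is one reason to look at neural approaches that learn the posterior from simulations, as in step 2.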

This talk will appeal to both beginning and seasoned data science practitioners. For beginners, the first half of the talk will highlight the importance of generative models for data science problems, show how to build such a model for an online marketplace and explain fundamental Bayesian concepts like posterior inference and posterior predictive sampling. For more experienced data scientists, the second half will go into the details of likelihood-free Bayesian inference for posterior estimation and how deep probabilistic neural networks can be used for this task. Throughout the talk, I will show code snippets from our (in development) Python package for simulating online marketplaces and inferring their parameters, which we hope attendees can use directly in their own daily data science work.


Prior Knowledge Expected

No previous knowledge expected

I am a Research Scientist at Philips Research in the Netherlands and have wide interests in Bayesian statistics, probabilistic modelling and open source development (and how bicycles are central to sustainable cities!). Please visit my website to learn more about me or to get in touch: https://narendramukherjee.github.io/