PyData Global 2022

Missing Data in the Age of Machine Learning
12-02, 19:00–20:30 (UTC), Workshop/Tutorial II

Machine learning algorithms, especially artificial neural networks, are not tolerant of missing data. Many practitioners simply remove records with missing fields without considering the statistical bias this might introduce. The field of imputation has matured: modern imputation methods not only predict missing values but also reflect the uncertainty in those predictions. Traditional statistical estimators make full use of the benefits offered by advanced imputation techniques. This tutorial illustrates techniques and architectures that incorporate advanced imputation techniques into machine learning pipelines, including those built on artificial neural networks.


  • The tutorial illustrates how artificial neural networks can be used as missing data imputers. The description of these neural network imputers shows how susceptible a neural network model is to bias introduced by the treatment of missing values. The tutorial then illustrates how imputation techniques can be incorporated into a machine learning pipeline. Finally, several architectures and pipeline techniques are demonstrated, showing how the statistical uncertainty captured by techniques such as multiple imputation can be incorporated into the training and operation of a machine learning model.
  • This tutorial is intended for participants with an intermediate-level understanding of Python and a basic understanding of pandas. Some knowledge of scikit-learn, Keras, and TensorFlow is useful but not required. The majority of the tutorial uses Jupyter notebooks, so participants should be comfortable in that environment.

The curriculum outline is as follows:
* Neural Network Autoencoder Imputers
  * Participants will apply denoising autoencoders to simple imputation tasks.
  * Autoencoder techniques require an initial imputation. Various initial imputations are applied, allowing participants to see the biasing effects of imputation on a neural network model.
* Building Neural Network Pipelines with Imputers
  * Participants will build machine learning pipelines that address missing data. Simple example pipelines are built using scikit-learn's pipelines.
  * These pipelines accommodate the three phases of machine learning usage: training, testing, and production.
* Incorporation of Multiple Imputation
  * Multiple imputation captures the uncertainty in predicted missing values by providing a number of possible imputed values for each missing value. Several methods of incorporating multiple imputed values are demonstrated in the tutorial.
  * Methods demonstrated include data augmentation and ensembles.
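
To give a flavor of the pipeline and multiple imputation topics in the outline, here is a minimal sketch; it is not taken from the tutorial materials. It assumes scikit-learn's `IterativeImputer` with `sample_posterior=True` as the source of multiple imputations and a logistic regression as a stand-in for a neural network. The synthetic data and the helper names `fit_ensemble` and `predict_proba_ensemble` are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): three correlated features, binary target.
n = 400
z = rng.normal(size=n)
X = np.column_stack([z + 0.3 * rng.normal(size=n) for _ in range(3)])
y = (z > 0).astype(int)

# Remove roughly 20% of the entries completely at random.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.2] = np.nan

X_train, y_train = X_missing[:300], y[:300]
X_test, y_test = X_missing[300:], y[300:]


def fit_ensemble(X, y, n_imputations=5):
    """Fit one imputer + model pipeline per imputation draw."""
    members = []
    for seed in range(n_imputations):
        pipe = make_pipeline(
            # sample_posterior=True draws imputed values from a predictive
            # distribution, so each seed yields a different plausible imputation.
            IterativeImputer(sample_posterior=True, random_state=seed),
            LogisticRegression(),
        )
        members.append(pipe.fit(X, y))
    return members


def predict_proba_ensemble(members, X):
    """Average class probabilities across the ensemble members."""
    return np.mean([m.predict_proba(X) for m in members], axis=0)


members = fit_ensemble(X_train, y_train)
proba = predict_proba_ensemble(members, X_test)
accuracy = np.mean(proba.argmax(axis=1) == y_test)
print(f"ensemble accuracy on held-out rows: {accuracy:.2f}")
```

Because the imputer lives inside each pipeline, the same object handles missing values at training, testing, and production time. Averaging over several imputation draws is one simple way to let the uncertainty in the imputed values carry through to the final prediction.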


Prior Knowledge Expected

Previous knowledge expected

Dr. Haw-minn Lu is currently a Principal Data Scientist for Data Science/Machine Learning at West Health Institute in La Jolla, a nonprofit medical research organization. Dr. Lu earned his PhD in 1998 from the Electrical and Computer Engineering Department at the University of California, San Diego after receiving SM and SB degrees from the Massachusetts Institute of Technology.

Dr. Lu has been doing machine learning for over 25 years. His interests include machine learning, interactive visualization, data imputation/anonymization, and computing infrastructure.

Prior to joining West Health, Dr. Lu was involved in several startups using Python as the core infrastructure for applications such as ecommerce, network communications, and digital animation.

PhD Student @ UCSD