PyData Global 2022

Why we do ML model retraining wrong, and how to do better
12-02, 10:00–10:30 (UTC), Talk Track I

Machine learning models degrade over time, so you need to update and retrain them regularly. However, the choice of maintenance approach is often arbitrary: models are simply retrained on a schedule or after every new batch. This can lead to suboptimal performance or wasted resources. In this talk, I will discuss how we can do better, from estimating the speed of model decay in advance to constructing a proper evaluation set.


Once you create a machine learning model and put it into production, the work does not stop. Model quality might degrade over time, so you need to keep an eye on it and retrain or update the model accordingly.

However, data scientists often do not give much thought to this maintenance process. Some models are never updated, while others are updated on an arbitrary schedule or every time a new batch of data arrives. Each approach has its pitfalls. You might keep an underperforming model in production without knowing it, leave potential model improvements on the table, or waste resources on an unnecessary update.

Based on the experience of deploying and maintaining ML models in production, I will discuss different approaches to model retraining and how we can improve:
* Schedule-based retraining. How to estimate the optimal retraining cadence in advance by performing experiments on the training data.
* Trigger-based retraining. What is wrong with blind model retraining once you detect drift, and how to assess whether your new data batch is good enough.
* Model retraining process. How to decide between model retraining and an update, whether you should drop the old data, and how to construct a proper evaluation set.
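
As a rough illustration of the first point, the sketch below estimates how quickly a model decays using only historical data: it fits on the earliest time-ordered batches, scores every later batch without refitting, and reports how many batches pass before the error exceeds a tolerance. The function name `estimate_decay`, the `tolerance` parameter, the least-squares model, and the synthetic drifting data are all assumptions made for this example; they are not taken from the talk.

```python
import numpy as np


def estimate_decay(X_batches, y_batches, n_train_batches, tolerance):
    """Fit on the first n_train_batches, then score each later batch as-is.

    Returns (errors, horizon): the per-batch mean absolute error on the later
    batches, and the number of batches before the error first exceeds
    `tolerance` times the error on the first held-out batch.
    """
    X_train = np.vstack(X_batches[:n_train_batches])
    y_train = np.concatenate(y_batches[:n_train_batches])

    # Ordinary least squares with an intercept column (illustrative model).
    A = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

    errors = []
    for X, y in zip(X_batches[n_train_batches:], y_batches[n_train_batches:]):
        pred = np.column_stack([X, np.ones(len(X))]) @ coef
        errors.append(float(np.mean(np.abs(y - pred))))

    baseline = errors[0]
    horizon = len(errors)  # default: never exceeded the tolerance
    for i, err in enumerate(errors):
        if err > tolerance * baseline:
            horizon = i
            break
    return errors, horizon


# Synthetic, time-ordered batches whose true slope drifts upward (assumption).
rng = np.random.default_rng(0)
X_batches, y_batches = [], []
for t in range(12):
    X = rng.normal(size=(500, 1))
    y = (1.0 + 0.2 * t) * X[:, 0] + rng.normal(scale=0.1, size=500)
    X_batches.append(X)
    y_batches.append(y)

errors, horizon = estimate_decay(
    X_batches, y_batches, n_train_batches=3, tolerance=2.0
)
print("MAE per later batch:", [round(e, 3) for e in errors])
print("Batches before error doubles:", horizon)
```

Under these assumptions, `horizon` is a first guess at the retraining cadence in units of batches. Repeating the experiment from several starting points in the history would show how stable that estimate is.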


Prior Knowledge Expected

Previous knowledge expected

Emeli Dral is a Co-founder and CTO at Evidently AI, a startup developing open-source tools to evaluate, test, and monitor the performance of machine learning models.

Earlier, she co-founded an industrial AI startup and served as Chief Data Scientist at Yandex Data Factory. She led over 50 applied ML projects for various industries, from banking to manufacturing. Emeli is a data science lecturer at GSOM SpBU and Harbour.Space University. She is a co-author of the Machine Learning and Data Analysis curriculum at Coursera with over 100,000 students.