PyData Global 2022

Real-world Perspectives to Avoid the Worst Mistakes using Machine Learning in Science
12-02, 13:00–15:00 (UTC), Workshop/Tutorial I

Numerous scientific disciplines have noticed a reproducibility crisis of published results. While this important topic was being addressed, the danger of non-reproducible and unsustainable research artefacts using machine learning in science arose. The brunt of this has been avoided by better education of reviewers who nowadays have the skills to spot insufficient validation practices. However, there is more potential to further ease the review process, improve collaboration and make results and models available to fellow scientists. This workshop will teach practical lessons that can be directly applied to elevate the quality of ML applications in science by scientists.

It seems that we have avoided the worst signs of the reproducibility crisis when applying machine learning in science, thanks to better education for reviewers, easier access to tools, and a better understanding of zero-knowledge models.

However, there is much more potential for ML in science. The real world comes with many pitfalls: the application of machine learning is very promising, but the verification of scientific results is complex. Nevertheless, many open-source contributors in the field have worked hard to develop practices and resources to ease this process.

We discuss pitfalls and solutions in model evaluation, where the choice of appropriate metrics and adequate splits of the data is important. We discuss benchmarks, testing, and machine learning reproducibility, where we go into detail on pipelines. Pipelines are a great showcase of how to avoid the main reproducibility pitfalls, as well as a tool to bridge the gap between ML experts and domain scientists. Interaction with domain scientists, involving existing knowledge, and communication are a constant undercurrent in producing trustworthy, validated, and reliable machine learning solutions.
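As a minimal sketch of the pipeline idea mentioned above (not taken from the workshop material), the following standard-library example fixes the random seed of the data split and fits a preprocessing step on the training data only, so that no information from the test set leaks into training. The names `Standardizer` and `split_train_test` are hypothetical and chosen for illustration.

```python
import random
import statistics


class Standardizer:
    """Hypothetical preprocessing step: scale to zero mean and unit variance."""

    def fit(self, values):
        # Statistics are learned from the data passed in (the training split).
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        # The stored training statistics are reused for any later data.
        return [(v - self.mean_) / self.std_ for v in values]


def split_train_test(values, test_fraction=0.25, seed=42):
    """Shuffle with a fixed seed so the split is the same on every run."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = [float(x) for x in range(20)]
train, test = split_train_test(data)

# Fit on the training split only, then apply the same scaling to both splits.
scaler = Standardizer().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
```

Fitting the scaler on the full dataset before splitting would be the leaky variant this structure is meant to prevent.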

Overall, this workshop relies on existing high-quality resources like the Turing Way, more applied tutorials like Jesper Dramsch’s EuroSciPy tutorial on ML reproducibility, and professional tools like the Ersilia Hub. We use real-world examples from different scientific disciplines, e.g. weather and biomedicine.

In this workshop, we present a series of talks from invited speakers who are experts in the application of data science and machine learning to real-world problems. Each talk will be followed by an interactive session that turns the theory into practical examples the participants can directly implement to improve their own research. Finally, we close with a discussion that invites active participation and engagement with the speakers as a group.


Time    Topic                                      Speaker
5 min   Opening of workshop                        Jesper Dramsch
20 min  Why and how make ML reproducible?          Jesper Dramsch
25 min  Evaluating Machine Learning Models         Valerio Maggio
10 min  ML for scientific insight                  Mike Walmsley
10 min  Break & Chat
10 min  Testing in Machine Learning                Goku Mohandas
25 min  Integrating ML in experimental pipelines   Gemma Turon
10 min  Discussion & Audience Questions            All Speakers
5 min   Closing                                    Jesper Dramsch



Overview Talk: Why and how make ML reproducible? (Jesper Dramsch)

The overview talk sets the scene and presents different areas where researchers can increase the quality of their research artefacts that use ML. These increases in quality are achieved by using existing solutions, which minimizes the impact these methods have on researcher productivity.

This talk loosely covers the topics Jesper discussed in their EuroSciPy tutorial, which will be used for the interactive session.

Topics covered:

  1. Why make it reproducible?
  2. Model Evaluation
  3. Benchmarking
  4. Model Sharing
  5. Testing ML Code
  6. Interpretability
  7. Ablation Studies

These topics are used as examples of “easy wins” researchers can implement to disproportionately improve the quality of their research output with minimal additional work using existing libraries and reusable code snippets.

Evaluating Machine Learning Models (Valerio Maggio)

In this talk, we will introduce the main features of a Machine Learning (ML) experiment. In the first part, we will dive into the benefits and pitfalls of common evaluation metrics (e.g. accuracy vs. F1 score), whilst the second part will focus mainly on designing reproducible and (statistically) robust evaluation pipelines.

The main lessons learnt and takeaway messages from the talk will be showcased in an interactive tutorial.
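To illustrate the accuracy vs. F1 pitfall named above, here is a small sketch with made-up labels (5 positive cases out of 100). It is not part of the speaker's tutorial, and the helper functions are written by hand for illustration rather than taken from a library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up imbalanced labels: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Model A always predicts the majority class.
pred_a = [0] * 100
# Model B finds 4 of 5 positives at the cost of 10 false alarms.
pred_b = [1] * 4 + [0] + [1] * 10 + [0] * 85

acc_a, f1_a = accuracy(y_true, pred_a), f1_score(y_true, pred_a)  # 0.95 and 0.0
acc_b, f1_b = accuracy(y_true, pred_b), f1_score(y_true, pred_b)  # 0.89 and about 0.42
```

Model A looks better by accuracy although it never detects a positive case, which is why the choice of metric matters on imbalanced data.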

ML for scientific insight (Mike Walmsley)

Building ML models is easy; answering science questions with them is hard. This short talk will introduce common issues in applying ML, illustrated with real failures from astronomy and healthcare - including some by the speaker. We hope sharing the lessons learned from these failures will help participants build useful models in their own field.

Testing in Machine Learning (Goku Mohandas)

What testing ML is and how it's different from testing deterministic code
Why it's important to test ML artifacts (data + models)
What testing data and testing models looks like (and I'll provide quick code snippets so people can see what it looks like)
Concluding thoughts on how testing relates to monitoring and continual learning.
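As a rough idea of what data tests and model tests can look like, consider the sketch below. These are not the speaker's snippets: `check_data`, `toy_model`, and the word lists are hypothetical stand-ins, and a real project would run such checks against a trained model, typically inside a test framework.

```python
VALID_LABELS = {"positive", "negative"}
POSITIVE_WORDS = {"great", "good", "excellent"}
NEGATIVE_WORDS = {"terrible", "bad", "poor"}


def check_data(rows):
    """Data tests: expected fields, valid labels, no duplicate texts."""
    errors, seen = [], set()
    for i, row in enumerate(rows):
        if set(row) != {"text", "label"}:
            errors.append(f"row {i}: unexpected fields {sorted(row)}")
            continue
        if row["label"] not in VALID_LABELS:
            errors.append(f"row {i}: invalid label {row['label']!r}")
        if row["text"] in seen:
            errors.append(f"row {i}: duplicate text")
        seen.add(row["text"])
    return errors


def toy_model(text):
    """Hypothetical stand-in for a trained sentiment classifier."""
    words = text.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words)
    score -= sum(w in NEGATIVE_WORDS for w in words)
    return "positive" if score > 0 else "negative"


def passes_invariance(model):
    """Swapping an irrelevant word should not change the prediction."""
    a = model("The telescope images are great")
    b = model("The microscope images are great")
    return a == b


def passes_directional(model):
    """Swapping a sentiment word should flip the prediction."""
    good = model("The results are good")
    bad = model("The results are bad")
    return good == "positive" and bad == "negative"


clean_rows = [
    {"text": "great signal", "label": "positive"},
    {"text": "poor calibration", "label": "negative"},
]
broken_rows = [
    {"text": "great signal", "label": "positive"},
    {"text": "great signal", "label": "maybe"},
]
clean_errors = check_data(clean_rows)
broken_errors = check_data(broken_rows)
```

Unlike tests for deterministic code, the model checks assert expected behaviour on chosen inputs rather than exact outputs for every input.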

Integrating ML in experimental pipelines (Gemma Turon)

This talk will focus on the integration of ML models into actual experimental pipelines. We will review strategies for sharing pre-trained models that can be readily adopted by non-expert users, and how to bridge the gap between dry-lab and wet-lab researchers, with case studies in the field of biomedicine. The interactive tutorial will use one such open-source hub of pre-trained models, the Ersilia Model Hub.

Additionally, here are some papers for further reading for the interested:


Prior Knowledge Expected: Previous knowledge expected

Jesper Dramsch works at the intersection of machine learning and physical, real-world data. Currently, they're working as a scientist for machine learning in numerical weather prediction at the coordinated organisation ECMWF.

Previously, Jesper worked on applied exploratory machine learning problems, e.g. satellites and Lidar imaging on trains, and defended a PhD in machine learning for geoscience. During the PhD, Jesper wrote multiple publications and often presented at workshops and conferences, eventually holding keynote presentations on the future of machine learning.

Moreover, they worked as a consultant and as a machine learning and Python educator for international companies and the UK government. Their courses on Skillshare have been watched for more than 30 days in total by over 5000 students. Additionally, they create educational notebooks on Kaggle, reaching rank 81 worldwide. Recently, Jesper was invited into the YouTube Partner Programme.


Valerio Maggio is a researcher and data scientist, currently working at Anaconda, Inc. as Senior Developer Advocate. Valerio is also a member of the Software Sustainability Institute, with a fellowship focused on privacy-preserving methods for data science and machine learning. Valerio is an active member of the Python community: over the years he has led the organisation of many international conferences like PyCon/PyData Italy, EuroPython, and EuroSciPy. In his free time, Valerio is a casual "Magic: The Gathering" player of the Premodern format, enjoying playing with friends all over the world, and contributing to the community.

Trained as a molecular biologist, Gemma completed a PhD in colorectal cancer and stem cells at IRB Barcelona in 2019, before taking a one-year break to focus on working and volunteering in the third sector. This shifted her scientific interest to global health and neglected diseases, and to the existing barriers to tackling some of the most urgent health issues in developing countries. With Ersilia, she aims to explore new ways of community building and engagement in the scientific arena, at the intersection between academia, biotech start-ups and NPOs.

Postdoc using ML and citizen science to answer astrophysics questions. Lead data scientist for Galaxy Zoo.

🌏 Founder @MadeWithML
🍎 AI Research @Apple
βš•οΈ ML Lead @Ciitizen (acq.)
πŸŽ“ CS/ML @GeorgiaTech
πŸŽ“ Chem/Bio @JohnsHopkins