PyData Global 2022

Reactive data processing in Python
12-01, 20:00–20:30 (UTC), Talk Track I

Machine Learning models designed to work with streaming systems make decisions on new data points as they arrive. But there is a downside: model decisions can't be easily changed later when the model is updated with fresher data, user feedback, or freshly tuned hyperparameters. This is often a blocker for anomaly detection, recommender systems, process mining, and human-in-the-loop planning.

To deal with this, we'll demonstrate design patterns to easily express reactive data processing logic. We will use Pathway, a scalable data processing framework built around a Python programming interface. Pathway is battle-tested with operational data in enterprise, including graphs and event streams in real-world supply chains, and is now launching as open-core.

You will leave the talk with a thorough understanding of the practical engineering challenges behind reactive data processing with a Machine Learning angle to it, and the steps needed to overcome these challenges.


In stream processing, Machine Learning models make decisions on new data points as soon as they arrive. Such immediate decisions are extremely useful, but not always the best. For example, in anomaly detection on a stream of events, new effects or trends can usually only be detected with confidence some time after they have started. Past decisions will then need to be revisited and reclassified, but which ones exactly? Stream processing does not offer a direct answer, and full batch recomputation can be extremely costly. The same problem arises in numerous contexts: recommender systems, process mining, ontology querying, human-in-the-loop planning systems, and more. How can you gracefully handle data and models that need revisiting over time, without over-complicating even the simplest data transformations?
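To make the "which ones exactly?" question concrete, here is a minimal plain-Python sketch. It is not taken from the talk and does not use Pathway's API. The class `ReactiveThresholdDetector`, its method names, and the simple threshold "model" are all hypothetical, chosen only to illustrate the idea that a model update should emit changes for the affected past decisions rather than trigger a full recomputation.

```python
from dataclasses import dataclass, field


@dataclass
class ReactiveThresholdDetector:
    """Toy anomaly detector (hypothetical) that can revise past decisions."""

    threshold: float
    values: dict = field(default_factory=dict)     # event_id -> observed value
    decisions: dict = field(default_factory=dict)  # event_id -> is_anomaly

    def on_event(self, event_id, value):
        """Classify a new event immediately; return the emitted change."""
        self.values[event_id] = value
        self.decisions[event_id] = value > self.threshold
        # A change is (event_id, old_decision, new_decision); old is None if new.
        return [(event_id, None, self.decisions[event_id])]

    def on_model_update(self, new_threshold):
        """Apply an updated threshold; return only the decisions that flipped."""
        low, high = sorted((self.threshold, new_threshold))
        self.threshold = new_threshold
        changes = []
        # For simplicity this scans all stored values; a sorted index would
        # let us visit only the band between the old and new thresholds.
        for event_id, value in self.values.items():
            # Only values in (low, high] can change class.
            if low < value <= high:
                old = self.decisions[event_id]
                new = value > new_threshold
                if old != new:
                    self.decisions[event_id] = new
                    changes.append((event_id, old, new))
        return changes


detector = ReactiveThresholdDetector(threshold=10.0)
detector.on_event("a", 5.0)
detector.on_event("b", 12.0)
detector.on_event("c", 20.0)
# Raising the threshold to 15.0 revises only event "b".
revised = detector.on_model_update(15.0)
```

Downstream consumers would receive only the entries in `revised`, each carrying the old and the new decision, instead of a recomputed copy of every past decision.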

During the talk, you will learn the key engineering steps needed to deal with such problems through a reactive data processing design. Achieving such a design was our primary motivation to build Pathway. Pathway is a scalable data processing framework centered around a Python programming interface. It is deployed for processing live operational data in enterprise, including graphs and event streams in real-world supply chains, and is now becoming publicly available in an open-core model.

We will show you design patterns that let you easily express reactive Machine Learning logic. We will highlight where you can rely on the usual Python data science stack and external libraries, and where special attention is needed. Most design patterns will feel familiar to users of Pandas or PySpark dataframes, so we will focus on the key differences and on why they are necessary for efficient reactive operation.
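As one illustration of how a reactive pattern can differ from its dataframe counterpart, consider a grouped mean. In Pandas this would be a one-off batch call such as `df.groupby("topic")["score"].mean()`. The sketch below maintains the same result incrementally as rows are inserted and retracted. It is an assumption-laden toy, not Pathway's API and not the talk's code: the class `IncrementalGroupMean` and the `+1`/`-1` diff convention are hypothetical.

```python
from collections import defaultdict


class IncrementalGroupMean:
    """Maintains per-key means under row insertions and retractions."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def mean(self, key):
        """Return the current mean for key, or None if the group is empty."""
        count = self._count.get(key, 0)
        return self._sum[key] / count if count else None

    def apply(self, key, value, diff):
        """Apply one change: diff=+1 inserts a row, diff=-1 retracts it.

        Returns (key, old_mean, new_mean) so downstream steps can react
        to the change without recomputing the other groups.
        """
        if diff not in (1, -1):
            raise ValueError("diff must be +1 or -1")
        if diff == -1 and self._count.get(key, 0) == 0:
            raise ValueError(f"nothing to retract for key {key!r}")
        old = self.mean(key)
        self._sum[key] += diff * value
        self._count[key] += diff
        if self._count[key] == 0:
            # Drop empty groups so they do not linger in the state.
            del self._sum[key]
            del self._count[key]
        return key, old, self.mean(key)


agg = IncrementalGroupMean()
agg.apply("python", 10.0, +1)
agg.apply("python", 20.0, +1)
# Retracting the first row updates only the "python" group.
change = agg.apply("python", 10.0, -1)
```

Each update costs constant time per affected group, whereas rerunning the batch `groupby` would touch every row again. The price is that the operator has to keep state (sums and counts) and has to handle retractions explicitly.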

During the talk we will run a code demo and show you how to create your own reactive data pipeline and microservice. The example will be a reactive app that predicts the future popularity and sentiment of trending topics in a well-known social network, across different geographies. We will fill in key steps in the code together, and then see it in action in full deployment (with source data API integration and a frontend connected with FastAPI).

The talk is addressed to anyone - Machine Learning Engineers, Software Engineers, and Data Engineers - with an interest in building "smart" data pipelines and data products in a real-time or streaming setting. You will leave the talk with a thorough understanding of the practical engineering challenges behind reactive data processing with a Machine Learning angle, and the steps needed to overcome these challenges.


Prior Knowledge Expected

No previous knowledge expected

See also: Slides (1.8 MB)

Adrian obtained his PhD in discrete algorithms at the age of 20. He specializes in network science and in modeling processes that involve graphs, time, and all things random. During a decade in academia, he ran projects on transportation systems, route planning, and logistics across Europe. He likes to experiment with data using a Python data science stack whenever he can. A big fan of competitive programming, and an even bigger fan of 24-hour contests, Adrian co-founded spoj.com, which has been used by about a million people to boost their programming skills. He happens to be the author of some of the most bizarre problem storylines you will find there. For research audiences, he has given talks on topics ranging from synchronization in distributed systems to path-finding algorithms, including two Best Paper talks at major ACM conferences.
As one of the co-creators of Pathway, Adrian has spent the last two years shaping its development directions, contributing code, and obsessing over usability.