PyData Global 2022

Anomaly Detection on Streaming Data in Python using Bytewax and River
12-01, 15:00–16:30 (UTC), Workshop/Tutorial II

Bytewax is an open source, Python-native framework and distributed processing engine for data streams. It makes it easy to build everything from pipelines for anonymizing data to more sophisticated systems for fraud detection, personalization, and more. In this tutorial, we will cover how you can use Bytewax and the Python library River to build an online machine learning system that detects anomalies in IoT data from streaming systems like Kafka and Redpanda. This tutorial is for data scientists, data engineers, and machine learning engineers interested in machine learning and streaming data. By the end of the tutorial session you will know how to:
- run a streaming platform like Kafka or Redpanda in a Docker container,
- develop a Bytewax dataflow,
- run a River anomaly detection algorithm to detect anomalous data.

The tutorial material will be available in a GitHub repository, and the content will be covered roughly according to the timeline below.

  • 0-10min - Introduction to stream processing and online machine learning
  • 10-30min - Set up the streaming system and prepare the data
  • 30-60min - Write the Bytewax dataflow and anomaly detector code
  • 60-90min - Tune the anomaly detector and run the Bytewax dataflow successfully

The proliferation of connected devices, from smart appliances to connected cars, has created a landslide of data. Determining in real time whether these (sometimes massive) networks of connected devices are functioning properly can be beyond the capability of humans. The difficulty lies not only in the volume of data but also in changing environmental variables.

To help us build systems that can scale to the volume of data and handle changing environments, we can use two Python tools: Bytewax and River. Bytewax is a stateful data processing framework and engine that allows us to scale our processing through parallelization to meet the volume requirements. River is a Python library focused on online machine learning, where the model is updated incrementally and stored in state. River's algorithms are particularly well suited to dynamic environments.
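To make "updated incrementally and stored in state" concrete, here is a minimal sketch in plain Python. It is not River's implementation: `RunningZScore` is a hypothetical stand-in that keeps a running mean and variance (Welford's algorithm) and scores each reading before learning from it. The method names `score_one` and `learn_one` echo River's naming convention, and the threshold of 3.0 and the sample readings are arbitrary assumptions.

```python
import math


class RunningZScore:
    """Hypothetical stand-in for an online anomaly detector (not River's API)."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def score_one(self, x):
        """Return the absolute z-score of x under the current state."""
        if self.n < 2:
            return 0.0    # not enough data to estimate a spread yet
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def learn_one(self, x):
        """Fold one observation into the state using Welford's algorithm."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self


detector = RunningZScore()
readings = [20.1, 19.9, 20.0, 20.2, 19.8, 20.1, 35.0, 20.0]  # made-up sensor values
flags = []
for value in readings:
    score = detector.score_one(value)  # score first, using only past data
    flags.append(score > 3.0)          # 3.0 is an arbitrary threshold
    detector.learn_one(value)          # then update the model in place
```

The whole model is three numbers, so it can be stored as state and updated one event at a time without revisiting earlier data. Note that the outlier is also learned, which widens the estimated spread for later readings; that trade-off is part of what tuning a detector involves.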

In this tutorial session, you will get a better understanding of how you can use online machine learning algorithms to detect anomalies across hundreds of sensors. This session will guide you through how to set up a development environment with a streaming system (Kafka or similar), load sensor data to the streaming system with Bytewax, and write a dataflow that will transform the data and use different anomaly detection algorithms to determine if there are anomalies in the sensor data.
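The idea of detecting anomalies across many sensors rests on keeping separate state per sensor. The sketch below illustrates that pattern in plain Python under stated assumptions: it does not use the Bytewax or River APIs, the helper names `make_state` and `process` are hypothetical, the detector is a simple exponentially weighted mean and deviation chosen for brevity, and the events are made up. In a real dataflow, a stateful step keyed by sensor ID would play the role of the dictionary.

```python
from collections import defaultdict


def make_state():
    # Per-sensor state: exponentially weighted mean and mean absolute deviation.
    return {"mean": None, "dev": 0.0}


def process(state, value, alpha=0.2, k=4.0):
    """Score one reading against a sensor's state, then update that state."""
    if state["mean"] is None:
        state["mean"] = value          # first reading only initializes the state
        return False
    error = abs(value - state["mean"])
    is_anomaly = state["dev"] > 0 and error > k * state["dev"]
    state["mean"] += alpha * (value - state["mean"])
    state["dev"] += alpha * (error - state["dev"])
    return is_anomaly


# Interleaved (sensor_id, reading) events, as they might arrive from a topic.
events = [
    ("s1", 20.0), ("s2", 70.0), ("s1", 20.5), ("s2", 69.0), ("s1", 19.8),
    ("s2", 70.5), ("s1", 20.2), ("s2", 120.0), ("s1", 20.1),
]

states = defaultdict(make_state)       # one independent state per sensor key
anomalies = [
    (sensor_id, value)
    for sensor_id, value in events
    if process(states[sensor_id], value)
]
```

Because each key owns its state, a reading of 120.0 is judged against sensor `s2`'s own history rather than a global average, and keys can be spread across workers without sharing state.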

Prior Knowledge Expected

No previous knowledge expected

Zander is the CEO and Founder of Bytewax, a Python stream processing framework. Before Bytewax, he worked as a data scientist at GitHub and Heroku. He lives in Santa Cruz, California, and when not at his computer likes to get outdoors.