PyData Global 2022

Speed up Python data processing with vectorization
12-03, 15:30–16:00 (UTC), Talk Track I

You need to quickly process a large amount of data—but running Python code is slow.
Libraries like NumPy and Pandas bridge this performance gap using a technique called vectorization.
In order to take full advantage of these libraries to speed up your code, it's helpful to understand what vectorization means and when and how it works.

In this talk you'll learn what vectorization means (there are three different definitions!), how it speeds up your code, and how to apply it to your code.


You need to quickly process a large amount of data—but running Python code is slow.
To help bridge this performance gap, the scientific and data science Python communities have built libraries like NumPy and Pandas that speed up computation using a technique called vectorization: batch APIs with fast native processing.
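
As a rough illustration of "batch APIs with fast native processing", here is a minimal sketch that computes a sum of squares two ways. The function names and the sample data are made up for this example and are not taken from the talk; the point is only that the second version hands the whole array to NumPy in one call instead of visiting each element from Python.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Plain Python: the interpreter handles every element one at a time."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_vectorized(values):
    """Batch API: one call, with the per-element loop running in native code."""
    arr = np.asarray(values, dtype=np.float64)
    return float(np.dot(arr, arr))


# Hypothetical sample data, chosen only for illustration.
data = np.arange(1000, dtype=np.float64)

loop_result = sum_of_squares_loop(data)
vectorized_result = sum_of_squares_vectorized(data)
```

Both functions return the same number. How much faster the vectorized version runs depends on the array size and the machine, so it is worth measuring on your own data, for example with `timeit`.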

In order to take full advantage of these libraries to speed up your code, it's helpful to understand what vectorization means and when and how it works.
That way you can make sure you're using the fast path and avoiding code patterns that slow down your code.

In this talk you'll learn:

  • The three definitions of vectorization: API design, native batch processing, and SIMD.
  • How vectorization allows your code to run multiple orders of magnitude faster.
  • How to identify both vectorized code and code that runs slowly because it breaks vectorization.
  • How to turn slow code into fast vectorized code.
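
To make the last two points concrete, here is a small sketch, assuming a made-up task of replacing negative values with zero. The function names are hypothetical. It shows an explicit Python loop, a version using `np.vectorize` (which the NumPy documentation describes as a convenience that is essentially a for loop, so it offers a batch-style API without native batch processing), and a version using `np.where`, where the comparison and selection happen inside NumPy.

```python
import numpy as np


def clip_negative_loop(arr):
    """Slow pattern: index into the array element by element from Python."""
    out = np.empty_like(arr)
    for i in range(len(arr)):
        out[i] = arr[i] if arr[i] > 0 else 0.0
    return out


# Looks like a batch API, but still calls the Python function once per element.
clip_negative_np_vectorize = np.vectorize(
    lambda x: x if x > 0 else 0.0, otypes=[float]
)


def clip_negative_vectorized(arr):
    """Fast pattern: the comparison and selection run as whole-array operations."""
    return np.where(arr > 0, arr, 0.0)


# Hypothetical sample data, chosen only for illustration.
samples = np.array([-3.0, -1.5, 0.0, 2.0, 4.5])

loop_out = clip_negative_loop(samples)
np_vectorize_out = clip_negative_np_vectorize(samples)
vectorized_out = clip_negative_vectorized(samples)
```

All three produce the same array. One practical warning sign for broken vectorization is Python-level code (a `for` loop, a `lambda`, or a per-element callback) sitting between you and the array data.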

The talk presumes some minimal experience with NumPy or Pandas, and most of the examples will involve NumPy.
However, the same principles apply to Pandas, and more broadly to many other data processing libraries and to databases.


Prior Knowledge Expected

No previous knowledge expected

Itamar Turner-Trauring is the creator of Sciagraph, a performance observability service for Python data pipelines that lets you get performance and memory profiling for your production batch jobs. He is also the author of the open source Fil memory profiler for Python. He writes about Python performance at https://pythonspeed.com.