PyData Global 2022

Scale Data Science by Pandas API on Spark
12-02, 20:00–20:30 (UTC), Talk Track I

With Python emerging as the primary language for data science, pandas has rapidly become one of the standard data science libraries. A well-known limitation of pandas is that it does not scale with data volume, because it processes everything on a single machine.
Pandas API on Spark overcomes this limitation, enabling users to work with large datasets by leveraging Apache Spark. In this talk, we will introduce Pandas API on Spark and show how to scale your existing data science workloads with it. We will also share its latest cutting-edge features.
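
To make this concrete, here is a minimal sketch of the idea: the pandas-style code stays essentially the same, and only the import changes. The CSV file and column names below are hypothetical.

    import pyspark.pandas as ps  # instead of: import pandas as pd

    # Read a (potentially huge) CSV as a distributed, pandas-like DataFrame.
    psdf = ps.read_csv("sales.csv")  # hypothetical file

    # Familiar pandas operations now run distributed on Spark.
    psdf["revenue"] = psdf["price"] * psdf["qty"]  # hypothetical columns
    print(psdf.groupby("region")["revenue"].sum().sort_values(ascending=False))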


Python data science has exploded over the last few years. pandas is the de facto standard (single-node) DataFrame implementation in Python, while Apache Spark is the de facto standard for big data processing.
Pandas API on Spark is a new module in Apache Spark that implements the pandas API on top of Spark SQL. It benefits from the optimizations of Spark's Catalyst optimizer and makes it easy to switch between the pandas API and existing PySpark features.
In this talk, we will show how Pandas API on Spark optimizes single-machine performance and scales well beyond a single machine. Furthermore, we will highlight the latest updates to Pandas API on Spark.
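
As a sketch of that switch between the two APIs (assuming Spark 3.2 or later, where Pandas API on Spark is bundled):

    from pyspark.sql import SparkSession
    import pyspark.pandas as ps

    spark = SparkSession.builder.getOrCreate()

    # Start with a pandas-on-Spark DataFrame.
    psdf = ps.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

    # Drop down to a plain PySpark DataFrame to use Spark SQL directly.
    sdf = psdf.to_spark()
    sdf.createOrReplaceTempView("t")
    doubled = spark.sql("SELECT id, value * 2 AS value FROM t")

    # Come back to the pandas API on Spark (DataFrame.pandas_api, Spark 3.2+).
    print(doubled.pandas_api().head())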


Prior Knowledge Expected

Previous knowledge expected

Xinrong Meng is a software engineer at Databricks and an Apache Spark committer, focusing on PySpark. She is one of the major contributors to Pandas API on Spark.

Takuya Ueshin is a software engineer at Databricks, an Apache Spark committer, and a PMC member. His main interests are Spark SQL internals and PySpark. He is one of the major contributors to Pandas API on Spark, formerly known as the Koalas project.