PyData Global 2022

How to maximally parallelize the entire pandas API
12-02, 11:00–11:30 (UTC), Talk Track I

pandas has rapidly become one of the most popular tools for data analysis, but is limited by its inability to scale to large datasets. We developed Modin, a scalable, drop-in alternative to pandas, that preserves the dynamic and flexible behavior of pandas dataframes while enhancing the scalability.

This talk will walk you through our team’s research at UC Berkeley, which enabled the development of Modin. We’ll also discuss our latest publication at VLDB, which covers a novel approach to parallelization and metadata management techniques for dataframes.

pandas has rapidly become the tool of choice for most data scientists - its flexibility and compatibility with many other data science libraries enables rapid prototyping and iteration. In production; however, pandas often falls flat due to its inability to scale - pandas is single threaded, and only operates ion memory. This introduces a barrier in the data science workflow: data scientists prefer to use pandas to author their data science workflows; however, to run the same workflows at scale, they need to be rewritten to use more traditional data analytic tools, like Spark or SQL. We have been developing Modin, a scalable drop-in replacement for pandas that allows users to operate on data at scale, without having to rewrite their pandas workflows.

However, pandas’ extensive API features around 600 operators, making it difficult to optimize at scale. To address the scaling challenges, we draw inspiration from relational databases to develop a dataframe algebra, which can be composed to express any pandas query. By designing an extensible translation layer from the pandas API to our dataframe algebra, we enable Modin to work on distributed data, as well as optimize queries to reduce latency. This modular architecture, combined with Modin’s decomposition rules for dataframes, and its metadata independence, allow it to deliver performance at scale.

This talk will discuss the architecture of Modin, as well as delve into the key research insights developed at UC Berkeley that enabled its creation. Systems Researchers and Data science practitioners can expect to learn about the core design principles underlying Modin, as well as get a deep dive into Modin as a system.

Prior Knowledge Expected

No previous knowledge expected

Rehan Durrani is one of the core developers of Modin and a founding engineer at Ponder. Rehan studied Electrical Engineering and Computer Science at UC Berkeley and has contributed to leading open source research projects including Modin and Ray, as well as lead open source research projects like Clipper. His work on Modin has been published in leading publications, including VLDB.