PyData Global 2022

Supercharging your pandas workflows with Modin
12-02, 09:30–10:00 (UTC), Talk Track I

Data practitioners are typically forced to choose between tools that are easy to use (pandas) and tools that are highly scalable (Spark, SQL, etc.). Modin, an open source project originally developed by researchers at UC Berkeley, is a highly scalable, drop-in replacement for pandas.

This talk will give an overview of Modin and practical examples of how to use it to scale up your pandas workflows with minimal effort.


pandas is one of the most popular data science libraries, with between 5 and 10 million users. Data scientists use it to clean, analyze, featurize, explore, transform, and model data. However, pandas breaks down at scale, making it difficult for end users to move beyond small, toy datasets and generalize their insights.

Modin is a highly scalable, drop-in replacement for pandas. The open source project has been downloaded 4 million times and is in use by teams in the Fortune 500 as well as high-growth technology companies. Grounded in years of research and development at UC Berkeley’s RISELab, Modin eliminates the complexity of working directly with distributed systems and lets users continue to use the pandas syntax at massive scale.
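As a rough illustration of what "drop-in replacement" means in practice, the sketch below runs the same pandas-style code against whichever library is installed. The usual switch is simply changing `import pandas as pd` to `import modin.pandas as pd`. The helper names `first_importable` and `mean_by_group`, and the toy data, are hypothetical and are not taken from the talk.

```python
import importlib


def first_importable(module_names):
    """Return the first module in module_names that imports cleanly, else None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


def mean_by_group(pd):
    """Run ordinary pandas-style code against whichever module was passed in."""
    # Hypothetical toy data, used only to show that the calling code is unchanged.
    df = pd.DataFrame({"team": ["a", "a", "b"], "score": [1.0, 3.0, 5.0]})
    return df.groupby("team")["score"].mean()


if __name__ == "__main__":
    # Prefer Modin when it is installed; otherwise fall back to plain pandas.
    pd = first_importable(["modin.pandas", "pandas"])
    if pd is not None:
        print(mean_by_group(pd))
```

Nothing in `mean_by_group` refers to Modin specifically, which is the point: only the import changes. Whether a given workload actually runs faster depends on data size and on the execution engine Modin is configured with.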

This talk will give you an overview of Modin and walk you through practical examples so you can try it yourself. Our demo will explore the use of Modin in conjunction with the popular HuggingFace NLP Transformer library.

To learn more about the project, visit https://github.com/modin-project/modin


Prior Knowledge Expected

No previous knowledge expected

Alejandro Herrera is a Solution Architect at Ponder.

Ponder provides enterprise-ready tools in Python for rapid, flexible experimentation with data at scale. Ponder makes data teams more productive by enabling them to get insights faster with tools they know and love.