PyData Global 2022

You don't need a cluster for that: using embedded SQL engines for plotting massive datasets on a laptop
12-02, 22:00–22:30 (UTC), Talk Track I

This talk will show you a simple yet effective technique to visualize larger-than-memory datasets on your laptop by leveraging SQLite or DuckDB. No need to spin up a Spark cluster!


Data visualization is an essential skill for every data practitioner. The typical approach for visualizing data is to use pandas for data cleaning and matplotlib (or seaborn) for visualization. However, this approach falls short on large datasets, since pandas and matplotlib require you to load the entire dataset into memory.

In such cases, a practitioner might think of using a Spark or Dask cluster; however, this adds a lot of complexity since the cluster needs configuration and maintenance. Furthermore, this might not be possible if you don't have access to such infrastructure.

This talk will show you how to use SQLite (or DuckDB) to plot massive datasets from your laptop efficiently. With this approach, you can plot datasets that do not fit into memory without maintaining any extra infrastructure.
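As a rough illustration of the idea (not the speaker's actual code), the sketch below lets SQLite compute histogram bin counts, so only one row per bin ever reaches Python. The function names `histogram_counts` and `plot_histogram`, the `trips` table, and the `distance` column are hypothetical, and the table is assumed to be non-empty.

```python
import sqlite3


def histogram_counts(conn, table, column, bins=10):
    """Return (edges, counts) for `column`, aggregating inside SQLite.

    Assumption: `table` and `column` are trusted identifiers (they are
    interpolated into the SQL text) and the table has at least one row.
    """
    lo, hi = conn.execute(
        f"SELECT MIN({column}), MAX({column}) FROM {table}"
    ).fetchone()
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / bins or 1.0
    rows = conn.execute(
        f"""
        SELECT MIN(CAST(({column} - ?) / ? AS INTEGER), ?) AS bin_idx,
               COUNT(*) AS n
        FROM {table}
        WHERE {column} IS NOT NULL
        GROUP BY bin_idx
        ORDER BY bin_idx
        """,
        (lo, width, bins - 1),  # the maximum value is clamped into the last bin
    ).fetchall()
    counts = [0] * bins  # bins with no rows stay at zero
    for bin_idx, n in rows:
        counts[bin_idx] = n
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


def plot_histogram(edges, counts):
    """Draw the pre-aggregated bins; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(edges[:-1], counts, width=edges[1] - edges[0], align="edge")
    return ax


# Tiny in-memory demo; a real dataset would live in a SQLite file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (distance REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?)", [(v,) for v in [1, 2, 2, 3, 3, 3, 10]]
)
edges, counts = histogram_counts(conn, "trips", "distance", bins=3)
print(edges, counts)
```

Calling `plot_histogram(edges, counts)` would then draw the bars. The memory used in Python depends on the number of bins, not the number of rows.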

Outline:
[0 - 4 minutes] Why pandas fails
[4 - 10] How SQLite/DuckDB help us scale data visualization
[10 - 18] Use case: plotting histograms
[18 - 26] Use case: plotting boxplots
[26 - 28] Summary and conclusions
[28 - 30] Q&A
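
For the boxplot use case in the outline, one possible approach (an assumption, not necessarily what the talk presents) is to compute the five summary statistics in SQLite and pass them to matplotlib's `Axes.bxp`, which draws a boxplot from precomputed statistics. The helper names, the `trips` table, and the `distance` column are hypothetical. Quartiles use the nearest-rank definition, so they can differ slightly from matplotlib's interpolated defaults, and outlier points are not fetched.

```python
import math
import sqlite3


def quantile(conn, table, column, q):
    """Nearest-rank quantile, computed by SQLite with ORDER BY / OFFSET.

    Assumption: `table` and `column` are trusted identifiers and the column
    has at least one non-NULL value.
    """
    n = conn.execute(f"SELECT COUNT({column}) FROM {table}").fetchone()[0]
    offset = min(n - 1, max(0, math.ceil(q * n) - 1))
    return conn.execute(
        f"SELECT {column} FROM {table} WHERE {column} IS NOT NULL "
        f"ORDER BY {column} LIMIT 1 OFFSET ?",
        (offset,),
    ).fetchone()[0]


def boxplot_stats(conn, table, column):
    """Return a stats dict in the shape expected by matplotlib's Axes.bxp."""
    q1 = quantile(conn, table, column, 0.25)
    med = quantile(conn, table, column, 0.50)
    q3 = quantile(conn, table, column, 0.75)
    iqr = q3 - q1
    # Whiskers: the most extreme values still within 1.5 * IQR of the box.
    whislo, whishi = conn.execute(
        f"SELECT MIN({column}), MAX({column}) FROM {table} "
        f"WHERE {column} BETWEEN ? AND ?",
        (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
    ).fetchone()
    return {"q1": q1, "med": med, "q3": q3,
            "whislo": whislo, "whishi": whishi, "fliers": []}


def plot_boxplot(stats):
    """Draw the box from precomputed stats; matplotlib is imported lazily."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bxp([stats], showfliers=False)
    return ax


# Tiny in-memory demo; a real dataset would live in a SQLite file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (distance REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?)",
    [(v,) for v in [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]],
)
stats = boxplot_stats(conn, "trips", "distance")
print(stats)
```

Calling `plot_boxplot(stats)` would draw the box. Each quantile query sorts the column inside SQLite, so an index on that column helps on large tables.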


Prior Knowledge Expected

Previous knowledge expected

Eduardo Blancas is the Co-Founder and CEO of Ploomber, a Y Combinator-backed company developing tools to bridge the gap between interactive data work and production. Before that, he was a Data Scientist at Fidelity Investments, where he deployed the first customer-facing Machine Learning model for asset management. Eduardo holds an M.S. in Data Science from Columbia University and a B.S. in Mechatronics Engineering from Tecnológico de Monterrey.
