12-01, 16:30–17:00 (UTC), Talk Track II
Pandas’ current behavior on whether indexing returns a view or a copy is confusing, even for experienced users. But it doesn’t have to be this way. By simplifying the copy/view rules, we can make this aspect of pandas easier to grasp, make pandas more memory-efficient at the same time, and get rid of the SettingWithCopyWarning.
Users of pandas have probably run into the infamous “SettingWithCopyWarning”. Several lengthy blog posts and popular Stack Overflow questions go into detail on what it is and how to deal with it. At the core of the problem, pandas’ current behavior on whether indexing returns a view or a copy is confusing. For most users, pandas’ internals are a black box, and it is hard to reason about how a column’s data is stored in memory. Even for experienced users, it’s hard to tell whether a view or a copy will be returned.
But it doesn’t have to be this way. We can simplify the rules and let any indexing operation or method that returns a new DataFrame always behave as if it were a copy (so that mutating the result never modifies the original DataFrame). Using the concept of copy-on-write, we can make this aspect of pandas easier to grasp, and at the same time make pandas more memory-efficient.
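As a rough illustration of the rule described above, here is a minimal sketch. It assumes a pandas version that either ships the opt-in `mode.copy_on_write` option (to my knowledge, 1.5 or later) or already uses copy-on-write by default; the helper `enable_copy_on_write` and the example data are made up for this illustration and are not part of the proposal itself.

```python
import warnings

import pandas as pd


def enable_copy_on_write():
    """Opt in to copy-on-write where it is an option (hypothetical helper).

    Assumption: on versions where copy-on-write is already the default,
    setting the option is either a no-op, warns, or the option is gone,
    so warnings and errors are deliberately swallowed here.
    """
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            pd.set_option("mode.copy_on_write", True)
    except Exception:
        pass


enable_copy_on_write()

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A row slice: under the old rules this is typically a view of df.
subset = df.iloc[0:2]
# Mutating the result triggers the actual copy; df is left alone.
subset.iloc[0, 0] = 100

# Selecting a single column behaves the same way.
col = df["b"]
col.iloc[0] = -1

print(df)      # the original keeps its initial values
print(subset)  # only the derived object sees the change
```

Under these semantics the data is only copied at the moment one of the objects is written to, which is where the memory savings come from: results that are never mutated can keep sharing memory with their parent.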
In this talk, I will give a brief background on the current internals of pandas related to copies and views and why we have the SettingWithCopyWarning. Then, I will explain the proposal to greatly simplify the rules around copy and view semantics in pandas, and how we can get rid of the SettingWithCopyWarning.
Previous knowledge expected
I am a core contributor to pandas and Apache Arrow, and maintainer of GeoPandas. I did a PhD at Ghent University and VITO in air quality research and worked at the Paris-Saclay Center for Data Science. Currently, I work at Voltron Data, contributing to Apache Arrow, and am a freelance teacher of Python (pandas) at Ghent University.