PyData Global 2022

Inequality Joins in Pandas with Pyjanitor
12-01, 10:00–10:30 (UTC), Talk Track I

Inequality joins are less common than equality joins, but are useful in temporal analytics and even in some conventional applications. Pyjanitor fills this gap in Pandas with an efficient implementation.


Imagine a manufacturer wishing to minimise the cost of storage while maximising profits (increasing the inventory of the more profitable product, while decreasing the storage for the less profitable product), or a tax audit to find out which employers earn more, but pay less tax. Or it could be as simple as efficiently finding which date ranges in one dataframe contain the dates from another dataframe. These are problems that can be solved by inequality joins. At the moment, the way to solve them in Pandas is with a cartesian join followed by a filter, which can be expensive memory-wise and is generally inefficient. This talk aims to show a better, more efficient way of solving inequality joins within Pandas.
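
To make the cartesian-join approach concrete, here is a minimal sketch in plain Pandas of the date-range case mentioned above. The dataframes, column names and values are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
import pandas as pd

# Hypothetical data: events carrying a date, and periods with start/end dates.
events = pd.DataFrame({
    "event_id": [1, 2, 3],
    "date": pd.to_datetime(["2022-01-05", "2022-02-10", "2022-03-20"]),
})
periods = pd.DataFrame({
    "period": ["A", "B"],
    "start": pd.to_datetime(["2022-01-01", "2022-02-01"]),
    "end": pd.to_datetime(["2022-01-31", "2022-02-28"]),
})

# Step 1: cartesian (cross) join. Every event is paired with every period,
# so the intermediate frame has len(events) * len(periods) rows.
pairs = events.merge(periods, how="cross")

# Step 2: keep only the pairs that satisfy the inequality conditions
# start <= date <= end.
in_range = (pairs["date"] >= pairs["start"]) & (pairs["date"] <= pairs["end"])
matched = pairs[in_range].reset_index(drop=True)

print(matched[["event_id", "period"]])
```

The intermediate `pairs` frame grows with the product of the two row counts, which is where the memory cost comes from. Pyjanitor provides a `conditional_join` function intended for this kind of join; it is not shown here.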

The talk will describe the algorithms implemented and present some speed tests comparing performance.


Prior Knowledge Expected

No previous knowledge expected

Data Engineer, love open source, contributor to pydatatable and pyjanitor.