12-02, 14:00–14:30 (UTC), Talk Track II
The Awkward Array project provides a library for operating on nested,
variable-length data structures with NumPy-like idioms. We present two
projects that provide native support for Awkward Arrays in the broader
PyData ecosystem. In dask-awkward we have implemented a new Dask
collection to scale up and distribute workflows with partitioned
Awkward Arrays. In awkward-pandas we have implemented a new Pandas
extension array type, making it easy to use Awkward Arrays in Pandas
workflows and enabling massive acceleration in the processing of
nested data. We will show how these projects plug into PyData and
present some compelling use cases.
Dask provides core collections for scaling up workflows that use NumPy
arrays and Pandas DataFrames. The collection interface defined by Dask
allows for the creation of custom collections. We will describe how we
created a collection to bring support for Awkward Arrays to Dask, and
explain some of the advantages of using a native collection over
alternatives in the Dask ecosystem: for example, we are able to
leverage Dask's high-level task graph layers and implement dedicated
optimizations.
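As a rough illustration of the partitioned model described above, the sketch below splits a small nested dataset into partitions and applies the same function to each one, which is conceptually what a Dask collection schedules as tasks. The pure-Python part is only a stand-in and not dask-awkward's implementation. The optional part assumes the `awkward` and `dask_awkward` packages are installed and uses `dask_awkward.from_awkward`. The sample data and the helper name `row_sums` are hypothetical.

```python
# Hypothetical nested, variable-length data: one list of numbers per record.
records = [[1, 2, 3], [], [4, 5], [6], [7, 8, 9, 10]]


def row_sums(partition):
    """Per-partition work: sum each variable-length row."""
    return [sum(row) for row in partition]


# Stand-in for partitioning: two chunks that can be processed independently.
partitions = [records[:3], records[3:]]
partial_results = [row_sums(p) for p in partitions]

# Concatenate the per-partition results into one flat result.
combined = [value for part in partial_results for value in part]
print(combined)

# Optional: the same reduction through dask-awkward, if it is installed.
try:
    import awkward as ak
    import dask_awkward as dak
except ImportError:
    dak = None

if dak is not None:
    lazy = dak.from_awkward(ak.Array(records), npartitions=2)
    lazy_sums = dak.sum(lazy, axis=1)  # lazy: nothing is computed yet
    print(lazy_sums.compute().to_list())
```

In the stand-in, every partition is computed eagerly. With a Dask collection the same per-partition work is deferred until `compute()` is called.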
Pandas provides the ExtensionArray interface to create and register
new types of arrays that Pandas can recognize. We will show, with
example use cases, how adding a new Awkward ExtensionArray improves
the performance of operating on nested data in Pandas workflows. For
example, Python for-loops in nested Pandas workflows can be sped up by
more than 100x with the equivalent Awkward operations.
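To make the for-loop comparison concrete, here is a minimal sketch of a per-row reduction over nested data, written first as a Python loop and then, if the `awkward` package is installed, as a single `ak.max` call. The sample data and the helper name `row_max_loop` are hypothetical. The sketch does not measure the speedup and leaves out wrapping the array in the awkward-pandas extension type.

```python
# Hypothetical nested column: each row holds a variable-length list.
rows = [[1.0, 2.0], [], [3.0, 4.0, 5.0]]


def row_max_loop(values):
    """Baseline: visit one row at a time; empty rows give None."""
    out = []
    for row in values:
        out.append(max(row) if row else None)
    return out


loop_result = row_max_loop(rows)
print(loop_result)

# Optional: the equivalent Awkward operation, if the library is installed.
try:
    import awkward as ak
except ImportError:
    ak = None

if ak is not None:
    arr = ak.Array(rows)
    # One call reduces every row; empty rows become None (missing values).
    print(ak.max(arr, axis=1).to_list())
```

The loop runs Python bytecode for every row, while the Awkward call hands the whole array to compiled code at once.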
The core purpose of developing dask-awkward and awkward-pandas is to
make nested, JSON-like data a first-class type in the PyData ecosystem
that can be analyzed efficiently and at scale.
Previous knowledge expected
I'm a software engineer at Anaconda, where I contribute to the open source PyData/Scientific Python ecosystem. I primarily work in the Dask community. I decided to be a software engineer after training to be a particle physicist.