The Beauty of Zarr PyData Global 2022

The Beauty of Zarr

12-01, 12:30–13:00 (UTC), Talk Track I

In this talk, I’ll introduce Zarr, an open-source data format for storing chunked, compressed N-dimensional arrays. The talk takes a systematic approach to understanding and using Zarr: how it works, why you might need it, and a hands-on session at the end. Zarr is based on an open technical specification, which makes implementations in several languages possible. I’ll focus on Zarr’s Python implementation and show how it interoperates with the existing libraries in the PyData stack.

Zarr is a data format for storing chunked, compressed N-dimensional arrays. It is based on an open technical specification and has implementations in several languages, of which Python is the most widely used. Zarr is a NumFOCUS sponsored project.

Outline:

First, I’ll talk about:

What’s, Why’s, and How’s of Zarr (15 mins.)

How does Zarr work?
- Talking about the motivation and functionality of Zarr
What’s the need for using Zarr?
- When, where and why to use it?
Pluggable compressors and file storage
- Talking about the compressors and file-storage systems available in Zarr
Managing (selecting, resizing, writing, reading) chunked arrays using Zarr functions
- Using built-in functions to manage compressed chunks
How is Zarr different from other storage formats?
- Talking briefly about the technical specification, which allows Zarr to have implementations in several languages
- Pros and cons when compared to other storage formats
Zarr community
- What is the Zarr community, and how do we do things?

Then, I’ll run a hands-on session, which will cover:

Hands-on (10 mins.)

Creating and using Zarr arrays
- Using built-in functions to create Zarr arrays and to read and write data
Looking under the hood
- Using store functions to explain how your Zarr data is stored
Consolidating metadata
- Consolidating the metadata for an entire group into a single object
Writing and reading from Cloud object storage
- Using S3/GCS/Azure to create Zarr arrays and write data to them
Showing how Zarr interoperates with the PyData stack
- How Zarr interoperates with the PyData stack (NumPy, Dask and Xarray) and how you can write data to your Zarr chunks in parallel at high speed using Dask

I’ll close the talk with:

Conclusion (5 mins.)

Key takeaway
How can you contribute to Zarr?
Q&A

This talk is aimed at people who work with large amounts of data and are looking for a data format that is transparent, easy to use and friendly to the environment. Zarr is already used in the bioimaging, geospatial and research communities. So, if you’re from a community or an organisation dealing with high-volume data, Zarr could be your one-stop solution. Anyone who is curious about Zarr and wants to learn how to use it is also most welcome.

The talk will be informative in tone and includes a hands-on session. I’m also happy to adjust the style to suit the audience in the room.

Attendees should have intermediate knowledge of Python and NumPy arrays.

After this talk, you’ll be able to:

Describe basic use cases for Zarr and how to use it
Understand the basics of data storage in Zarr
Understand the basics of compressors and file-storage systems in Zarr
Make a better-informed decision about which data format to use for your data

Prior Knowledge Expected – Previous knowledge expected

Sanket Verma