12-01, 12:30–13:00 (UTC), Talk Track I
In this talk, I’ll introduce Zarr, an open-source data format for storing chunked, compressed N-dimensional arrays. The talk presents a systematic approach to understanding and using Zarr: how it works, why you’d use it, and a hands-on session at the end. Zarr is based on an open technical specification, making implementations across several languages possible. I’ll focus on Zarr’s Python implementation and show how it interoperates beautifully with the existing libraries in the PyData stack.
Zarr is a data format for storing chunked, compressed N-dimensional arrays. It is based on an open technical specification and has implementations in several languages, with Python being the most widely used. Zarr is a NumFOCUS-sponsored project.
First, I’ll talk about:
What’s, Why’s, and How’s of Zarr (15 mins.)
- How does Zarr work?
- Talking about the motivation and functionality of Zarr
- What’s the need for using Zarr?
- When, where and why to use it?
- Pluggable compressors and file-storage
- Talking about several compressors and file-storage systems available in Zarr
- Managing (selecting, resizing, reading, writing) chunked arrays using Zarr functions
- Using inbuilt functions to manage compressed chunks
- How is Zarr different from other storage formats?
- Talking briefly about technical specification, which allows Zarr to have implementations in several languages
- Pros and cons when compared to other storage formats
- Zarr community
- What is the Zarr community, and how do we do things?
Then, I’ll do a hands-on session covering:
Hands-on (10 mins.)
- Creating and using Zarr arrays
- Using inbuilt functions to create Zarr arrays and reading and writing data to it
- Looking under the hood
- Using store functions to show how your Zarr data is stored
- Consolidating metadata
- Consolidating the metadata for an entire group into a single object
- Writing and reading from Cloud object storage
- Using S3/GCS/Azure to create Zarr arrays and write data to them
- Showing how Zarr interoperates with the PyData stack
- How Zarr interoperates with the PyData stack (NumPy, Dask and Xarray), and how you can write data to your Zarr chunks in parallel at high speed using Dask
I’ll close the talk with:
- Key takeaways
- How you can contribute to Zarr
This talk is aimed at an audience that works with large amounts of data and is searching for a data format that is transparent, easy to use and fits well into their environment. Zarr is widely used in the bioimaging, geospatial and research communities, so if you’re part of a community or organisation dealing with high-volume data, Zarr may be your one-stop solution. Anyone who is simply curious and wants to learn about Zarr and how to use it is also most welcome.
The tone of the talk is informative, with a hands-on session. I’m happy to adjust the style to suit the audience in the room.
Attendees should have intermediate knowledge of Python and NumPy arrays.
After this talk, you will:
- Know the basic use cases for Zarr and how to use it
- Understand the basics of data storage in Zarr
- Understand the basics of compressors and file-storage systems in Zarr
- Be able to make a better-informed decision on which data format to use for your data
Sanket is a data scientist based in New Delhi, India. He likes to build data science tools and products and has worked with startups, government bodies and other organisations. He loves building communities and bringing everyone together, and is Chair of PyData Delhi and PyData Global. Currently, he looks after the community and open-source work at Zarr as its Community Manager.
When he’s not working, he likes to play the violin and computer games and sometimes thinks of saving the world!