PyData Global 2022

Daft: the Distributed Python Dataframe for "Complex Data" (images, video, documents and more)
12-02, 19:30–20:00 (UTC), Talk Track II

Daft is an open-source distributed dataframe library built for "Complex Data": data that doesn't usually fit in a SQL table, such as images, videos and documents.

Experiment Locally, Scale Up in the Cloud

Daft grows with you and is built to run just as efficiently and seamlessly in a notebook on your laptop as on a Ray cluster consisting of thousands of machines with GPUs.

Pythonic

Daft lets you have tables of any Python object such as images/audio/documents/genomic files. This makes it really easy to process your Complex Data alongside all your regular tabular data. Daft is dynamically typed and built for fast iteration, experimentation and productionization.

Blazing Fast

Daft is built for distributed computing and fully utilizes all of your machine's or cluster's resources. It uses modern technologies such as Apache Arrow, Parquet and Iceberg to optimize data serialization and transport.


Daft (https://www.getdaft.io) is an open-source dataframe framework:

  1. Pythonic and built for "Complex Data" such as images, video and unstructured documents. Columns of the dataframe can be of any Python type, such as NumPy vectors, PIL Images or any user-defined type! Daft exposes an easy functional interface for loading, querying and processing this data.

  2. Built for both interactive experimentation and distributed computing. Daft is built for a smooth local development experience in a REPL/notebook environment with a dynamic type system and intelligent caching. When running large workloads that require more computing power, it scales up seamlessly to thousands of machines on a cluster using Ray.

  3. Built for Machine Learning workloads - Daft is perfect for performing data curation for ML training, or scaling up large-scale ML inference. It integrates natively with the Ray and PyTorch ecosystems, efficiently transporting your data into ML training jobs as training input.
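
To make the idea in point 1 concrete, here is a minimal sketch in plain Python of a table whose columns hold arbitrary Python objects and are processed through a functional interface. It does not use Daft and is not Daft's API: `ToyTable`, `Document`, `with_column` and `where` are hypothetical names invented for this illustration, and everything runs eagerly on a single machine.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Document:
    """Hypothetical user-defined type standing in for a piece of "Complex Data"."""
    title: str
    body: str


class ToyTable:
    """Illustrative column-oriented table; cells may hold any Python object.

    This is NOT Daft's API, only a single-machine sketch of the concept.
    """

    def __init__(self, columns: Dict[str, List[Any]]):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = dict(columns)

    def with_column(self, name: str, source: str, fn: Callable[[Any], Any]) -> "ToyTable":
        """Return a new table with column `name` = fn(cell) for each cell of `source`."""
        new_columns = dict(self.columns)
        new_columns[name] = [fn(value) for value in self.columns[source]]
        return ToyTable(new_columns)

    def where(self, source: str, predicate: Callable[[Any], bool]) -> "ToyTable":
        """Return a new table keeping only rows where predicate(cell of `source`) is true."""
        keep = [bool(predicate(value)) for value in self.columns[source]]
        return ToyTable({
            name: [value for value, kept in zip(values, keep) if kept]
            for name, values in self.columns.items()
        })


# A table mixing a plain integer column with a column of user-defined objects.
docs = ToyTable({
    "id": [1, 2, 3],
    "doc": [
        Document("a", "hello world"),
        Document("b", ""),
        Document("c", "one two three"),
    ],
})

# Derive a regular tabular column from the object column, then filter on it.
result = (
    docs
    .with_column("n_words", "doc", lambda d: len(d.body.split()))
    .where("n_words", lambda n: n > 0)
)
```

Each operation returns a new table rather than mutating the old one, which is the property that lets a real engine defer, cache or distribute the work; this toy version does none of that.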


Prior Knowledge Expected

No previous knowledge expected

Jay is a co-founder of Eventual and a primary contributor to the Daft open-source project. Prior to Eventual, he was a software engineer building large-scale ML data systems for computational biology at Freenome and self-driving cars at Lyft. He hails from the sunny island nation of Singapore, and used to command a platoon of tanks in the Singapore military.

Sammy Sidhu is co-founder and CEO of Eventual. Sammy's background is in High Performance Computing (HPC) and Deep Learning, and he has over a dozen patents/publications in the space. In the past, he has worked on high-frequency trading on Wall Street, medical AI research at Berkeley and self-driving cars at both DeepScale (acquired by Tesla) and Lyft Level 5 (acquired by Toyota). Native to the Bay Area, Sammy graduated from UC Berkeley with a degree in Electrical Engineering and Computer Science.