PyData Global 2022

How to Eliminate the I/O Bottleneck and Continuously Feed the GPU While Training in the Cloud
12-02, 17:30–18:00 (UTC), Talk Track I

Model training is a time-consuming, data-intensive, and resource-hungry phase in machine learning, making heavy use of storage, CPUs, and GPUs. The data access pattern in training requires frequent I/O over a massive number of small files, such as images and audio files. With the advancement of distributed training in the cloud, it is challenging to sustain enough I/O throughput to keep expensive GPUs highly utilized rather than waiting on data. The unique data access patterns and I/O challenges of model training, compared to traditional data analytics, necessitate a change in the architecture of your data platform.


In this talk, Lu Qiu will introduce a new architecture that optimizes I/O across the entire data pipeline and maintains the throughput the GPU requires. She will also share how to implement this architecture on Kubernetes for PyTorch workloads in the public cloud.
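The core idea behind keeping the GPU fed is to overlap data loading with computation so the training loop never blocks on storage. A minimal, stdlib-only sketch of that pattern is below; `prefetch` and `read_samples` are hypothetical names for illustration, and a real PyTorch setup would instead rely on `torch.utils.data.DataLoader` with multiple worker processes and prefetching.

```python
import queue
import threading
from typing import Iterable, Iterator

def prefetch(loader: Iterable, buffer_size: int = 4) -> Iterator:
    """Run an I/O-bound loader in a background thread, buffering items
    so the consumer (the training loop) rarely waits on data access.
    Illustrative sketch only, not Alluxio's or PyTorch's implementation."""
    buf: queue.Queue = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in loader:
            buf.put(item)  # blocks when the buffer is full (backpressure)
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while (item := buf.get()) is not done:
        yield item

def read_samples(n):
    # Stands in for reading many small files (images, audio clips).
    for i in range(n):
        yield f"sample-{i}"

# The training loop consumes samples while the next ones load in the background.
batches = list(prefetch(read_samples(5)))
print(batches)
```

The bounded queue provides backpressure: the loader reads ahead only as far as `buffer_size`, trading a little memory for hiding storage latency behind GPU compute.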


Prior Knowledge Expected: Previous knowledge expected

Lu Qiu is a machine learning engineer at Alluxio and a PMC member and maintainer of the open source project Alluxio. Lu develops big data solutions for AI/ML training. Before that, Lu was responsible for core Alluxio components including leader election, journal management, and metrics management. Lu received an M.S. degree in Data Science from George Washington University.