PyData Global 2022

Supercharge your training on TPUs
12-01, 11:00–11:30 (UTC), Talk Track I

This session discusses scaling your PyTorch models on TPUs, starting with an overview of ML accelerators and distributed training strategies. We'll cover training on TPUs from beginning to end: setting them up, the TPU architecture, frequently faced issues, and debugging techniques. You'll learn what it's like to work with the PyTorch XLA library and explore best practices for getting started with training large-scale models on TPUs.


Outline of the Session

Part 1: Overview of ML Accelerators

An accelerator is the hardware used for the training and inference of machine learning models. We will briefly cover several ML accelerators, including CPUs, GPUs, TPUs, IPUs, and HPUs.

Part 2: Fundamentals of Distributed Training

We will discuss the core principles of distributed training in machine learning, why we need it, and why it's so complicated. Then we'll go over two fundamental approaches in depth: data parallelism and model parallelism.
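
To make data parallelism concrete, here is a minimal sketch using PyTorch's DistributedDataParallel: each process holds a full replica of the model, trains on its own shard of the data, and gradients are averaged across replicas during the backward pass. The model, data, address, and port below are placeholders for illustration, not part of the session materials.

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def train(rank, world_size):
        # Rendezvous settings for the process group (placeholder address/port).
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        # "gloo" keeps this runnable on CPU-only machines; use "nccl" on GPUs.
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        model = DDP(nn.Linear(16, 2))  # each rank holds a full model replica
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

        for _ in range(5):
            x = torch.randn(8, 16)     # stand-in for this rank's data shard
            loss = model(x).sum()
            optimizer.zero_grad()
            loss.backward()            # DDP all-reduces gradients across ranks here
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = 2                 # two worker processes on one machine
        mp.spawn(train, args=(world_size,), nprocs=world_size)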

Part 3: TPU Accelerator & PyTorch XLA at Scale

We will go over the TPU accelerator in depth: its architecture, how to set it up, and how to get started with training large-scale models on TPUs to speed up your machine learning workloads. Along the way, we'll walk through frequently faced issues and debugging techniques.
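
As a taste of the hands-on portion, here is a minimal sketch of a single-core training step with PyTorch XLA. It assumes torch_xla is installed on a TPU host (for example, a Cloud TPU VM); the model and batches are dummy placeholders.

    import torch
    import torch.nn as nn
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()                  # acquire the TPU device
    model = nn.Linear(128, 10).to(device)     # toy model, moved onto the TPU
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(10):
        data = torch.randn(32, 128, device=device)           # dummy batch
        target = torch.randint(0, 10, (32,), device=device)  # dummy labels
        optimizer.zero_grad()
        loss = loss_fn(model(data), target)
        loss.backward()
        # optimizer_step also syncs gradients across cores in multi-core setups;
        # barrier=True flushes the lazily built XLA graph and runs it on the TPU.
        xm.optimizer_step(optimizer, barrier=True)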

Who is it aimed at?

Data scientists and ML engineers who want to use distributed training for their models on TPUs, whether or not they have used PyTorch XLA before.

What will the audience learn by attending the session?

Learn about ML Accelerators
Get an overview of distributed training and the TPU Accelerator
Train a model with PyTorch XLA on TPUs with ease

Background Knowledge:

Some familiarity with Python, deep learning terminology, and the basics of neural networks


Prior Knowledge Expected

No previous knowledge expected

Kaushik Bokka is a Senior Research Engineer at Lightning AI and one of the core maintainers of the PyTorch Lightning library. He has prior experience building production-scale Machine Learning and Computer Vision systems for several products, ranging from Video Analytics to Fashion AI workflows. He has also contributed to a few other open-source projects and aims to empower people and organizations in the way they build AI applications.