PyData Global 2022

Building Large-scale, Localized Language Models: From Data Preparation to Training and Deployment to Production
12-02, 10:00–10:30 (UTC), Talk Track II

Recent advances in natural language processing demonstrate the capability of large-scale language models (such as GPT-3) to solve a variety of NLP problems in a zero-shot setting, shifting the paradigm from supervised fine-tuning to prompt engineering/tuning.
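As a rough illustration of this shift (not code from the talk): with a pretrained model, a classification task can be solved with no task-specific training at all, purely by describing the candidate labels at inference time. The minimal sketch below assumes the Hugging Face `transformers` library; the model name and labels are illustrative.

```python
# Zero-shot classification: the task is specified entirely through
# candidate labels at inference time, with no supervised fine-tuning.
from transformers import pipeline

# Illustrative model choice; any NLI-style zero-shot model would do.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier(
    "Recent advances in natural language processing are impressive.",
    candidate_labels=["technology", "sports", "politics"],
)
print(result["labels"][0])  # highest-scoring label, e.g. "technology"
```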

However, building large language models raises data preparation, training, and deployment challenges. In addition, while the process is well established for a few dominant languages, such as English, its execution in localized languages remains limited. We'll give an overview of the end-to-end process for building large-scale language models, discuss the challenges of scaling, and describe some existing solutions for efficient data preparation, distributed training, model optimization, and distributed deployment. We'll illustrate with examples in localized languages such as French and Spanish, using NVIDIA NeMo Megatron, a framework for training large NLP models optimized for DGX SuperPOD hardware infrastructure.
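To make the scaling challenge concrete: even the simplest form of distributed training, data parallelism, already changes how a training script is written and launched. The sketch below is a generic PyTorch DistributedDataParallel loop, included purely for illustration; it is not NeMo Megatron's API, which layers tensor and pipeline parallelism on top of this and drives everything through its own configuration system. Launch with `torchrun --nproc_per_node=<num_gpus> train.py`.

```python
# Minimal data-parallel training loop with PyTorch DDP.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model standing in for a large transformer language model.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(8, 1024, device=local_rank)  # dummy batch
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()   # gradients are all-reduced across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```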

Prior Knowledge Expected

No previous knowledge expected

Miguel Martínez is a senior deep learning data scientist at NVIDIA, concentrating on recommender systems, NLP, and data analytics. Previously, he mentored students in Udacity's Artificial Intelligence Nanodegree. He has a strong background in financial services, mainly focused on payments and channels. As a constant and steadfast learner, Miguel is always up for new challenges.


Meriem is a senior deep learning data scientist at NVIDIA, supporting partners delivering AI/deep learning solutions. Her area of expertise is large-scale natural language processing and conversational AI. Meriem holds a Ph.D. in signal and image processing from Télécom ParisTech, where she studied machine learning applied to audio-visual content.