0.43
cfp
PyData Global 2022
2022-12-01
2022-12-03
3
00:05
https://global2022.pydata.org/cfp/schedule/
UTC
2022-12-01T08:00:00+00:00
08:00
00:30
Talk Track I
cfp-224-generate-actionable-counterfactuals-using-multi-objective-particle-swarm-optimization
https://global2022.pydata.org//cfp/talk/7BCESD/
false
Generate Actionable Counterfactuals using Multi-objective Particle Swarm Optimization
Talk
en
Counterfactual explanations (CFE) are methods that
explain a machine learning model by giving an alternate class prediction
of a data point with some minimal changes in its features.
In this talk, we describe a counterfactual (CF)
generation method based on particle swarm optimization (PSO) and how it provides
greater control over the proximity and sparsity properties of the generated CFs.
Counterfactual explanations (CFE) are methods that
explain a machine learning model by giving an alternate class prediction
of a data point with some minimal changes in its features.
It helps users identify the data attributes that caused
an undesirable prediction, such as a loan or credit card rejection.
We describe an efficient and actionable counterfactual (CF)
generation method based on particle swarm optimization (PSO).
We describe a simple objective function for the optimization of
the instance-centric CF generation problem. PSO brings a lot
of flexibility: it carries out multi-objective optimization
in large dimensions, supports generation of multiple CFs, and
allows box constraints or immutability of data attributes, thereby
enabling greater control over the proximity and sparsity properties
of the generated CFs.
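To make the approach concrete, here is a minimal, self-contained sketch of PSO-based counterfactual search (numpy only; the linear "model", the margin, and the objective weights below are illustrative assumptions, not the speakers' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": classifies x as positive when w.x + b > 0.
w, b = np.array([1.0, 2.0]), -1.5
predict = lambda X: (X @ w + b > 0).astype(int)

def objective(X, x0):
    # Validity: push the decision score past a small margin (0 once the class flips);
    # proximity: L1 distance to the original point x0.
    validity = np.maximum(0.0, 0.2 - (X @ w + b))
    proximity = np.abs(X - x0).sum(axis=1)
    return validity + 0.5 * proximity

def pso_counterfactual(x0, n_particles=30, n_iter=200):
    # Standard PSO loop: each particle tracks its personal best; the swarm
    # tracks a global best that all particles are attracted toward.
    pos = x0 + rng.normal(0, 1, size=(n_particles, x0.size))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = objective(pos, x0)
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, 1))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        val = objective(pos, x0)
        improved = val < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], val[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest

x0 = np.array([0.0, 0.0])   # originally classified as the negative class
cf = pso_counterfactual(x0)
print(predict(x0[None])[0], predict(cf[None])[0], cf)
```

Box constraints or immutable attributes, as mentioned above, could be sketched by clipping `pos` to allowed ranges or freezing the corresponding coordinates after each velocity update.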
#### Keywords
- machine learning, counterfactual explanation, PSO, explainability
#### Outline
- Introduction to Counterfactual Analysis (CFA) and challenges
- Introduction to Particle Swarm Optimization (PSO)
- Describe best practices for generating Counterfactuals and the proposed method
- Demo using Jupyter notebook
- Questions and answers
#### Timeline
This talk will be delivered in 3 parts -
1. Introduce the Background (0:00-0:05)
- Introduce CFA and PSO
- Current challenges
2. Describe the method proposed and best practices (0:05 - 0:20)
- Best practices
- Theoretical explanation of the proposed method
3. Demo on Jupyter notebook (0:20 - 0:30)
#### Key Takeaways
- Understanding how counterfactual analysis can be done using PSO.
- How to leverage multi-objective optimization to control proximity, validity, sparsity and diversity of counterfactuals generated.
- Application in real-world use cases.
Niranjan G S, SHASHANK SHEKHAR
2022-12-01T08:30:00+00:00
08:30
00:30
Talk Track I
cfp-222-measurement-of-trust-in-ai
https://global2022.pydata.org//cfp/talk/RR3A9Y/
false
Measurement of Trust in AI
Talk
en
For enterprises to adopt and embrace AI into their transformational journey, it is imperative to build Trustworthy AI – so that AI products and solutions that are built, delivered, and acquired are responsible enough to drive trust and wider adoption. We look at AI Trust as a function of 4 key constructs: Reliability, Safety, Transparency, and Responsibility & Accountability. These core constructs are pillars of driving AI trust in our products and solutions. In this talk, I will explain how to enable each core construct and will articulate how they can be measured in some real-world use cases.
After several iterations, rigorous research and based on our collective experiences – we established 4 pillars of trust:
• Any AI system should be strong on performance expectations on a given use case
• Any decisions from the model are explainable and traceable
• Any biased decisions towards a specific group of population need to be safeguarded
• Lastly, the model needs to be secured from any malicious or non-malicious attacks
These pillars of AI trust are further broken down into several dimensions and metrics that are quantifiable.
This talk will be delivered in three parts.
Part 1: Need for Trust in AI with real-life examples – 5 minutes
Part 2: Construct, dimensions, and influencing factors of trust in AI – 10 minutes
Part 3: Framework for AI trust calculator leveraging real-life use-cases – 15 minutes
/media/cfp/submissions/RR3A9Y/AI_Trust_JBYPgjN.JPG
SHASHANK SHEKHAR
2022-12-01T09:00:00+00:00
09:00
00:30
Talk Track I
cfp-69-expressive-and-fast-dataframes-in-python-with-polars
https://global2022.pydata.org//cfp/talk/TWB8CA/
false
Expressive and fast dataframes in Python with polars
Talk
en
The pandas library is one of the key factors that enabled the growth of Python in the Data Science industry and continues to help data scientists thrive almost 15 years after its creation. Because of this success, nowadays several open-source projects claim to improve pandas in various ways, either by bringing it to a distributed computing setting (Dask), accelerating its performance with minimal changes (Modin), or offering a slightly different API that solves some of its shortcomings (Polars).
In this talk we will dive into Polars, a new dataframe library backed by Arrow and Rust that offers an expressive API for dataframe manipulation with excellent performance.
If you are a seasoned pandas user willing to explore alternatives, or a beginner user wondering what all the fuss about these new dataframe libraries is, this talk is for you!
The outline of the talk goes as follows:
1. We will make a very brief introduction to pandas, talk about its importance, and point out some of its shortcomings ([as its own creator did half a decade ago](https://wesmckinney.com/blog/apache-arrow-pandas-internals/)) (10 minutes)
2. We will enumerate some of the current pandas alternatives and classify them (pandas-like vs bespoke, single-node vs distributed) (5 minutes)
3. We will do a live demo of how to analyze and manipulate a relatively big dataset using Polars inside [Orchest Cloud](https://www.orchest.io/) and showcase some of its unique capabilities (10 minutes).
4. Recommendations and conclusions (5 minutes).
After the talk, you will have more information on how some of the modern alternatives to pandas fit into the ecosystem, and will understand why Polars is so exciting and promising. Prior exposure to data manipulation with Python (not necessarily with pandas) will help make the most of the presentation.
The talk will build upon this blog post about [Polars](https://www.orchest.io/blog/the-great-python-dataframe-showdown-part-3-lightning-fast-queries-with-polars).
Juan Luis Cano Rodríguez
2022-12-01T10:00:00+00:00
10:00
00:30
Talk Track I
cfp-47-inequality-joins-in-pandas-with-pyjanitor
https://global2022.pydata.org//cfp/talk/ZNMNCX/
false
Inequality Joins in Pandas with Pyjanitor
Talk
en
Inequality joins are less frequent than equality joins, but are useful in temporal analytics and even in some conventional applications. Pyjanitor fills this gap in Pandas with an efficient implementation
Imagine a manufacturer wishing to minimise the cost of storage while maximising profits (increasing the inventory of the more profitable product, while decreasing the storage for the less profitable product), or a tax audit to find out which employers earn more, but pay less tax. It could be as simple as efficiently finding which date ranges the dates from another dataframe fall into. These are problems that can be solved by inequality joins. At the moment, the way to solve this in Pandas is with a cartesian join, which can be expensive memory-wise and generally inefficient. This talk aims to show a better, more efficient way of solving inequality joins within Pandas.
The talk will contain a description of the algorithms implemented, as well as some speed tests with regards to performance.
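For context, the naive approach the talk improves on can be sketched in plain pandas (the data here is made up; pyjanitor's `conditional_join` is the more efficient alternative the talk presents):

```python
import pandas as pd

# Events to be matched against the date range each one falls into.
events = pd.DataFrame({
    "event": ["a", "b", "c"],
    "date": pd.to_datetime(["2022-01-05", "2022-02-10", "2022-03-15"]),
})
periods = pd.DataFrame({
    "period": ["Q1-early", "Q1-late"],
    "start": pd.to_datetime(["2022-01-01", "2022-02-01"]),
    "end":   pd.to_datetime(["2022-01-31", "2022-03-31"]),
})

# Naive inequality join: build the full cartesian product, then filter on the
# range condition.  Memory grows as len(events) * len(periods).
cross = events.merge(periods, how="cross")
joined = cross[(cross["start"] <= cross["date"]) & (cross["date"] <= cross["end"])]
print(joined[["event", "period"]])
```

An efficient implementation avoids materialising the cartesian product by sorting and binary-searching the join keys, which is the kind of algorithm the talk describes.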
samuel oranyeli
2022-12-01T10:30:00+00:00
10:30
00:30
Talk Track I
cfp-76-data-centric-ai-cookbook-let-s-prep-that-data
https://global2022.pydata.org//cfp/talk/YBVUM7/
false
Data-Centric AI Cookbook: let's prep that data
Talk
en
Data Centric AI is about iterating on *data* instead of models to improve machine learning predictions. Why is this trend relevant *now*? Is this yet another hype in data science? Or has something really changed? And most of all -- how is this relevant to *you*?
Data Centric AI is about iterating on *data* instead of models to improve machine learning predictions. Why is this trend relevant *now*? Is this yet another hype in data science? Or has something really changed? And most of all -- how is this relevant to *you*?
In this talk, we'll cover the benefits of a data-centric AI approach – *spoilers:* increased performance is one of them! –, and cover practical tips on how you can integrate data-centric principles in your daily work.
Andrew Ng once said *"Data is the food for AI"* when talking about data-centric AI. If that's the case, this talk will provide you with the recipes.
Marysia Winkels
2022-12-01T11:00:00+00:00
11:00
00:30
Talk Track I
cfp-150-supercharge-your-training-on-tpus
https://global2022.pydata.org//cfp/talk/CSJBCY/
false
Supercharge your training on TPUs
Talk
en
This session will discuss scaling your PyTorch models on TPUs. We’ll also cover an overview of ML accelerators and distributed training strategies. We’ll cover training on TPUs from beginning to end, including setting them up, TPU architecture, frequently faced issues, and debugging techniques. You’ll learn about the experience of using the PyTorch XLA library and explore best practices for getting started with training large-scale models on TPUs.
**Outline of the Session**
**Part 1: Overview of ML Accelerators**
Accelerator refers to the hardware being used for the training and inference of machine learning models. We will cover several ML accelerators, such as CPUs, GPUs, TPUs, IPUs, and HPUs, in brief.
**Part 2: Fundamentals of Distributed Training**
We will discuss the core principles of distributed training in machine learning. We'll also talk about why we need it and why it's so complicated. Then, we'll go over two fundamental approaches to distributed training in depth: Data and Model Parallelism.
**Part 3: TPU Accelerator & PyTorch XLA at Scale**
We will go over the TPU Accelerator in depth by discussing its architecture, setting it up, and getting started with training large-scale models on TPUs to speed up your machine learning workloads. We’ll cover training on TPUs from beginning to end, including setting them up, TPU architecture, frequently faced issues, and debugging techniques.
**Who is it aimed at?**
Data scientists and ML engineers, who may or may not have used PyTorch XLA in the past and wish to use distributed training for their models on TPUs.
**What will the audience learn by attending the session?**
- Learn about ML Accelerators
- Get an overview of distributed training and the TPU Accelerator
- Train a model with PyTorch XLA on TPUs with ease
**Background Knowledge:**
Some familiarity with Python, deep learning terminology, and the basics of neural networks
Kaushik Bokka
2022-12-01T11:30:00+00:00
11:30
00:30
Talk Track I
cfp-118-knowing-what-you-don-t-know-matters-uncertainty-aware-model-rating
https://global2022.pydata.org//cfp/talk/3HJRZM/
false
Knowing what you don’t know matters: Uncertainty-aware model rating
Talk
en
Meaningful probabilistic models do not only produce a “best guess” for the target, but also convey their uncertainty, i.e., a belief in how the target is distributed around the predicted estimate. Business evaluation metrics such as mean absolute error, a priori, neglect that unavoidable uncertainty. This talk discusses why and how to account for uncertainty when evaluating models using traditional business metrics, using python standard tooling. The resulting uncertainty-aware model rating satisfies the requirements of statisticians because it accounts for the probabilistic process that generates the target. It should please practitioners because it is based on established business metrics. It appeases executives because it allows concrete quantitative goals and non-defensive judgements.
This talk will equip you - a Data Scientist or a person working with Data Scientists - with the background and tooling necessary to rate models in an uncertainty-aware fashion. You’ll learn to establish “best-case” and “worst-case”-benchmarks and judge your models against these. This will help you answering the question “how good is the model, and how good could it become?” in a non-defensive way, beyond merely computing standard evaluation metrics.
Why is uncertainty-awareness important? Well, would you, as a Data Scientist, accept to be evaluated on reducing mean absolute error of some model by 50%? Probably not - but would you feel comfortable explaining why? This talk will establish why such generic and ad hoc goal setting is not meaningful, why model judgement is harder than expected, and how it can still be done reliably and without too many technicalities. We will exemplify uncertainty-aware model rating using the M5 competition data (Walmart sales numbers). Some immediate interpretations (“model A is clearly better than model B”) will turn out to be flawed upon closer inspection, and we’ll see how to correct them, using standard python libraries (numpy, scipy, pandas).
Takeaways:
- We are typically too self-confident in our skills when it comes to judging models. Instead of jumping to immediate conclusions (“that 80%-error model is bad!”), we should take a step back, build a “reasonable best case”, and benchmark the candidate against that.
- Accepting and dealing with uncertainty is strength, neglecting it is madness. Uncertainty-aware model rating allows us to make reliable statements about “how good the model really is”, without “it depends” and “buts”.
- Requirements from statisticians and business stakeholders can be reconciled by taking both of them seriously. Standard python tooling suffices to improve our modelling of what we know that we don't know.
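A minimal sketch of such a "best-case" benchmark, assuming (purely for illustration) that the target is Poisson-distributed:

```python
import numpy as np

rng = np.random.default_rng(42)

# Suppose daily sales are truly Poisson with mean 4.  Even a perfect model
# that knows this distribution cannot drive MAE to zero: the target itself
# is random, so some error is irreducible.
true_mean = 4.0
sales = rng.poisson(true_mean, size=100_000)

# Best-case benchmark: the MAE-minimizing point forecast is the median
# of the target distribution.
ideal_forecast = np.median(sales)
best_case_mae = np.abs(sales - ideal_forecast).mean()

# A naive benchmark for comparison: always predict yesterday's value.
naive_mae = np.abs(np.diff(sales.astype(float))).mean()

print(f"irreducible MAE ~ {best_case_mae:.2f}, naive MAE ~ {naive_mae:.2f}")
```

A candidate model's MAE can then be judged against this interval: close to the irreducible benchmark means little room for improvement, regardless of how large the raw number looks.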
Malte Tichy
2022-12-01T12:00:00+00:00
12:00
00:30
Talk Track I
cfp-249-teaching-papermill-new-tricks-creating-custom-engines-for-flexible-notebook-execution
https://global2022.pydata.org//cfp/talk/VJ3T8A/
false
Teaching papermill new tricks: creating custom engines for flexible notebook execution
Talk
en
This talk will show you how to build papermill plugins. As motivating examples, we'll describe how to customize papermill for notebook debugging and profiling.
Papermill is a widely used library for executing notebooks programmatically; however, most users are unaware of its plugin mechanism, which allows us to customize notebook execution.
This talk will explain how we can create papermill plugins; furthermore, as motivating examples, we'll show how to implement two plugins: one for notebook profiling and another for notebook debugging. By the end of the talk, attendees will be able to implement papermill plugins. We'll also provide example code so they can build on top of the two example use cases.
Outline
[0 - 2 minute] Introduction to papermill
[2 - 6] papermill's plugin system
[6 - 10] Creating a new engine
[10 - 18] Use case: notebook profiling
[18 - 26] Use case: notebook debugging
[26 - 28] Summary and conclusions
[28 - 30] Q&A
Eduardo Blancas
2022-12-01T12:30:00+00:00
12:30
00:30
Talk Track I
cfp-100-the-beauty-of-zarr
https://global2022.pydata.org//cfp/talk/DQSXAX/
false
The Beauty of Zarr
Talk
en
In this talk, I’d be talking about Zarr, an open-source data format for storing chunked, compressed N-dimensional arrays. This talk presents a systematic approach to understanding and implementing Zarr by showing how it works, the need for using it, and a hands-on session at the end. Zarr is based on an open technical specification, making implementations across several languages possible. I’d be mainly talking about Zarr’s Python implementation and would show how it beautifully interoperates with the existing libraries in the PyData stack.
Zarr is a data format for storing chunked, compressed N-dimensional arrays. Zarr is based on an open technical specification and has implementations in several languages, with Python being the most used one. Zarr is a NumFOCUS-sponsored project.
### Outline:
First, I’d be talking about:
### What’s, Why’s, and How’s of Zarr (15 mins.)
- How does Zarr work?
- Talking about the motivation and functionality of Zarr
- What’s the need for using Zarr?
- When, where and why to use it?
- Pluggable compressors and file-storage
- Talking about several compressors and file-storage systems available in Zarr
- Managing (selection, resizing, writing, reading) chunked arrays using Zarr functions
- Using inbuilt functions to manage compressed chunks
- How is Zarr different from other storage formats?
- Talking briefly about technical specification, which allows Zarr to have implementations in several languages
- Pros and cons when compared to other storage formats
- Zarr community
- What is the Zarr community, and how do we do things?
Then, I’d be doing a hands-on session, which would cover:
### Hands-on (10 mins.)
- Creating and using Zarr arrays
- Using inbuilt functions to create Zarr arrays and reading and writing data to it
- Looking under the hood
- Use store functions to explain how your Zarr data is stored
- Consolidating metadata
- Consolidating the metadata for an entire group into a single object
- Writing and reading from Cloud object storage
- Using S3/GCS/Azure to create Zarr arrays and writing data to it
- Showing how Zarr interoperates with the PyData stack
- How Zarr interoperates with the PyData stack (NumPy, Dask and Xarray) and how you can write data to your Zarr chunks at incredibly high speed in parallel using Dask
I’d be closing the talk by:
### Conclusion (5 mins.)
- Key takeaway
- How you can contribute to Zarr
- QnA
This talk aims to address an audience that works with large amounts of data and is in search of a data format which is transparent, easy to use and friendly to the environment. Zarr is also widely used in the bioimaging, geospatial and research communities. So, Zarr is your one-stop solution if you’re from a community or an organisation dealing with high-volume data. Also, anyone who is curious and wants to learn about Zarr and how to use it is most welcome.
The tone of the talk is set to be informative, along with a hands-on session. Also, I’m happy to adjust the style according to the audience in the room.
Intermediate knowledge of Python and NumPy arrays is required for the attendees to attend this talk.
### After this talk, you’ll:
- Know basic use cases for Zarr and how to use it
- Understand the basics of data storage in Zarr
- Understand the basics of compressors and file-storage systems in Zarr
- Make a better-informed decision on which data format to use for your data
/media/cfp/submissions/DQSXAX/zarr_wz7zGTJ.png
Sanket Verma
2022-12-01T13:30:00+00:00
13:30
00:30
Talk Track I
cfp-78-data-science-project-patterns-that-work
https://global2022.pydata.org//cfp/talk/9GYEJB/
false
Data Science Project Patterns that Work
Talk
en
Getting your team to choose good projects, reliably derisk them, research ideas, productionise the solutions and create positive change in an organisation is hard. Really hard.
I'll present patterns that work for these 5 critical project stages. This guidance is based on 15 years of experience writing AI and DS solutions and 5 years giving both strategic guidance and training on how to get to success.
You'll come away from the session with new techniques to help your team deliver successfully and increase their confidence in the roadmap, new thoughts on how to diagnose your model's quality and new ideas to make a positive difference in your organisation.
Getting your team to choose good projects, reliably derisk them, research ideas, productionise the solutions and create positive change in an organisation is hard. Really hard.
I'll present patterns that work for these 5 critical project stages. This guidance is based on 15 years of experience writing AI and DS solutions and 5 years giving both strategic guidance and training on how to get to success.
You'll come away from the session with new techniques to help your team deliver successfully and increase their confidence in the roadmap, new thoughts on how to diagnose your model's quality and new ideas to make a positive difference in your organisation.
Ian will use case studies based on real client engagements (with positive and negative outcomes!) to illustrate the advice.
/media/cfp/submissions/9GYEJB/headsbodyhot_2017_big_nYnZrZw.jpg
Ian Ozsvald
2022-12-01T14:00:00+00:00
14:00
01:00
Talk Track I
cfp-291-keynote-ada-nduka-oyom
https://global2022.pydata.org//cfp/talk/Q8NDNK/
false
Keynote - Ada Nduka Oyom
Keynote
en
Ada is the Founder of She Code Africa (SCA).
Ada is the Founder of She Code Africa (SCA), a non-profit organisation focused on empowering young girls and women in Africa through technical skills, and Co-founder of Open Source Community Africa, one of the largest communities for open-source enthusiasts, advocates and experts across Africa. She’s currently engaged with Google as the Ecosystem community manager for Sub-Saharan Africa.
/media/cfp/submissions/Q8NDNK/Screen_Shot_2022-11-16_at_3.17.10_PM_1cPE7kI.png
Ada Nduka Oyom
2022-12-01T15:00:00+00:00
15:00
00:30
Talk Track I
cfp-284-algorithms-at-scale-raising-awareness-on-latent-inequities-in-our-data
https://global2022.pydata.org//cfp/talk/FRFVYQ/
false
Algorithms at Scale: Raising Awareness on Latent Inequities in Our Data
Talk
en
In today’s digital age, we use machine learning (ML) and artificial intelligence (AI) to solve problems and improve productivity and efficiency. Yet, there’s risk in delegating decision-making power to algorithmically based systems: their workings are often opaque, turning them into uninterpretable “black boxes.”
This risk is especially acute when algorithms are tasked with making life-changing decisions (e.g., legal, law enforcement, credit scoring, and risk assessment), and it can be difficult to know if an AI/ML-based decision was made in a fair manner, reflecting the values of society. To be clear, this “black box” nature of AI/ML doesn’t necessarily imply that these algorithms were designed with malicious intent. These possible negative outcomes simply arise from the power and complexity of AI/ML algorithms at scale, combined with potential inequities latent in the data used to train the model. The question remains, however: if we permit AI/ML to make life-altering decisions, what are the implications for our social, economic, technical, legal, and environmental systems?
Significant work has been done to try to solve this challenge, leading to development of over 160 “ethical AI principles,” with the goal of providing guidance to organizations to act responsibly to avoid causing societal harm. However, although the intentions of this work are good, this maelstrom of guidance, none of which is compulsory, can sometimes add confusion instead of clarity.
It’s important to think carefully about how we implement these algorithms, and how we delegate decisions and data usage, given the difficulty of enacting effective human oversight and governance over AI/ML-based decision making. This talk focuses on harmonizing and aligning approaches, illustrating the opportunities and threats of AI, while raising awareness of society’s responsibility to demystify governance complexity and to establish an equitable digital society.
Dr. Lalitha Krishnamoorthy
2022-12-01T16:00:00+00:00
16:00
00:30
Talk Track I
cfp-262-an-evolving-jupyter-notebook
https://global2022.pydata.org//cfp/talk/PB8EVG/
false
An Evolving Jupyter Notebook
Talk
en
The Jupyter ecosystem has been undergoing many changes in the past few years. While JupyterLab has been embraced by many, there are still many active users of Jupyter Notebook. With that in mind, Jupyter developers have been gearing up for the release of the updated Notebook 7 based on JupyterLab components as outlined in the [Jupyter Enhancement Proposal #79](https://jupyter.org/enhancement-proposals/79-notebook-v7/notebook-v7.html). With this, significant changes are coming to Notebook 6: the upcoming Notebook 6.5 is intended to be its final, end-of-life release, and users installing Notebook will soon receive a version of the project that may disrupt their workflows. In an effort to give users time to transition to the updated codebase, the NbClassic project has been introduced. NbClassic is the Jupyter Server extension implementation of the classic notebook. NbClassic has also become the owner of the static assets for the classic notebook, and Notebook 6.5 depends on NbClassic to provide those.
The aim of this talk is to:
1. Reflect on the changes to the Jupyter ecosystem with the introduction of NbClassic and Notebook 7.
2. Address some questions that may come up about NbClassic and Notebook 6.5, as well as some of those that may come up once Notebook 7 is released.
3. Showcase how easily users can switch between the different front-ends NbClassic, Notebook 7 and JupyterLab with a demo.
The demo will include a JupyterLab extension that allows users to utilize NbClassic to view their notebooks through the classic interface.
Rosio Reyes
2022-12-01T16:30:00+00:00
16:30
00:30
Talk Track I
cfp-124-pyscript-and-data-science-a-love-story
https://global2022.pydata.org//cfp/talk/QUUNUE/
false
PyScript and Data Science: a love story
Talk
en
PyScript has brought change to the Python and PyData ecosystem, making it much easier to execute Python in the browser and opening the road to possibilities that did not exist before. The talk will explore what has happened since we presented it and will talk about how PyScript can change the way we do Data Science and many other things.
It's been a few months since we first presented PyScript during PyCon 2022. It has driven a lot of enthusiasm in the community. PyScript can really change the way Data Science is delivered and democratized. By running Python directly in the browser, you can really get the experience of running Python anywhere, any time using the browser as a ubiquitous Virtual Machine.
Given the good support for the PyData stack, Data Science is the first and perhaps the most straightforward context in which PyScript can have its say. In fact, PyScript enables the creation of self-contained rich scientific apps, bringing the SciPy/PyData stack directly integrated with interactive data visualization frameworks, e.g. bokeh or panel. But there's more! PyScript also allows direct integration with JavaScript, allowing the development of full-fledged applications that can use the best tools of both stacks. But there's even more! The easy access and reduced application complexity can now really make a difference in bringing Python and Data Science to the masses.
In this talk, we're going to talk about PyScript and its latest developments, and go through many examples that highlight the possibilities of a future of Python in the browser.
Fabio Pliger
2022-12-01T17:00:00+00:00
17:00
01:00
Talk Track I
cfp-299-keynote-thomas-dohmke
https://global2022.pydata.org//cfp/talk/G3KMVW/
false
Keynote - Thomas Dohmke
Keynote
en
AI is the future of software development
We are entering a new era of software development where AI will inform every aspect of the developer experience, from writing code to submitting pull requests to completing complex algorithms to deploying code and ML models together. Join GitHub CEO Thomas Dohmke as he discusses the future of software development and what AI-powered innovations are on the horizon.
/media/cfp/submissions/G3KMVW/Screen_Shot_2022-11-16_at_3.17.19_PM_zOj25RZ.png
Thomas Dohmke
2022-12-01T18:00:00+00:00
18:00
00:30
Talk Track I
cfp-206-mischief-managed-what-hackers-can-do-on-your-jupyter-instance
https://global2022.pydata.org//cfp/talk/TGTEMT/
false
Mischief Managed: What hackers can do on your Jupyter instance
Talk
en
Many Python data professionals work daily in JupyterLab or Notebook instances. What can a hacker do with access to that system? In this presentation, I will introduce the threat model and show why Jupyter instances are valuable targets. Next, I will demonstrate several post-exploitation activities that someone may try to perform on systems hosting Jupyter instances. We will conclude with some defensive strategies to minimize the likelihood and impact of these activities. This talk will help data scientists and information technology professionals better understand the perspective of potential attackers operating in Jupyter environments to improve defensive awareness and behavior.
As an offensive security researcher for artificial intelligence systems at NVIDIA, I regularly conduct offensive “red team” operations against data scientists and machine learning researchers. JupyterLab is a great development environment that enables rapid prototyping and collaboration with access to shared resources. However, it is also a tool whose security context is often poorly understood and managed. Instead of focusing on the initial “hack” or exploitation, I will demonstrate the mischief and damage that an attacker can cause after initial access (the “post exploitation” phase). All of these demonstrations will focus on configurations and documented functionality. There will be no active exploitation of software vulnerabilities (nothing requiring responsible disclosure). This way, the presentation can use the “mischief” to propose defensive awareness strategies for attendees. Attendees should be familiar with Jupyter, Python, and client/server architecture, but will not need a security background. After the presentation, data scientists and information technology leaders should be more aware of the risks introduced by JupyterLab environments. The presentation is intended to increase security awareness and build good instincts, not to instill fear or motivate a shift away from JupyterLab.
Some examples of demonstrated post-exploitation activities are documented in my blog here: https://josephtlucas.github.io/blog/content/jupyter.html and in this tweet: https://twitter.com/josephtlucas/status/1570158892163956737.
Timeline:
Minute 0-1: Introduction
Minute 2-4: Introduce JupyterLab and explore the various configurations (local, remote server, cloud offerings, etc.).
Minute 5-6: Brief introduction to attack phases (define “post exploitation”). Since much of this presentation will be live demonstrations, orient the audience to the various terminal sessions (ex: attacker’s screen vs data scientist’s screen).
Minute 7-20: Demonstrations of various post-exploitation activities including: stealing user secrets, overwriting user variables, persistence mechanisms, tampering with history, poisoning imports, and others. Each demonstration will include references to relevant documentation. This functionality exists for benign purposes; attackers may just use it in unintended ways.
Minute 20-25: Present defensive recommendations and mitigation strategies.
Minute 25-30: Questions.
Joseph Lucas
2022-12-01T18:30:00+00:00
18:30
00:30
Talk Track I
cfp-283-text-to-data-make-your-code-malleable-not-brittle
https://global2022.pydata.org//cfp/talk/LUYPAE/
false
Text to Data: Make Your Code Malleable, Not Brittle
Talk
en
Extracting the highly valuable data from unstructured text often results in hard-to-read, brittle, difficult-to-maintain code. The problem is that using regular expressions directly embedded in the program control flow does not provide the best level of abstraction. We propose a query language (based on the tuple relational calculus) that facilitates data extraction. Developers can explicitly express their intent declaratively, making their code much easier to write, read, and maintain.
We allow the programmer to express what they are searching for by using higher-level concepts to express their query as tags, locations and expressions on location relations.
The location of a string of characters within the document is the interval defining its starting and ending position. Locations are grouped into sets and these sets are associated with tags. Tags can be used in conjunctions and disjunctions of interval relations to query for tuples of locations. Consider extracting the date and email from email threads of the form:
"On Wed, Oct 30, 2019 at 10:00 am Jane Doe <jane.doe@hotmail.com> wrote:"
Tags ON and WROTE can be associated with locations of strings "On" and "wrote:" in the body of the email. Tags DATE and EMAIL can be associated with dates and emails in the email body. We can find 4-tuples l1, l2, l3, l4 of locations in ON, WROTE, DATE, and EMAIL, respectively; such that each tuple satisfies the following predicate: seq_before(l1, l3, l4, l2) and distance(l1,l2)<100. Interval relation "seq_before" is satisfied if l1 < l3 < l4 < l2. Function "distance" computes the number of characters between the end of location l1 and the beginning of l2. This predicate selects 4-tuples that follow the pattern of the email thread above. To select only the date and email, we project the 4-tuple to (l3, l4).
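As a rough sketch of these semantics (my own illustration in plain Python — the talk's actual query language is declarative, and the regex-based tagging here is just a stand-in), the example query can be evaluated by enumerating tuples of tagged intervals:

```python
import re

text = "On Wed, Oct 30, 2019 at 10:00 am Jane Doe <jane.doe@hotmail.com> wrote:"

# Tag locations: each location is a (start, end) interval in the text.
tags = {
    "ON":    [m.span() for m in re.finditer(r"\bOn\b", text)],
    "WROTE": [m.span() for m in re.finditer(r"wrote:", text)],
    "DATE":  [m.span() for m in re.finditer(
        r"[A-Z][a-z]{2}, [A-Z][a-z]{2} \d{1,2}, \d{4}", text)],
    "EMAIL": [m.span() for m in re.finditer(r"[\w.]+@[\w.]+", text)],
}

def seq_before(*locs):
    """True if the locations appear strictly one after another."""
    return all(a[1] <= b[0] for a, b in zip(locs, locs[1:]))

def distance(l1, l2):
    """Number of characters between the end of l1 and the start of l2."""
    return l2[0] - l1[1]

# Query: 4-tuples (l1, l2, l3, l4) in ON x WROTE x DATE x EMAIL such that
# seq_before(l1, l3, l4, l2) and distance(l1, l2) < 100, projected to (l3, l4).
results = [
    (l3, l4)
    for l1 in tags["ON"] for l2 in tags["WROTE"]
    for l3 in tags["DATE"] for l4 in tags["EMAIL"]
    if seq_before(l1, l3, l4, l2) and distance(l1, l2) < 100
]

for l3, l4 in results:
    print(text[slice(*l3)], "|", text[slice(*l4)])
```

The projection step at the end is what turns the matched 4-tuple into just the date and email the caller asked for.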
In this talk, we will describe the query language in detail and its implementation in Python. We will present a Jupyter notebook with examples that illustrate how functionality changes and enhancements can be done by changing old queries or adding new queries, which makes the code much easier to maintain.
David Barrett, Martha L Escobar-Molano
2022-12-01T19:00:00+00:00
19:00
01:00
Talk Track I
cfp-300-embracing-multi-lingual-data-science
https://global2022.pydata.org//cfp/talk/CRABSP/
false
Embracing multi-lingual data science
Keynote
en
RStudio recently changed its name to Posit to reflect the fact that we're already a company that does more than just R. Come along to this talk to hear a few of the reasons that we love R, and to learn about some of the open source tools we're working on for python.
Hadley is Chief Scientist at Posit, winner of the 2019 COPSS award, and a member of the R Foundation. He builds tools (both computational and cognitive) to make data science easier, faster, and more fun. His work includes packages for data science (like the tidyverse, which includes ggplot2, dplyr, and tidyr) and principled software development (e.g. roxygen2, testthat, and pkgdown). He is also a writer, educator, and speaker promoting the use of R for data science. Learn more on his website.
/media/cfp/submissions/CRABSP/THR06018-square_UfosQEN.jpg
Hadley Wickham
2022-12-01T20:00:00+00:00
20:00
00:30
Talk Track I
cfp-258-reactive-data-processing-in-python
https://global2022.pydata.org//cfp/talk/EWZ3H7/
false
Reactive data processing in Python
Talk
en
Machine Learning models designed to work with streaming systems make decisions on new data points as they arrive. But there is a downside: model decisions can't be easily changed later when the model is updated with fresher data, user feedback, or freshly tuned hyperparameters. This is often a blocker for anomaly detection, recommender systems, process mining, and human-in-the-loop planning.
To deal with this, we'll demonstrate design patterns to easily express reactive data processing logic. We will use [Pathway](https://pathway.com), a scalable data processing framework built around a Python programming interface. Pathway is battle-tested with operational data in enterprise, including graphs and event streams in real-world supply chains, and is now launching as open-core.
You will leave the talk with a thorough understanding of the practical engineering challenges behind reactive data processing with a Machine Learning angle to it, and the steps needed to overcome these challenges.
In stream processing, Machine Learning models make decisions on new data points as soon as they arrive. Such immediate decisions are extremely useful, but not always the best. For example, when we consider anomaly detection on a stream of events, new effects or trends can usually only be detected with confidence some time after they have started. Past decisions will need to be revisited and reclassified - but which ones exactly? Stream processing does not bring a direct answer, and full batch recomputation can be extremely costly. The same problem holds across numerous contexts: recommender systems, process mining, ontology querying, human-in-the-loop planning systems,... How can you gracefully handle data and models which need revisiting with time, while not over-complicating even the simplest data transformations?
During the talk, you will learn the key engineering steps needed to deal with such problems through a reactive data processing design. Achieving such a design was our primary motivation to build [Pathway](https://pathway.com/developers). Pathway is a scalable data processing framework centered around a Python programming interface. It is deployed for processing live operational data in enterprise, including graphs and event streams in real-world supply chains, and is now becoming publicly available in an open-core model.
We will show you design patterns which allow you to easily express reactive Machine Learning logic. We will highlight where it is possible to rely on the usual Python data science stack and external libraries, and where special attention is needed. Most design patterns will feel familiar to users of Pandas or PySpark dataframes, so we will focus on the key differences - and why they are necessary to achieve efficient reactive operation.
In the course of the talk we will do a code demo, and we will show you how to create your own reactive data pipeline and microservice. The example will be a reactive app which predicts the future popularity and sentiment for trending topics in a well-known social network, across different geographies. We will fill in key steps in the code together, and then see it in action in full deployment (with source data API integration + frontend connected with FastAPI).
The talk is addressed to anyone - Machine Learning Engineers, Software Engineers, and Data Engineers - with an interest in building "smart" data pipelines and data products in a real-time or streaming setting. You will leave the talk with a thorough understanding of the practical engineering challenges behind reactive data processing with a Machine Learning angle, and the steps needed to overcome these challenges.
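The core idea of reactive processing — downstream results updating automatically when upstream data is revised — can be sketched in a few lines of plain Python (a toy observer pattern of my own, not the Pathway API):

```python
class Cell:
    """A tiny reactive value: subscribers recompute whenever it changes."""
    def __init__(self, value=None):
        self._value = value
        self._subscribers = []

    @property
    def value(self):
        return self._value

    def set(self, value):
        self._value = value
        for recompute in self._subscribers:
            recompute()

def derived(fn, *inputs):
    """A cell whose value is fn(*inputs), kept up to date reactively."""
    out = Cell()
    def recompute():
        out.set(fn(*(c.value for c in inputs)))
    for c in inputs:
        c._subscribers.append(recompute)
    recompute()
    return out

# A running mean that reacts when a late upstream correction arrives.
a, b = Cell(10.0), Cell(20.0)
mean = derived(lambda x, y: (x + y) / 2, a, b)
print(mean.value)   # 15.0
b.set(40.0)         # upstream revision: downstream updates automatically
print(mean.value)   # 25.0
```

A real reactive framework adds incremental computation, consistency, and scale on top of this basic propagation idea — re-running only what a revision actually affects rather than recomputing the whole batch.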
Adrian Kosowski
2022-12-01T20:30:00+00:00
20:30
00:30
Talk Track I
cfp-191-urdu-poems-to-shakespearean-english-machine-translation
https://global2022.pydata.org//cfp/talk/TJ9VQU/
false
Urdu poems to Shakespearean English - Machine Translation
Talk
en
All languages are rich in prose and poetry. A lot of the literature is inaccessible because of a lack of understanding of that language. It is often difficult to appreciate a simple translation of a poem due to gaps in cultural knowledge. A poem translated in the style of an author familiar to the reader might help to both add cultural context for the reader and capture the essence of the poem itself.
English is the dominant language of the web but with around 7000 languages worldwide and many more dialects, there is a need to explore literature in other languages. Rather than a simple machine translation or pivot machine translation which does not capture the nuance of the cultural context a language belongs to, is there a creative way to bridge the cultures? If you are multilingual and want there to be a space carved out for your spoken/written language, this talk will be worth your time.
I will focus on the data and methods I used but I will not go into the details of for example 'what transformers are?'. I will however explain terms like 'zero-shot translations', 'parallel corpus', etc. so the audience can follow along. My main aim is to give people an overview of what it would take for an idea like this to take off successfully and you can learn from the obstacles I am facing, to find representation for your language of interest. I hope my talk will inspire creative expression in others.
Some technical details:
The Urdu poetry to Shakespearean English translation is zero-shot, so Modern English was used as a pivot language. I fine-tuned a MarianMT model developed by Helsinki-NLP using Quran and Bible Urdu-English parallel corpus (BLEU score: 13.3). To convert Modern English into Shakespeare-Styled English, I fine-tuned a model based on a GPT2 and trained on Shakespeare’s plays to ”generate Shakespeare-like text.
Sidra Effendi
2022-12-01T21:00:00+00:00
21:00
00:30
Talk Track I
cfp-255-using-feedback-loops-to-tune-predictive-models-in-a-video-ad-marketplace
https://global2022.pydata.org//cfp/talk/7JEXSU/
false
Using feedback loops to tune predictive models in a video ad marketplace
Talk
en
For video advertisers, precisely hitting their ad performance goals is critical. Undershooting on campaign viewability objectives means spending money on ads that nobody watches, while overshooting them can mean vastly reducing the available ad slots. At JW Player, we combine predictive models with PID controllers to tune decision thresholds and deliver the maximum possible reach to our advertisers while hitting their goals.
At JW Player, we have created a thriving video advertising marketplace that empowers our publishers to monetize their video content and advertisers to identify high-quality and targeted ad opportunities. This requires balancing the trade-off between maximizing ad-engagement as well as campaign scale. We combine predictive and historical models and PID controllers to ensure that the ad opportunities we pass on precisely match our advertisers’ goals.
Here, we describe the PID controllers we use to tune the decision thresholds on our models, the advantages and limitations of this system, and the cascade of controllers we have designed to address such issues. We have many distinct PID controllers deployed, providing daily updates to the decision thresholds for several hundred million daily predictions. We discuss the deployment, performance, and monitoring of these controllers, and our plans for future improvements.
No technical knowledge is expected from the attendees - all concepts will be explained.
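As a hedged sketch of the mechanism (a textbook discrete PID controller and an invented toy "plant", not JW Player's production code), here is how such a feedback loop can steer a decision threshold toward a viewability goal:

```python
class PIDController:
    """Textbook discrete PID controller."""
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd, self.setpoint = kp, ki, kd, setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement):
        error = self.setpoint - measurement
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Toy stand-in for the marketplace: raising the decision threshold raises
# the achieved viewability rate of the ads we let through.
def achieved_viewability(threshold):
    return min(1.0, max(0.0, 0.4 + 0.5 * threshold))

controller = PIDController(kp=0.5, ki=0.1, kd=0.05, setpoint=0.70)
threshold = 0.5
for day in range(30):                       # one threshold update per day
    threshold += controller.update(achieved_viewability(threshold))

final_rate = achieved_viewability(threshold)
print(round(final_rate, 2))  # settles near the 0.70 viewability goal
```

The appeal of the approach is that the controller needs no model of the marketplace: it only observes the measured rate and nudges the threshold until the goal is met.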
Emily Hopper
2022-12-01T21:30:00+00:00
21:30
00:30
Talk Track I
cfp-98-data-prep-for-graphs
https://global2022.pydata.org//cfp/talk/AH9DJD/
false
Data Prep for Graphs
Talk
en
Data science practitioners have a saying that 80% of their time gets spent on data prep. Often this involves tools such as Pandas and Jupyter. Graph Data Science is similar, except the data prep techniques are highly specialized and computationally expensive. Moreover, data prep for graphs is required **before** commercial tools such as graph databases or visualization can be used effectively. This talk shows examples of data prep for graphs. A progressive example illustrates the challenges plus techniques that leverage open source integrations with the PyData stack: Arrow/Parquet, PSL, Ray, Keyvi, Datasketch, etc.
Graph technologies and use cases are growing in popularity in industry. Open source libraries are available for graph data science, which integrate with the PyData stack and related practices. Tools such as graph databases, visualization, etc., tend to take center stage in discussions about graph technologies.
However – and this is a relatively BIG "however" – similar to what was recognized a decade ago when data science became mainstream practice, so much time and effort and cost must go into _data preparation_ long before these other tools downstream can be used effectively.
In the early-ish days of Big Data, many commercial database vendors claimed to provide full suites for data science work. Practitioners found that, in contrast, they spent much of their time on data wrangling, often using tools such as Pandas. This has become the proverbial 80% of data science.
Graph data science is no exception to this rule. Case in point, data visualization tools can render beautiful representations from nearly raw data. Unfortunately, without careful preparation, the beautiful renderings become expensive wallpaper since they don't lead to meaningful outcomes. For example, if a large dataset contains many _cycles_ for a business process where these are undefined (e.g., supply networks) or it contains many duplicates (e.g., slight variations of vendor or author names) then we can get pretty pictures, but not meaningful analysis.
Unfortunately, data preparation techniques for graphs such as _cycle detection_, _similarity analysis_, _transitive closure_, and _unique identifier assignment_ often involve graph algorithms or distributed data structures which are computationally hard problems, expensive to perform, and not supported well at scale by commercial graph databases.
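As a small illustration of one such technique (a generic iterative depth-first search, applied to an invented toy supply network rather than real data), cycle detection looks like this:

```python
from collections import defaultdict

def find_cycle(edges):
    """Return one cycle in a directed graph as a list of nodes, or None.

    Iterative DFS with colors: 0 = unvisited, 1 = on current path, 2 = done.
    """
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    color, parent = {}, {}
    for start in list(graph):
        if color.get(start):
            continue
        stack = [(start, iter(graph[start]))]
        color[start] = 1
        while stack:
            node, children = stack[-1]
            child = next(children, None)
            if child is None:
                color[node] = 2
                stack.pop()
            elif color.get(child, 0) == 0:
                color[child] = 1
                parent[child] = node
                stack.append((child, iter(graph[child])))
            elif color[child] == 1:          # back edge: cycle found
                cycle, cur = [child], node
                while cur != child:
                    cycle.append(cur)
                    cur = parent[cur]
                return cycle[::-1]
    return None

# A toy supply network with one accidental loop - exactly the kind of
# undefined cycle that would render a pretty graph rendering meaningless.
edges = [("mill", "bakery"), ("bakery", "cafe"), ("cafe", "mill"), ("farm", "mill")]
print(find_cycle(edges))  # a cycle such as ['bakery', 'cafe', 'mill']
```

At this toy scale DFS is trivial; the point in the talk is that the same operation over hundreds of millions of edges is where distributed tooling becomes necessary.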
This talk shows examples of data preparation for graphs, along with an overview of typical graph use cases in industry in which these need to be used. We'll show a progressive example based on recipe data (analogous to customer data in manufacturing) along with use of the PyData stack and other open source integrations such as Ray, Keyvi, Datasketch, Arrow/Parquet, PSL, etc., which help alleviate bottlenecks at scale when working with large graphs.
/media/cfp/submissions/AH9DJD/logo_36wwYK5.png
Paco Nathan
2022-12-01T22:00:00+00:00
22:00
00:30
Talk Track I
cfp-269-the-10-commandments-of-reliable-data-science
https://global2022.pydata.org//cfp/talk/VKEWPE/
false
The 10 commandments of reliable data science
Talk
en
Data science as a professional discipline is still in its infancy, and our field lacks widespread technical norms around project organization, collaboration, and reproducibility. This is painful both for practitioners and their end users because disorganized analysis is bad analysis, and bad analysis costs money and wastes time. This talk presents ten principles for correct and reproducible data science inheriting from software engineering’s seven decades of hard-earned lessons as well as numerous experiences with data science teams at organizations of all sizes. We motivate these principles by looking at some hard truths about data science “in the wild.”
Organizations have accepted the premise that data science, done well, can be a powerful toolbox for increasing efficiency, automating expensive processes, and making better decisions. But what project decisions characterize and tend to result in “high quality” data science work products? As a field, we have yet to center on a set of engineering norms promoting organized, correct, and reproducible analysis.
Trustworthy data science work requires slightly more upfront investment, but pays immense dividends in dependability, usefulness, and ease of collaboration. This talk presents ten principles for correct and reproducible data science that inherit from software engineering’s seven decades of hard-earned lessons.
Combining lessons learned from the software world with field experience from observing numerous data science teams of all sizes and configurations, these principles provide both tactical recommendations as well as higher level ideas relevant to individual practitioners, engineering managers, and senior leaders of data organizations.
/media/cfp/submissions/VKEWPE/Het_verongelukte_KLM-toestel_De_Rijn_Bestanddeelnr_929-1005_-_cro_9oMKKmX.jpg
Isaac Slavitt
2022-12-01T22:30:00+00:00
22:30
00:30
Talk Track I
cfp-94-the-dask-at-hand-using-dask-to-speed-up-the-high-quality-transit-areas-dataset-for-the-ca-open-data-portal-
https://global2022.pydata.org//cfp/talk/D9XHMC/
false
The Dask at Hand: Using Dask to Speed up the High Quality Transit Areas dataset for the CA Open Data Portal.
Talk
en
Where are CA’s frequent, high quality transit corridors? The CA Public Resources Code defines it, but finding them requires continued access to General Transit Feed Specification (GTFS) data and fairly complex geospatial processing. The Integrated Travel Project within Caltrans tackles this by leveraging the combined powers of Dask and Python to make this dataset publicly available and updated monthly on the CA open data portal.
Where are CA’s frequent, high quality transit corridors? The CA Public Resources Code defines it, but finding them requires continued access to General Transit Feed Specification (GTFS) data and fairly complex geospatial processing. Luckily, the Integrated Travel Project, within Caltrans, has a pipeline of GTFS Schedule data, and is building out its pipeline of GTFS Real Time data. GTFS data is public, and it only makes sense that we make more of the GTFS data products publicly available and accessible.
GTFS provides all the details about scheduled transit service on a given day. We simply wanted to know whether certain areas exceeded the threshold of 15 min frequency bus service and also where rail, ferry, and bus rapid transit stops were, and make that available to the public. Is that too much to ask? Turns out, it required geospatial processing to slice the bus route network into equally sized 1,250 meter segments and to count how many bus arrivals occurred on each segment every hour. The multiple stages of data processing meant we couldn’t simply dump all the GTFS contents from the warehouse into the open data portal.
At minimum, we wanted to refresh the open data portal dataset monthly, but Python alone left us spending upwards of 6 hours in computation time. No can do! Even without expanding our computation resources, we leveraged Dask to cut the computation time by more than half.
This talk is for data wonks with some Python or GIS background who aren’t afraid of wading into the details of a necessary “conceptual rewrite” to use Dask. As a recent user of Dask, I am by no means an expert, but am a learner and explorer of new tools, and will give this talk from a Dask beginner’s perspective. It will be informative, covering concepts of data processing and wrangling with a sprinkling of syntax.
The talk will cover how we interpreted and translated the statutory laws into Python code, identify the most time-consuming portions of the workflow, and highlight our business conditions and resource constraints that pushed us to embrace Dask. Specifically, it will cover how we used `dask` and `dask_geopandas` to extend our use of Python’s `pandas` and `geopandas`. It will also discuss why some limitations in `dask_geopandas` led us to a different conceptual rewrite of the code to fully utilize Dask.
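At its core, the counting step can be sketched in pandas (a hypothetical mini-batch of arrivals, not Caltrans data; the real pipeline performs this with `dask`/`dask_geopandas` at scale):

```python
import pandas as pd

# Hypothetical GTFS-derived bus arrivals, already joined to the
# 1,250 m route segment each stop falls on.
arrivals = pd.DataFrame({
    "segment_id": ["A", "A", "A", "A", "B", "B", "A"],
    "arrival_time": pd.to_datetime([
        "2022-12-01 08:02", "2022-12-01 08:17", "2022-12-01 08:31",
        "2022-12-01 08:46", "2022-12-01 08:10", "2022-12-01 09:05",
        "2022-12-01 09:12",
    ]),
})

# Count arrivals per segment per hour; 4+ arrivals/hour ~ 15-minute service.
per_hour = (
    arrivals
    .assign(hour=arrivals["arrival_time"].dt.hour)
    .groupby(["segment_id", "hour"])
    .size()
    .rename("n_arrivals")
    .reset_index()
)
high_quality = per_hour[per_hour["n_arrivals"] >= 4]
print(high_quality)  # segment A meets the frequency threshold at hour 8
```

Swapping `pd.DataFrame` for a `dask.dataframe` keeps this groupby logic essentially intact while letting it run out-of-core and in parallel — which is the "conceptual rewrite" territory the talk explores.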
Tiffany Chu
2022-12-01T08:00:00+00:00
08:00
00:30
Talk Track II
cfp-209-managing-python-dependencies-at-scale
https://global2022.pydata.org//cfp/talk/BPFCBT/
false
Managing Python Dependencies at scale
Talk
en
This talk is about the approach we've taken at Apache Airflow for managing dependencies at the scale of a project that is the most popular data orchestrator in the world, consists of ~80 independent packages, and has more than 650 dependencies in total (and how we did not lose our sanity).
In this talk we will discuss the challenges we faced when managing Apache Airflow's dependencies at scale. The maintainer who acts as the "dependency-maintainer" for Apache Airflow will present the solutions that allowed Airflow to survive breaking one monolithic package into about 80 smaller ones and to continue releasing Airflow for the last 4 years.
Despite the complexities of the Python dependency world and breaking changes in PyPI and setuptools, we will tell you how he kept his sanity while managing this environment: through non-stop development of Airflow across multiple releases, keeping our users happy by letting them not only consistently install but also upgrade their dependencies while using Airflow, and surviving a number of severely breaking changes that our dependencies introduced - changes that broke multiple other packages out there.
Apache Airflow is one of the biggest projects in PyPI when it comes to dependencies. Airflow itself consists of the main "Airflow" package, but in addition to that Apache Airflow releases 70+ provider packages that give optional Airflow functionality.
We regularly release - at least 20-30 packages a month (sometimes all 70+), and when you count all transitive dependencies we have way more than 500 (!) Python dependencies. It's so big that we broke `pip` after the new dependency resolver was introduced and got into an argument with PyPI maintainers, which finally led to building new friendships (and enemies) and helping PyPI become more stable and robust.
It also shows how you can manage your dependencies securely, without spending a ton of time on testing the latest security fixes - which is absolutely necessary in the wake of "supply-chain security" awareness.
The story is quite fascinating (for those who are fascinated by dependency-hell management, that is) - and while you might not have as big a scope as Airflow does, you might learn a few tricks and approaches that could prove useful in your project.
Jarek Potiuk
2022-12-01T08:30:00+00:00
08:30
00:30
Talk Track II
cfp-241-arch-garch-models-tour
https://global2022.pydata.org//cfp/talk/S7T7VG/
false
ARCH/GARCH Models Tour
Talk
en
When the goal of your study is to analyze and forecast volatility, ARCH/GARCH models come into the picture to solve complicated time series problems.
Autoregressive conditional heteroskedasticity (ARCH) is a statistical model used to analyze volatility in time series to forecast future volatility. ARCH modeling shows that periods of high volatility are followed by more high volatility and periods of low volatility are followed by more low volatility.
Generalized Autoregressive Conditional Heteroskedasticity (GARCH) is an extension of the ARCH model that incorporates a moving average component together with the autoregressive component. GARCH models assume that the variance of the error term follows an autoregressive moving average process.
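The clustering behavior described above can be seen by simulating a GARCH(1,1) process directly (a NumPy sketch with invented parameters; in practice one would typically fit such models with a dedicated package):

```python
import numpy as np

rng = np.random.default_rng(42)

# GARCH(1,1): sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}
omega, alpha, beta = 0.1, 0.1, 0.85   # alpha + beta < 1 => stationary
n = 2000
returns = np.zeros(n)
sigma2 = np.zeros(n)
sigma2[0] = omega / (1 - alpha - beta)  # unconditional variance

for t in range(1, n):
    sigma2[t] = omega + alpha * returns[t - 1] ** 2 + beta * sigma2[t - 1]
    returns[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

# Volatility clustering: squared returns are autocorrelated even though
# the returns themselves are (close to) uncorrelated.
def lag1_autocorr(x):
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print("lag-1 autocorr of r^2:", round(lag1_autocorr(returns**2), 2))
print("lag-1 autocorr of r:  ", round(lag1_autocorr(returns), 2))
```

The positive autocorrelation in squared returns is exactly the "high volatility follows high volatility" pattern that motivates ARCH/GARCH modeling.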
The structure of the talk will flow like this:
• Quick refresher on Time Series
• Overview of ARCH/GARCH models
• Importance of these two models, when & why
• How to configure these models
• How to estimate and forecast them
• Real time case study in Python (from loading the data to model evaluation)
#### Goal
By the end of the talk, I promise you will have a fair understanding of ARCH/GARCH models, and you can straight away put them into practice on your time series data.
Kalyan Prasad
2022-12-01T09:00:00+00:00
09:00
00:30
Talk Track II
cfp-164-data-validation-for-feature-pipelines-using-great-expectations-and-hopsworks
https://global2022.pydata.org//cfp/talk/ZEZKSQ/
false
Data Validation for Feature pipelines: Using Great Expectations and Hopsworks
Talk
en
Have you ever trained an awesome model just to have it break in production because of a null value? At its core, a feature store needs to provide reliable features to data scientists to build and productionize models. So how can we avoid garbage-in, garbage-out situations? Great Expectations is the most popular library for data validation, and so the two are a natural fit. In this talk we will touch briefly upon different Python data validation libraries such as Pydantic and Pandera, but then dive deeper into Great Expectations' concepts and how you can leverage them in feature pipelines powering a feature store.
Have you ever trained an awesome model just to have it break in production because of a null value? At its core, a feature store needs to provide reliable features to data scientists to build and productionize models. So how can we avoid garbage-in, garbage-out situations? Great Expectations is the most popular library for data validation, and so the two are a natural fit. In this talk we will touch briefly upon different Python data validation libraries such as Pydantic and Pandera, but then dive deeper into Great Expectations' concepts and how you can leverage them in feature pipelines powering a feature store.
After this talk you will …
1. Understand the tradeoffs and different uses of the three data validation libraries.
2. Understand the core concepts of great expectations and what they are for.
3. Understand the core principle of feature stores.
4. Understand how and why data validation fits into the workflow with a feature store.
5. Learn how we leverage Great Expectations in Hopsworks Feature Store to enhance data quality.
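To make the idea concrete, here is a minimal stand-in for two such checks in plain pandas (illustrative helper functions of my own, not the Great Expectations API, whose expectations are exposed as dataframe methods such as `expect_column_values_to_not_be_null`):

```python
import pandas as pd

# A feature batch on its way into the feature store.
features = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "avg_session_min": [12.5, None, 7.1, 30.0],   # a sneaky null value
    "country": ["SE", "DE", "SE", "US"],
})

def expect_not_null(df, column):
    bad = df[column].isna()
    return {"success": not bad.any(), "unexpected_count": int(bad.sum())}

def expect_between(df, column, low, high):
    bad = ~df[column].between(low, high)   # NaN also fails the range check
    return {"success": not bad.any(), "unexpected_count": int(bad.sum())}

report = {
    "avg_session_min not null": expect_not_null(features, "avg_session_min"),
    "avg_session_min in [0, 600]": expect_between(features, "avg_session_min", 0, 600),
}
for name, result in report.items():
    print(name, "->", result)
# The null row fails validation before it can poison a training set.
```

Running such a validation suite on every feature pipeline write — and rejecting or quarantining failing batches — is the "garbage in" gate the talk describes.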
Moritz Meister
2022-12-01T10:00:00+00:00
10:00
00:30
Talk Track II
cfp-83-interpretable-and-realistic-generative-models-in-data-science-likelihood-free-bayes-says-yes-
https://global2022.pydata.org//cfp/talk/L9PMLX/
false
Interpretable and realistic generative models in data science? Likelihood-free Bayes’ says yes!
Talk
en
Are you fascinated by the real-life images or text produced by deep generative models but cannot interpret their underlying data generation process or see how they can be applied to other problems? I will talk about generative simulations built using knowledge of the problem domain that can produce realistic data in a variety of scenarios. This talk will be a Bayesian thinking exercise cum data science case study of product star rating timeseries from an online marketplace (like Amazon.com) – I will show how we use recent advances in likelihood-free Bayesian inference together with a detailed simulation of an online marketplace to directly infer factors involved in how customers purchase and rate products.
In recent years, generative models have been getting a lot of attention in the search for ML solutions that are both interpretable/explainable and robust. This is because of the hope that, on the one hand, a model that generates realistic-looking datasets can also give insights into a real-world data generating process, while on the other hand, such a model can be robust to outliers as it is aware of the kinds of data it can or cannot generate. While deep generative models of images and text have been incredibly successful in the data generation task itself, their underlying neural network structure makes it challenging to interpret the way in which they generate data or generalize by further using them in a discriminative ML system that is also robust.
In this talk, I will discuss how realistic and detailed simulations of the problem domain might actually be the generative models we are looking for in interpretable and robust ML systems. Simulation models are common in the sciences (eg. simulations of planetary systems in astrophysics, agent-based simulations in the social sciences, neural circuit simulations in neuroscience, etc) and in engineering (eg. aerodynamic simulations in mechanical engineering, etc), and their parameters correspond directly to important causal factors in the data generating process. However, the “inverse problem” of inferring model parameters from observed data typically follows a computationally intensive grid-search-like process, and it is unclear how they can be used in the discriminative ML tasks that dominate the world of real-life data science problems.
I will go through our work on Bayesian inference in agent-based simulations of customers purchasing products on an online marketplace like Amazon.com to:
1.) Motivate the need for generative models in modern ML systems and how generative simulations have interpretability built in (~10mins).
2.) Show how recent advances in likelihood-free Bayesian inference with deep learning can infer posterior distributions over interpretable parameters at scale across the entire catalog of an online marketplace (~10mins).
3.) Use the inferred posteriors for posterior predictive simulations in rating timeseries prediction and abnormal (or fake) rating classification tasks (~10mins).
This talk will appeal to both the beginning and seasoned data science practitioner. For beginners, the first half of the talk will highlight the importance of generative models for the data science problems, show how to build such a model for an online marketplace and explain fundamental Bayesian concepts like posterior inference and posterior predictive sampling. For more experienced data scientists, the second half will go into details of likelihood-free Bayesian inference for posterior estimation and how deep probabilistic neural networks can be used for this task. Throughout the talk, I will show code snippets from our (in development) Python package to simulate and infer parameters for online marketplaces that we hope can be used by attendees directly in their own daily data science work.
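The core likelihood-free idea can be sketched with classic ABC rejection sampling on an invented toy rating model (the talk itself uses deep-learning-based inference and a far richer agent-based marketplace simulation):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ratings(quality, n=500):
    """Toy generative model: customers rate a product 1-5; higher latent
    quality shifts ratings upward. A stand-in for the full agent-based
    marketplace simulation described in the talk."""
    probs = np.exp(quality * np.arange(1, 6))
    probs = probs / probs.sum()
    return rng.choice(np.arange(1, 6), size=n, p=probs)

def summary(ratings):
    return np.array([ratings.mean(), (ratings == 5).mean()])

# "Observed" data from a product whose true (unknown) quality is 0.8.
observed = summary(simulate_ratings(0.8))

# ABC rejection sampling: draw quality from the prior, simulate, and keep
# the draws whose summary statistics land close to the observed ones.
prior_draws = rng.uniform(0.0, 2.0, size=3000)
eps = 0.1
posterior = [q for q in prior_draws
             if np.abs(summary(simulate_ratings(q)) - observed).max() < eps]

print(f"accepted {len(posterior)} draws, posterior mean quality "
      f"{np.mean(posterior):.2f}")  # sits near the true value 0.8
```

The accepted draws approximate the posterior over the interpretable parameter without ever writing down a likelihood — the neural likelihood-free methods in the talk replace this brute-force rejection loop so the approach scales to an entire product catalog.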
Narendra Mukherjee
2022-12-01T10:30:00+00:00
10:30
00:30
Talk Track II
cfp-123-explaining-why-you-have-a-favorite-cereal
https://global2022.pydata.org//cfp/talk/WHBJL9/
false
Explaining Why You have a Favorite Cereal
Talk
en
It’s crunchy! It’s sweet! Maybe it is the presence of the nuts or their absence. There are various features that make you favor a particular cereal. Now surely, if we modeled the consumer ratings for cereals, some features would be considered more important than others. After all, feature engineering is one of the most critical steps in modeling. But after the model is up and running, what if we tweak the features just to see how much meddling can affect the preference? This process is called post-hoc feature attribution and it seeks to interpret the model behavior. In this talk, let us spoon through the interpretability of ML models.
Explainable AI (XAI) and Interpretable Machine Learning (IML) are exciting new fields that are nudging AI development towards more transparent modeling. They bring accountability and transparency to a virtual black box, that is, a trained model working for you. The model could be sitting behind a snazzy user interface or a document, or it could be a trained model that you as a data scientist would now like to enhance.
In the post-modeling aka post-hoc stage, interpretability aims at the understanding of dynamics between the input features and the output predictions, i.e., it would help understand the contribution of the features to the model predictions.
In the proposed talk, I would gently introduce you to the above concepts and some common methods of realizing them. Lastly, not to act like a cereal killer but let’s seek interpretations from a cereal rating dataset!
**My talk will focus on**
* How are explainability and interpretability different?
* Types of interpretations in IML.
* Python code to seek interpretability from a model trained on the cereals rating dataset.
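As a minimal sketch of post-hoc feature attribution (permutation importance on an invented toy dataset with a plain least-squares model, not the talk's actual cereal analysis):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy cereal data: the rating depends strongly on sugar, a little on fiber,
# and not at all on shelf position.
n = 400
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# A fitted "model": ordinary least squares.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(data):
    return data @ w

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

baseline = mse(y, predict(X))

# Post-hoc permutation feature attribution: shuffle one feature at a time
# and measure how much the error grows. Big growth = the model leans on it.
importances = {}
for j, name in enumerate(["sugar", "fiber", "shelf_position"]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importances[name] = mse(y, predict(Xp)) - baseline
    print(f"{name:>15}: MSE rises by {importances[name]:.3f}")
```

Because the meddling happens after training, the technique treats the model as a black box — the same probe works whether the model is a linear fit or a deep network.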
Gatha
2022-12-01T11:30:00+00:00
11:30
00:30
Talk Track II
cfp-51-detecting-anomalous-sequences-using-text-processing-methods
https://global2022.pydata.org//cfp/talk/LHKSG7/
false
Detecting anomalous sequences using text processing methods
Talk
en
Hello wait you talk see to can’t all my in!
Sounds weird, right?! Detecting abnormal sequences is a common problem.
Join my talk to see how this problem involves BERT, Word2vec, and autoencoders, and learn how you can also apply it to information security.
Dealing with sequences is always an interesting project, because each item has its unique position in the sequence and there’s a connection between all items and their positions. One of the most common issues when working with sequences is dealing with anomalous sequences that don’t fit the regular sequence structure. Those sequences make no sense, create noise in the data, and interrupt the learning process.
The most common sequences are text sentences, and a possible scenario for abnormal text sequences is speech-to-text translation, where irrelevant noise sometimes gets translated into nonsense sequences. If we want to build a model based on that data, we need a way to identify and clean this irrelevant, anomalous data.
Detecting anomalous sequences can also apply to non-text sequences, such as sequences of actions or events. Those scenarios can relate to information security problems. For example, many organizations keep logs of actions performed on internal systems, and detecting suspicious sequences of actions can be crucial for catching attacks or misuse of those systems.
In order to detect those sequences, we need to model the items in the sequence and understand its structure and connections. There’re several word embedding algorithms for generating vectors embedding out of sequences, such as word2vec, bert etc. The next step, after creating the sequence embedding, is detecting the anomalies. The algorithm we used for the anomaly detection phase is autoencoder, where you can train the model on normal data and detect the abnormal events.
This pipeline has some challenges, for example one of them is that each sequence has different length and there’s a need for training both the word embedding algorithm and the autoencoder to know how to learn the right structure of all possible lengths.
Join my talk and I’ll show you how you can process your sequences, using word embedding algorithms such as Bert and Word2vec and use their output for autoencoders in order to create an anomaly detection model for detecting suspicious sequences.
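The embedding-plus-autoencoder pipeline described above can be sketched with a linear stand-in. In this hypothetical example, PCA reconstruction error takes the place of a trained autoencoder, and synthetic random vectors stand in for Word2vec/BERT sentence embeddings; none of this is the speaker's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for sentence embeddings: "normal" sequences live in a
# low-dimensional subspace of the embedding space; anomalies do not.
normal = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 16))
anomaly = rng.normal(size=(5, 16)) * 3.0

# "Train" a linear autoencoder: keep the top-2 principal components.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:2]  # shared encoder/decoder weights

def reconstruction_error(x):
    code = (x - mean) @ components.T        # encode
    recon = code @ components + mean        # decode
    return ((x - recon) ** 2).mean(axis=1)  # per-sample error

# Anything reconstructed worse than the worst normal sample is flagged.
threshold = reconstruction_error(normal).max()
flags = reconstruction_error(anomaly) > threshold
print(flags)
```

A real autoencoder replaces the SVD with a trained encoder/decoder network, but the decision rule (thresholding reconstruction error) is the same.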
Liron Faybish
2022-12-01T12:00:00+00:00
12:00
00:30
Talk Track II
cfp-99-do-you-follow-what-i-m-explaining-a-practitioner-s-guide-to-opening-the-ai-black-box-for-humans
https://global2022.pydata.org//cfp/talk/T3N9MP/
false
Do You Follow What I’m Explaining? A Practitioner’s Guide to Opening the AI Black Box for Humans
Talk
en
Numerous tools generate "explanations" for the outputs of machine-learning models and similarly complex AI systems. However, such “explanations” are prone to misinterpretation and often fail to enable data scientists or end-users to assess and scrutinize “an AI.” We share best practices for implementing “explanations” that their human recipients understand.
Methods and techniques from the realm of artificial intelligence (AI), such as machine learning, find their way into ever more software and devices. As more people interact with these highly complex and opaque systems in their private and professional lives, there is a rising need to communicate AI-based decisions, predictions, and recommendations to their users.
So-called “interpretability” or “explainability” methods claim to allow insights into the proverbial “black boxes.” Many data scientists use tools like SHAP, LIME, or partial dependence plots in their day-to-day work to analyze and debug models.
However, as numerous studies have shown, even experienced data scientists are prone to interpret the “explanations” generated by these tools in ways that support their pre-existing beliefs. This problem becomes even more severe when “explanations” are presented to end-users in hopes of allowing them to assess and scrutinize an AI system’s output.
In this talk, we’ll explore the problem space using the example of counterfactual explanations for price estimates. Participants will learn how to employ user studies and principles from human-centric design to implement “explanations” that fulfill their purpose.
No prior data science knowledge is required to follow the talk, but a basic familiarity with the concept of minimizing an objective function will be helpful.
Kilian Kluge
2022-12-01T12:30:00+00:00
12:30
00:30
Talk Track II
cfp-95-data-storytelling-through-visualization
https://global2022.pydata.org//cfp/talk/QK7B9M/
false
Data Storytelling through Visualization
Talk
en
Data is everywhere. It is through analysis and visualization that we are able to turn data into *information* that can be used to drive better decision making. Out-of-the-box tools will allow you to create a chart, but if you want people to take action, your numbers need to tell a compelling story. Learn how elements of storytelling can be applied to data visualization.
Data is everywhere. It is through analysis and visualization that we are able to turn data into *information* that can be used to drive better decision making. Out-of-the-box tools will allow you to create a chart, but if you want people to take action, your numbers need to tell a compelling story.
This talk will show, through numerous examples, how elements of storytelling can be applied to data visualization to uncover the story hidden in your data.
Additionally, we'll question how objective data visualizations really are. Seemingly small alterations to a chart, such as its title or point of comparison, may drive the viewer to wildly different conclusions. What can you do to guide viewers towards a specific (positive or negative) conclusion? Can a graph be truly neutral?
This talk will leave you with both a better understanding of how graphs should be interpreted and the ability to better convey the meaning of your data through visualization.
Marysia Winkels
2022-12-01T13:00:00+00:00
13:00
00:30
Talk Track II
cfp-16-start-asking-your-data-why-a-gentle-introduction-to-causal-inference
https://global2022.pydata.org//cfp/talk/HSHG88/
false
Start asking your data “Why?” - A Gentle Introduction To Causal Inference
Talk
en
Correlation does not imply causation. It turns out, however, that with some simple ingenious tricks one can unveil causal relationships within standard observational data, without having to resort to expensive randomised control trials. Learn how to make the most out of your data, avoid misinterpretation pitfalls and draw more meaningful conclusions by adding causal inference to your toolbox.
Are you interested in understanding *Causal Inference* but not sure where to start? In this talk I introduce the basic concepts demonstrated in an accessible manner using visualisations as well as python scripts.
In particular, I illustrate the utility of graph models to visualise the story behind the data, which enables going beyond correlations to make data-driven decisions based on causation.
You will also learn how to avoid data misinterpretation pitfalls such as Simpson’s Paradox, a situation where the outcome of a population is in conflict with that of its cohorts. This will be demonstrated using `pgmpy` as well as a `streamlit` interactive web app: `bit.ly/simpson-calculator`.
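Simpson's paradox is easy to reproduce with a few numbers. The figures below are adapted from the classic kidney-stone treatment example (they are illustrative, not from the talk itself): a treatment can win in every cohort yet lose in the pooled totals.

```python
# (recovered, total) per treatment, split by cohort severity.
mild   = {"A": (81, 87),   "B": (234, 270)}
severe = {"A": (192, 263), "B": (55, 80)}

def rate(recovered, total):
    return recovered / total

# Treatment A beats B within each cohort...
for cohort in (mild, severe):
    assert rate(*cohort["A"]) > rate(*cohort["B"])

# ...yet B wins once the cohorts are pooled, because A was
# disproportionately given to the harder (severe) cases.
pooled_a = rate(81 + 192, 87 + 263)   # 273/350 = 0.78
pooled_b = rate(234 + 55, 270 + 80)   # 289/350 ≈ 0.826
assert pooled_b > pooled_a
```

The reversal is driven by the lurking variable (severity), which is exactly the kind of structure a causal graph makes explicit.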
This talk is targeted at anyone, technical or managerial, who wants to improve how they make data-driven decisions. No prior knowledge of python is required; basic statistics is desirable but not essential. My main message is that by adding causal thinking to your analytical toolbox you are likely to ask better questions of your data and ultimately get more insights from it.
For those inclined to learn more in depth about Causal Inference, I will summarise with advice on how to climb the "causal ladder" by suggesting useful resources.
/media/cfp/submissions/HSHG88/Screenshot_2022-08-04_at_19.38.59_kDu6n05.png
Eyal Kazin
2022-12-01T13:30:00+00:00
13:30
00:30
Talk Track II
cfp-152-crowd-kit-a-scikit-learn-for-crowdsourced-annotations
https://global2022.pydata.org//cfp/talk/CCLNVD/
false
Crowd-Kit: A Scikit-Learn for Crowdsourced Annotations
Talk
en
This talk presents Crowd-Kit, an open-source computational quality control library, followed by a demonstration.
Crowdsourced annotations in most cases require post-processing due to their heterogeneous nature: raw data contains errors, is biased, and is non-trivial to combine. Crowd-Kit provides methods for aggregation, uncertainty estimation, and agreement, which help turn data labeled through crowdsourcing into an interpretable result.
A huge amount of data for machine learning is gathered through crowdsourcing pipelines. Crowdsourcing is a useful tool when one needs to collect voluminous problem-specific datasets for training/testing/monitoring ML models in a relatively short period of time. However, aggregating the results is non-trivial for tasks other than classification, and raw results require processing to extract their real value. Crowd-Kit is a library designed to handle crowdsourced video, audio, image, textual, and categorical data with an interface similar to scikit-learn.
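The simplest of the aggregation methods the abstract alludes to is plain majority vote over redundant labels. The sketch below is written from scratch for illustration (the task/worker/label names are made up) rather than using Crowd-Kit's own classes:

```python
from collections import Counter, defaultdict

# Hypothetical raw crowd labels: (task, worker, label) triples.
labels = [
    ("img1", "w1", "cat"), ("img1", "w2", "cat"), ("img1", "w3", "dog"),
    ("img2", "w1", "dog"), ("img2", "w2", "dog"),
]

def majority_vote(rows):
    # Group labels per task, then keep each task's most frequent label.
    by_task = defaultdict(list)
    for task, _worker, label in rows:
        by_task[task].append(label)
    return {t: Counter(ls).most_common(1)[0][0] for t, ls in by_task.items()}

print(majority_vote(labels))  # {'img1': 'cat', 'img2': 'dog'}
```

Crowd-Kit's value is in the more sophisticated aggregators (which weight workers by estimated skill) exposed behind the same kind of fit/aggregate interface.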
Evgeniya
2022-12-01T15:00:00+00:00
15:00
00:30
Talk Track II
cfp-37-trying-no-gil-on-scientific-programming
https://global2022.pydata.org//cfp/talk/DVEVPE/
false
Trying No GIL on Scientific Programming
Talk
en
Recently Sam Gross, the author of the nogil fork of Python 3.9, demonstrated that the GIL can be removed. For scientific programs with heavy CPU-bound workloads, this could be a huge performance improvement. In this talk, we will see if this is true and compare the nogil version to the original.
In this talk, we will have a look at what no-gil Python is and how it may improve the performance of some scientific calculations. First, we will cover the background of the Python GIL: what it is and why it is needed, but also why it stops multi-threaded CPU-bound processes from taking advantage of multi-core machines.
After that, we will look at no-gil Python, a fork of CPython 3.9 by Sam Gross that provides an alternative Python without the GIL and demonstrates what future versions of Python could look like. We will try out this version of Python on some popular yet calculation-heavy algorithms in scientific programming and data science, e.g. PCA, clustering, categorization, and data manipulation with scikit-learn and pandas, and compare its performance with the standard CPython distribution.
This talk is for Pythonistas who have intermediate knowledge of Python and are interested in using Python for scientific programming or data science. It may shed some light on a more efficient way of using Python for their tasks and spark interest in trying the no-gil version of Python.
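A minimal sketch of why the GIL matters for code like this: the threaded version below is correct, but on standard CPython the GIL serialises the pure-Python work, so it runs no faster than a single thread; under the nogil build the two threads can actually execute in parallel. (This is an illustrative benchmark shape, not the talk's actual code.)

```python
import threading

def sum_squares(lo, hi, out, idx):
    # Pure-Python CPU-bound work; holds the GIL on standard CPython.
    out[idx] = sum(i * i for i in range(lo, hi))

n = 200_000
out = [0, 0]
threads = [
    threading.Thread(target=sum_squares, args=(0, n // 2, out, 0)),
    threading.Thread(target=sum_squares, args=(n // 2, n, out, 1)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The split-and-merge result matches the serial computation either way;
# only the wall-clock time differs between GIL and no-gil builds.
assert sum(out) == sum(i * i for i in range(n))
```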
Cheuk Ting Ho
2022-12-01T15:30:00+00:00
15:30
00:30
Talk Track II
cfp-205-deploying-dask
https://global2022.pydata.org//cfp/talk/9KFH9E/
false
Deploying Dask
Talk
en
Dask is a framework for parallel computing in Python.
It's great, until you need to set it up.
Kubernetes? Cloud? HPC? SSH? YARN/Hadoop even?
What's the right deployment technology to choose?
After you set it up, a new set of problems arises:
- How do you install software across the cluster?
- How do you secure network access?
- How do you access secure data that needs credentials?
- How do you track who uses it and constrain costs?
- When things break, how do you track them down?
There exist solutions to these problems in open source packages like dask-kubernetes, helm charts, dask-cloudprovider, and dask-gateway, as well as commercially supported products like Coiled, Saturn, QHub, AWS EMR, and GCP Dataproc. How do we choose?
This talk describes the problem faced by people trying to deploy *any* distributed computing system, and tries to construct a framework to help them make decisions on how to deploy.
Matthew Rocklin
2022-12-01T16:00:00+00:00
16:00
00:30
Talk Track II
cfp-247-create-text-classifiers-in-a-few-hours-using-the-open-source-no-code-label-sleuth-system
https://global2022.pydata.org//cfp/talk/DWNLKQ/
false
Create text classifiers in a few hours using the open-source, no-code Label Sleuth system
Talk
en
Domain experts often need to create text classification models; however, they may lack the ML or coding expertise to do so. In this talk, we show how domain experts can create text classifiers without writing a single line of code through the open-source, no-code Label Sleuth system ([www.label-sleuth.org](https://www.label-sleuth.org)), a system that combines an intuitive labeling UI with active learning techniques and integrated model training functionality. Finally, we describe how the system can also benefit more technical users, such as data scientists and developers, who can customize it for more advanced usage.
**What is the problem and who is the target audience?**
- Domain experts that want to create text classifiers but lack ML knowledge.
- More technical users, including researchers, data analysts, and developers, who want to accelerate the development of text classifiers.
A quick overview of current alternatives will be given.
**What is Label Sleuth?**
An open-source, interactive, no-code system for text annotation and text classifier creation. A brief tour and demo of the system will be given, showcasing the UI and interaction flow from the viewpoint of the end user.
**Underlying technologies & features**
A brief explanation of Sleuth’s internals will be given, which include:
- Intuitive text annotation UI (built using React).
- Active learning techniques to focus the domain expert’s labeling effort on the most valuable examples.
- Integrated model training functionality, which iteratively trains improved versions of the model in the background without user intervention (internally leveraging PyTorch, scikit-learn, HuggingFace transformers, spaCy, and NumPy).
- Pluggable architecture, allowing experimentation with different active learning algorithms and ML models.
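"Focusing the labeling effort on the most valuable examples" typically means uncertainty sampling: label the items the current model is least sure about. A minimal sketch of that idea follows; the scores and document names are hypothetical, and this is not Label Sleuth's actual API.

```python
# Hypothetical model confidence scores for as-yet-unlabeled texts.
scores = {"doc1": 0.97, "doc2": 0.52, "doc3": 0.08, "doc4": 0.41}

def next_to_label(scores, k=2):
    # Pick the k examples whose scores are closest to 0.5,
    # i.e. where the current classifier is least certain.
    return sorted(scores, key=lambda d: abs(scores[d] - 0.5))[:k]

print(next_to_label(scores))  # ['doc2', 'doc4']
```

Each labeling round then retrains the model and recomputes the scores, which is what the "iterative training in the background" bullet automates.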
**Why Label Sleuth?**
Success stories of using Label Sleuth will be provided, explaining how the system helped reduce the respective model creation effort and facilitated model creation for new audiences.
Yannis KatsisEyal Shnarch
2022-12-01T16:30:00+00:00
16:30
00:30
Talk Track II
cfp-242-on-copies-and-views-updating-pandas-internals-a-k-a-getting-rid-of-the-settingwithcopywarning-
https://global2022.pydata.org//cfp/talk/YJCYE3/
false
On copies and views: updating pandas' internals (a.k.a. “Getting rid of the SettingWithCopyWarning”)
Talk
en
Pandas’ current behavior on whether indexing returns a view or copy is confusing, even for experienced users. But it doesn’t have to be this way. We can make this aspect of pandas easier to grasp by simplifying the copy/view rules, and at the same time make pandas more memory-efficient. And get rid of the SettingWithCopyWarning.
Users of pandas probably have run into the infamous “SettingWithCopyWarning”. Several lengthy blog posts and popular stack overflow questions go into the details on what it is and how to deal with it. At the core of this, pandas’ current behavior on whether indexing returns a view or copy is confusing. Pandas’ internals will, for most users, be kind of a black box, and it is hard to reason about how the column’s memory is stored. Even for experienced users, it’s hard to tell whether a view or copy will be returned.
But it doesn’t have to be this way. We can simplify the rules and let any indexing operation or method that returns a new DataFrame always behave as if it were a copy (and thus never modify the original DataFrame when it is itself mutated). Using the concept of copy-on-write, we can make this aspect of pandas easier to grasp, and at the same time make pandas more memory-efficient.
In this talk, I will give a brief background on the current internals of pandas related to copies and views and why we have the SettingWithCopyWarning. Then, I will explain the proposal to greatly simplify the rules around copy and view semantics in pandas, and how we can get rid of the SettingWithCopyWarning.
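The copy-on-write idea itself can be shown with a toy wrapper, independent of pandas' internals. This is a conceptual sketch only, not the proposal's implementation: two objects share one buffer until one of them is mutated, at which point the writer copies first, so the other side never observes the change.

```python
class CowSeries:
    """Toy copy-on-write container: share data until first write."""

    def __init__(self, data, _shared=None):
        self._data = data              # possibly shared list
        self._owns = _shared is None   # do we own the buffer outright?

    def view(self):
        # Cheap "selection": shares the underlying buffer, no copy yet.
        return CowSeries(self._data, _shared=True)

    def __setitem__(self, i, value):
        if not self._owns:
            self._data = list(self._data)  # copy on first write
            self._owns = True
        self._data[i] = value

    def __getitem__(self, i):
        return self._data[i]

parent = CowSeries([1, 2, 3])
child = parent.view()  # no memory duplicated here
child[0] = 99          # triggers the copy...
assert parent[0] == 1  # ...so the parent is untouched
assert child[0] == 99
```

Under these semantics the behaviour is unambiguous (a derived object never mutates its parent), which is why the SettingWithCopyWarning becomes unnecessary.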
Joris Van den Bossche
2022-12-01T18:00:00+00:00
18:00
00:30
Talk Track II
cfp-87-parallelization-of-code-in-python-for-beginners
https://global2022.pydata.org//cfp/talk/XTR833/
false
Parallelization of code in Python for beginners
Talk
en
Stuck with long-running code that takes too long to complete, if ever? Learn to think strategically about parallelizing your workflows, including the characteristics that make a workflow a good candidate for parallelization as well as the options in python for executing parallelization. The talk eschews PySpark or other big data platforms.
Data scientists and engineers manipulating data in python can run into performance issues with executing calculations in sequence across a large dataset. One may desire quicker runtime to facilitate rapid prototyping, or may face compute resource shortages when processing a single large chunk of data all at once. This conceptually-oriented talk offers solutions to speed up slow-running workflows, as well as unblock constraints around computational resource capacity.
We explain the benefits of parallelized computing methods from a beginner’s perspective. We describe computing patterns for parallel computing, as well as how these match to practitioners’ workflows. Finally, we tie this in with working examples using the python package joblib, to show how this tool can be used to facilitate parallel computing. Along the way, we highlight typical examples matching problem to solution that should resonate with data scientists and engineers.
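As a minimal illustration of the joblib pattern described above (a sketch with arbitrary parameters, not taken from the talk):

```python
from joblib import Parallel, delayed

def expensive(x):
    # Stand-in for a slow per-record computation.
    return x * x

# Fan the work out over 2 workers; the threading backend keeps this
# tiny example free of process-startup and pickling overhead.
results = Parallel(n_jobs=2, backend="threading")(
    delayed(expensive)(x) for x in range(10)
)
assert results == [x * x for x in range(10)]
```

`Parallel` preserves input order in its results, which is one reason it maps so cleanly onto "apply this function to every chunk of my dataset" workflows.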
The goal of this talk is to empower relative beginners with fundamental knowledge about basic parallel computing concepts that can enable them to be more productive, as well as to understand why their code performance degrades and what levers exist to tune performance.
This talk is appropriate for analytics practitioners who have medium to large data needs but who don't want to deal with big data platforms just yet.
Cheryl Roberts
2022-12-01T18:30:00+00:00
18:30
00:30
Talk Track II
cfp-91-single-node-shared-memory-comes-to-dask
https://global2022.pydata.org//cfp/talk/JSXXGE/
false
Single node shared memory comes to dask
Talk
en
The Ray project has shown that a shared memory facility greatly helps in certain compute problems, particularly where the job can be performed on a single large machine as opposed to a cluster. We present preliminary work showing that Dask can achieve the same benefits.
Because of the global interpreter lock in python, it is not possible to parallelise all jobs across the threads of a single process. This is a real shame, because within a process, objects can be shared between threads by reference, without copy. So a typical dask cluster will consist of a number of processes with a few threads each, even on a large single node system with enough global memory for the problem at hand. This means expensive duplication of objects across processes, wasting memory resources and CPU time in serialisation and interprocess communication. As very large machines are as easy to obtain as clusters of smaller ones in today's cloud computing environments, Dask needs to better service such large single-node jobs by providing shared memory.
One of the strengths of the Ray compute model is the ability to use POSIX shared memory across processes, via the Apache Arrow Plasma library. Although it adds complexity to the system, the savings in memory usage, serialisation, and communication are significant enough that this is seen as one of the real advantages of Ray.
In this talk, we will present prototype shared memory solutions for Dask. We will demonstrate potential backends (lmdb, plasma, vineyard, and multiprocessing.SharedMemory) and show that we really do achieve the memory use and efficiency savings envisioned. We will compare the performance and features of the proposed backends.
Finally, we will look ahead to the possible extension to shared memory on multi-node systems, and possible smart spilling that a central memory manager makes possible.
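One of the candidate backends mentioned, the standard library's multiprocessing shared memory, already shows the core idea. The sketch below attaches a second handle by name within one process for brevity; real Dask workers would attach from separate processes the same way.

```python
from multiprocessing import shared_memory

# Create a named shared block and write into it.
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b"hello"

# A second handle (e.g. in another worker process) attaches by name
# and sees the same bytes with no copy and no serialisation.
peer = shared_memory.SharedMemory(name=shm.name)
data = bytes(peer.buf[:5])
print(data)  # b'hello'

peer.close()
shm.close()
shm.unlink()  # free the block once all handles are closed
```

Large arrays shared this way cost memory once instead of once per worker process, which is exactly the saving the talk quantifies.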
Martin Durant
2022-12-01T20:00:00+00:00
20:00
00:30
Talk Track II
cfp-305-running-apache-airflow-at-scale
https://global2022.pydata.org//cfp/talk/J7NHFJ/
false
Running Apache Airflow at Scale
Talk
en
Apache Airflow is a foundational component of data platform orchestration at Shopify. In this talk, we'll dive into the many performance and reliability challenges we’ve encountered running Airflow at Shopify’s scale, our custom tooling, and the new multi-instance architecture we rolled out.
Along the way we'll share our tips and lessons learned so you can run Airflow at scale, too.
Jean-Martin ArcherMichael Petro
2022-12-01T20:30:00+00:00
20:30
01:30
Talk Track II
cfp-306-lightning-talks
https://global2022.pydata.org//cfp/talk/AEWSCP/
false
Lightning Talks
Lightning Talk
en
<b>Lightning Talks are short 5-10 minute sessions presented by community members on a variety of interesting topics.</b>
<b>Order of Presentations</b>
1. OpenTeams Score: A Way to Assess Open Source Projects by Fatma Tarlaci, Brian Skinn, Dale Tovar
2. k-NN on steroids - an introduction to approximate nearest neighbours by Kacper Łukawski
3. Introducing Meadowrun: write code locally, run on the cloud by Kurt Schelfthout, Richard Lee
4. It might look normal but this distribution will ruin your stats by Allan Campopiano
5. Everything That You Wanted To Know About P-Values But Were Afraid To Ask by Eyal Kazin
6. Quokka: Rewriting SparkSQL in Python by Ziheng Wang
7. Lessons learned from using the C Foreign Function Interface to integrate neural networks into Earth system models by Caroline Arnold
Brian SkinnKacper ŁukawskiKurt SchelfthoutRichard LeeAllan CampopianoEyal KazinZiheng WangCaroline Arnold
2022-12-01T22:00:00+00:00
22:00
00:30
Talk Track II
cfp-313-revolutionizing-the-big-data-age-with-compute-over-data
https://global2022.pydata.org//cfp/talk/F9AQCV/
false
Revolutionizing the Big Data Age With Compute over Data
Talk
en
Introducing a new project, Compute over Data (Bacalhau), to run any computation on decentralized data. No need to move large datasets & all languages/data are supported. If you can run Docker/WASM, you're in the game!
Bacalhau is a decentralized public computation network that takes a job and moves it to where the data is stored, running it inside a decentralized network of servers that both store data and execute jobs. By running jobs near the data, Bacalhau eliminates data management for the user.
speaker: David Aronchick
David Aronchick
2022-12-01T22:30:00+00:00
22:30
00:30
Talk Track II
cfp-53-object-detection-with-kerascv
https://global2022.pydata.org//cfp/talk/HH7MMC/
false
Object Detection with KerasCV
Talk
en
KerasCV offers a complete set of APIs to train your own state-of-the-art, production-grade object detection model. These APIs include object-detection-specific data augmentation techniques, models, and COCO metrics. This talk covers how to train a RetinaNet on your own dataset using KerasCV.
Lucas Wood
2022-12-01T09:30:00+00:00
09:30
01:30
Workshop/Tutorial I
cfp-138-full-stack-machine-learning-for-data-scientists
https://global2022.pydata.org//cfp/talk/CLKMWR/
false
Full-stack Machine Learning for Data Scientists
Tutorial
en
One of the key questions in modern data science and machine learning, for businesses and practitioners alike, is how do you move machine learning projects from prototype and experiment to production as a repeatable process. In this workshop, we present a hands-on introduction to the landscape of production-grade tools, techniques, and workflows that bridge the gap between laptop data science and production ML workflows. Participants will learn how to take common machine learning models, such as those from scikit-learn, XGBoost, and Keras, and productionize them using Metaflow.
We’ll present a high-level overview of the 8 layers of the ML stack: data, compute, versioning, orchestration, software architecture, model operations, feature engineering, and model development. We’ll present a schematic as to which layers data scientists need to be thinking about and working with, and then introduce attendees to the tooling and workflow landscape. In doing so, we’ll present a widely applicable stack that provides the best possible user experience for data scientists, allowing them to focus on parts they like (modeling using their favorite off-the-shelf libraries) while providing robust built-in solutions for the foundational infrastructure.
You can find the companion repository for the workshop here: https://github.com/outerbounds/full-stack-ML-metaflow-tutorial.
Session Outline
Lesson 1: Machine learning workflows and DAGs
This lesson will focus on building local machine learning workflows using Metaflow, although the high-level concepts taught will be applicable to any workflow orchestrator. Attendees will get a feel for writing flows and DAGs to define the steps in their workflows. We'll also use DAG cards to visualize our ML workflows. This lesson uses local computation; in the next lesson, we'll burst to the cloud.
Lesson 2: Bursting to the Cloud
In this lesson, we’ll see how we can move ML steps or entire workflows to the cloud from the comfort of our own IDE. In this case, we’ll be using AWS Batch compute resources, but the techniques are generalizable.
Lesson 3 (optional and time permitting): Integrating other tools into your ML pipelines
We’ll also see how to begin integrating other tools into our pipelines, such as dbt for data transformation, great expectations for data validation, Weights & Biases for experiment tracking, and Amazon Sagemaker for model deployment. Once again, the intention is not to tie us to any of these tools, but to use them to illustrate various aspects of the ML stack and to develop a workflow in which they can easily be switched out for other tools, depending on where you work and who you’re collaborating with.
Hugo Bowne-Anderson
2022-12-01T11:30:00+00:00
11:30
01:30
Workshop/Tutorial I
cfp-60-sktime-python-toolbox-for-time-series-pipelines-and-transformers
https://global2022.pydata.org//cfp/talk/3TXUMK/
false
sktime - python toolbox for time series: pipelines and transformers
Tutorial
en
sktime is a widely used scikit-learn compatible library for learning with time series. sktime is easily extensible by anyone, and interoperable with the pydata/numfocus stack. sktime has a rich framework for building pipelines across the multiple learning tasks it supports, including forecasting, time series classification, regression, and clustering. This tutorial explains basic and advanced sktime pipeline constructs, and introduces in detail the time series transformer, which is the main component in all types of pipelines. It is a continuation of the sktime introductory tutorial at pydata global 2021.
In time series analysis, multiple, sometimes repetitive, algorithmic steps are often applied to the data. Organising these steps in a clear way enables flexible deployment on multiple data sets and makes results easy to reproduce. Pipelines offer a solution to this challenge by providing a structure for building flexible sequences of time series algorithms. The modular building blocks of pipelines are "transformers" or "transformations" (in the scikit-learn sense) as well as estimators specific to learning tasks, such as forecasters or time series classifiers. A challenge in learning with time series is the many different types of transformations, such as:
* transformers of a time series to time series, e.g., differencing and detrending
* transformers of a time series to a row of primitive features/values in a data frame, e.g., time series summary
* transformers of a time series to a panel of time series, e.g., bootstrap, sliding window
* transformers that apply to hierarchical time series, e.g., reconciliation or hierarchical aggregation
* transformers of a pair of time series to a real number, e.g., time series distances or kernels
sktime provides a framework to distinguish the above, and to use transformers of the various types as components in different types of pipelines, such as:
* forecasting pipelines, with transformers applied to endogeneous, exogeneous, or output data,
* time series classification pipelines, with transformers applied to inputs,
* compositor pipelines for time series distances or parameter estimators,
* specialized reduction steps consuming different types of transformers and machine learning estimators,
* and many more.
The design challenge is to formalize transformers in a way that a given type of transformer can be used in multiple types of pipeline, and to create pipelines that can use multiple types of transformers. sktime solves this challenge through the "scientific type" formalism, which applies object-orientation-based typing to the transformers and their inputs/outputs. The presentation will also briefly touch on advanced pipelining concepts such as graph pipelines, and on roadmap items inviting contributions.
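The series-to-series versus series-to-primitives distinction can be illustrated in plain Python (this is a conceptual sketch, not sktime's actual class hierarchy): both transformers expose the same fit/transform interface, but their output types differ, which is exactly what the scitype formalism makes explicit for pipeline composition.

```python
class Detrender:
    """Series-to-series: the output is again a time series."""

    def fit(self, y):
        self.mean_ = sum(y) / len(y)
        return self

    def transform(self, y):
        return [v - self.mean_ for v in y]

class Summarizer:
    """Series-to-primitives: the output is a row of scalar features."""

    def fit(self, y):
        return self

    def transform(self, y):
        return {"mean": sum(y) / len(y), "max": max(y)}

y = [1.0, 2.0, 3.0, 4.0]
detrended = Detrender().fit(y).transform(y)   # still a series
features = Summarizer().fit(y).transform(y)   # a row of primitives
assert len(detrended) == len(y)
assert features == {"mean": 2.5, "max": 4.0}
```

A forecasting pipeline can chain a Detrender-like step before the forecaster, while a classification pipeline needs a Summarizer-like step to produce tabular features; typing the outputs is what lets a framework check such compositions.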
Franz KiralyBenedikt HeidrichMirae L ParkerMartin Walter
2022-12-01T15:00:00+00:00
15:00
01:30
Workshop/Tutorial I
cfp-41-bayesian-decision-analysis
https://global2022.pydata.org//cfp/talk/LRRXLV/
false
Bayesian Decision Analysis
Tutorial
en
This tutorial is a hands-on introduction to Bayesian Decision Analysis (BDA), which is a framework for using probability to guide decision-making under uncertainty. I start with Bayes's Theorem, which is the foundation of Bayesian statistics, and work toward the Bayesian bandit strategy, which is used for A/B testing, medical tests, and related applications. For each step, I provide a Jupyter notebook where you can run Python code and work on exercises. In addition to the bandit strategy, I summarize two other applications of BDA, optimal bidding and deriving a decision rule. Finally, I suggest resources you can use to learn more.
Outline
* Problem statement: A/B testing, medical tests, and the Bayesian bandit problem
* Prerequisites and goals
* Bayes's theorem and the five urn problem
* Using Pandas to represent a PMF
* Notebook 1: Estimating proportions
* From belief to strategy: Thompson sampling
* Notebook 2: Implementing and testing Thompson sampling
* Debrief: why Bayesian decision analysis is better
* More generally: two other examples of BDA
* Resources and next steps
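The Thompson sampling step from the outline can be sketched in a few lines of NumPy. This is a minimal Bernoulli-bandit version with made-up arm rates, independent of the tutorial's notebooks:

```python
import numpy as np

rng = np.random.default_rng(42)
true_rates = [0.3, 0.5, 0.7]   # hidden conversion rates per arm
wins = np.ones(3)               # Beta(1, 1) uniform priors
losses = np.ones(3)

for _ in range(2000):
    # Sample a plausible rate for each arm from its Beta posterior...
    samples = rng.beta(wins, losses)
    arm = int(np.argmax(samples))   # ...and play the best-looking arm.
    reward = rng.random() < true_rates[arm]
    wins[arm] += reward
    losses[arm] += 1 - reward

# After enough rounds, the posterior means favour the best arm.
posterior_means = wins / (wins + losses)
print(int(np.argmax(posterior_means)))
```

Because each arm is chosen with probability proportional to its chance of being best under the current posterior, the strategy balances exploration and exploitation without any tuning parameter.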
Prerequisites
For this tutorial, you should be familiar with Python at an intermediate level. We'll use NumPy, SciPy, and Pandas, but I'll explain what you need to know as we go. You should be familiar with basic probability, but you don't need to know anything about Bayesian statistics.
I'll provide Jupyter notebooks that run on Colab, so you don't have to install anything or prepare ahead of time. But you should be familiar with Jupyter notebooks.
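The Bayesian bandit strategy covered in the notebooks centers on Thompson sampling. As a rough, stdlib-only illustration (not the tutorial's own Pandas-based notebooks): each arm keeps a Beta posterior over its success rate, and on each round we pull the arm whose posterior sample is highest.

```python
import random

def thompson_sample(successes, failures):
    """Pick an arm by sampling each arm's Beta posterior
    (uniform Beta(1, 1) prior) and taking the highest draw."""
    draws = [random.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])

# Simulate a two-armed bandit with true success rates 0.3 and 0.6.
random.seed(42)
true_rates = [0.3, 0.6]
successes, failures = [0, 0], [0, 0]
for _ in range(2000):
    arm = thompson_sample(successes, failures)
    if random.random() < true_rates[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

# The better arm should accumulate the vast majority of the pulls.
pulls = [s + f for s, f in zip(successes, failures)]
print(pulls)
```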
/media/cfp/submissions/LRRXLV/Screenshot_at_2022-08-23_18-15-36_bIolgHg.png
Allen Downey
2022-12-01T11:30:00+00:00
11:30
01:30
Workshop/Tutorial II
cfp-200-building-data-products-in-a-lakehouse-using-trino-dbt-and-dagster
https://global2022.pydata.org//cfp/talk/GQSGAD/
false
Building Data Products in a Lakehouse using Trino, dbt, and Dagster
Tutorial
en
Build data pipelines using Trino and dbt, combining heterogeneous data sources without having to copy everything into a single system. Manage access to your data products using modern and flexible security principles from authentication methods to fine-grained access control. Run and monitor your data pipelines using Dagster.
Data engineers struggle with enormous amounts of data and a lack of management, governance, and transparency. The variety of data sources and formats can be overwhelming and requires working with many different tools. This tutorial shows how to use open source technologies to resolve common data problems in organizations. Using Trino/Starburst as a Data Mesh platform can help businesses standardize processes and have a single platform providing a holistic view over different data sources, like NoSQL, SQL, and data lakes.
Build data pipelines over different data sources using standard SQL, without having to move the data into one single system.
Breakdown:
- Intro to Trino
- Architecture
- Data federation
- Demonstrate using Docker
- Business case
- ecommerce
- Building a data pipeline spanning multiple data sources using dbt and Trino
- Build a simple data product using dbt
- Incremental loads (snapshot and merge)
- Make your data pipeline observable using Dagster
- Model your pipeline in Dagster
- Run your pipeline
- Monitor your pipeline
Przemysław Denkiewicz, Michiel De Smet
2022-12-01T15:00:00+00:00
15:00
01:30
Workshop/Tutorial II
cfp-270-anomaly-detection-on-streaming-data-in-python-using-bytewax-and-river
https://global2022.pydata.org//cfp/talk/WQJPKJ/
false
Anomaly Detection on Streaming Data in Python using Bytewax and River
Tutorial
en
Bytewax is an open source, Python-native framework and distributed processing engine for processing data streams that makes it easy to build everything from pipelines for anonymizing data to more sophisticated systems for fraud detection, personalization, and more. For this tutorial, we will cover how you can use Bytewax and the Python library River to build an online machine learning system that will detect anomalies in IoT data from streaming systems like Kafka and Redpanda. This tutorial is for data scientists, data engineers, and machine learning engineers interested in machine learning and streaming data. At the end of the tutorial session you will know how to:
- run a streaming platform like Kafka or Redpanda in a docker container,
- develop a Bytewax dataflow
- run a River anomaly detection algorithm to detect anomalous data
The tutorial material will be available via a GitHub Repo and the content will be covered in roughly the timeline shown below.
- 0-10min - Introduction to stream processing and online machine learning
- 10-30min - Setup streaming system and prepare the data
- 30-60min - Write the Bytewax dataflow and anomaly detector code
- 60-90min - Tune the anomaly detector and run the Bytewax dataflow successfully.
The proliferation of connected devices, from smart appliances to connected cars, has created a landslide of data. Detecting whether or not these (sometimes) massive networks of connected devices are functioning properly in real time can be beyond the capability of humans. Analyzing whether there is a problem is a challenge not only of volume but also of changing environmental variables.
To help us build systems that can scale to the volume of data and that can handle the changing environments, we can use two Python tools - Bytewax and River. Bytewax is a stateful data processing framework and engine and will allow us to scale our processing to meet the volume requirements through parallelization. River is a Python library focused on online machine learning where the model is updated incrementally and stored in state. The algorithms used are particularly well suited for dynamic environments.
In this tutorial session, you will get a better understanding of how you can use online machine learning algorithms to detect anomalies across hundreds of sensors. This session will guide you through how to set up a development environment with a streaming system (Kafka or similar), load sensor data to the streaming system with Bytewax, and write a dataflow that will transform the data and use different anomaly detection algorithms to determine if there are anomalies in the sensor data.
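To give a flavor of the learn-one/score-one style of online anomaly detection used by libraries like River, here is a simplified stand-in detector based on a running z-score. This is illustrative only; the tutorial itself uses River's own algorithms (such as HalfSpaceTrees) inside a Bytewax dataflow.

```python
import math

class RunningZScoreDetector:
    """Toy online anomaly detector mimicking the learn_one/score_one
    pattern of online ML libraries. The running mean and variance are
    updated incrementally with Welford's algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def learn_one(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def score_one(self, x):
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

detector = RunningZScoreDetector()
stream = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 35.0]  # last reading is anomalous
scores = []
for reading in stream:
    scores.append(detector.score_one(reading))  # score before learning
    detector.learn_one(reading)
print(scores[-1])
```

In a real dataflow, the score/learn pair would run inside a stateful Bytewax step keyed per sensor.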
/media/cfp/submissions/WQJPKJ/icon_transparent_tfHJrGg.png
Zander
2022-12-01T15:00:00+00:00
15:00
01:00
Community Events & Sponsor Sessions
cfp-312-executives-at-pydata
https://global2022.pydata.org//cfp/talk/DP7GJC/
false
Executives at PyData
Talk
en
Executives at PyData is a facilitated discussion session for leaders on the challenges around designing and delivering successful projects, organizational communication, product management and design, hiring, and team growth.
<b>[Join here](https://numfocus-org.zoom.us/j/81173613104?pwd=R1ZveFNkRit3ZnFDWkdlR1FGZFJEdz09)</b>
This 2-hour discussion session will answer a set of critical questions in the first hour and then we’ll switch to an open discussion with invited guest Douglas Squirrel for our second hour.
The recording of the sessions will be public (just like all the conference videos).
A write-up of the key points, tools, and processes will be made available to all attendees after the conference; this will be a valuable takeaway from the conference.
Organized by Ian Ozsvald (London) and Lauren Oldja (New York).
Douglas Squirrel – coach and consultant to tech teams, helping make tech teams insanely profitable since 2001
Ian Ozsvald – Chief Data Scientist, strategic advisor to data science teams, co-founder of the PyDataLondon meetup and conference series, author of the Successful Data Science Projects course and PyData conference talks, author of O’Reilly’s High-Performance Python (2nd ed)
Lauren Oldja – PyData NYC Executive Chair, Principal Data Scientist, experienced manager of multi-stakeholder, multi-country technical projects
Ian Ozsvald, Lauren Oldja, Douglas Squirrel
2022-12-01T16:00:00+00:00
16:00
00:30
Community Events & Sponsor Sessions
cfp-314-openteams-ama-with-travis-oliphant-lalitha-krishnamoorthy-fatma-tarlaci
https://global2022.pydata.org//cfp/talk/NFQFLS/
false
OpenTeams’ AMA with Travis Oliphant, Lalitha Krishnamoorthy & Fatma Tarlaci
Talk
en
We’re on a global mission to make open source software thrive and be more sustainable—from supporting open source contributors in their career paths with our Open Source Professional Network (OSPN) to helping organizations transform their business with support from our vetted network of enterprise solution architects (ESA Network) to helping our clients select the right open source software stack for their business challenge by leveraging our AI-driven scoring system. Please join us during Sponsor Open Hours to learn more and ask us anything about open source.
With Travis Oliphant, Lalitha Krishnamoorthy & Fatma Tarlaci
<b>Join here</b> - https://numfocus-org.zoom.us/j/88281360503?pwd=cFdsaGd4N3FoQWRDdlJHZmxSM0JaUT09
2022-12-01T20:30:00+00:00
20:30
01:00
Community Events & Sponsor Sessions
cfp-316-apache-airflow-at-scale-let-s-discuss
https://global2022.pydata.org//cfp/talk/RDJWH8/
false
Apache Airflow at Scale: Let's Discuss
Talk
en
Apache Airflow is a foundational component of data platform orchestration at Shopify. Following the main talk, this session is scheduled for you to ask questions and discuss running Airflow at scale with Jean-Martin Archer, Staff Data Engineer at Shopify, and Michael Petro, Data Engineer at Shopify.
Join the team from Shopify for this open discussion.
Jean-Martin Archer, Michael Petro
2022-12-02T08:00:00+00:00
08:00
00:30
Talk Track I
cfp-103-bastionai-towards-an-easy-to-use-privacy-preserving-deep-learning-framework
https://global2022.pydata.org//cfp/talk/CSPMJD/
false
BastionAI: Towards an Easy-to-use Privacy-preserving Deep Learning Framework
Talk
en
We present BastionAI, a new framework for privacy-preserving deep learning leveraging secure enclaves and Differential Privacy.
We provide promising first results on fine-tuning a BERT model on the SMS Spam Collection Data Set within a secure enclave with Differential Privacy.
The library is available at https://github.com/mithril-security/bastionai.
Confidential training of deep learning models on sensitive data, possibly between multiple data owners such as hospitals or banks, is still a major milestone toward the adoption of AI in critical industries. Federated Learning approaches have been proposed but hardware and software deployment complexity on each node, communication cost, and high level of Differential Privacy noise (if any) in decentralized setups make them difficult to adopt in practice.
We present BastionAI, a new framework for privacy-preserving deep learning leveraging secure enclaves and Differential Privacy. The library departs from traditional Decentralized Federated Learning and proposes a Fortified Learning approach, where computations are centralized in a Trusted Execution Environment. This allows for faster training and reasonable Differential Privacy noise for the same budget, and simplifies deployment, as each participating node only needs a lightweight interface to check the security features of the remote enclave.
We provide promising first results on fine-tuning a BERT model on the SMS Spam Collection Data Set within a secure enclave with Differential Privacy.
The library is available at https://github.com/mithril-security/bastionai.
/media/cfp/submissions/CSPMJD/BastionAI_-_Sch%C3%A9mas_zZnkLT0.jpg
Daniel Huynh
2022-12-02T08:30:00+00:00
08:30
00:30
Talk Track I
cfp-44-ml-in-production-what-does-production-even-mean-
https://global2022.pydata.org//cfp/talk/XTYXRG/
false
ML in Production – What does “Production” even mean?
Talk
en
We like talking about production – one famous, but probably wrong statement about it is “87% of data science projects never make it to production”.
While giving a talk to a group of up-and-coming data scientists, a question that surprised me came up:
**When you say “production”, what exactly do you mean?**
Buzzwords are great, but all the cool kids know what production is, right? Wrong.
In this talk, we’ll define what production actually means. I’ll present a first-principles, step-by-step approach to thinking about deploying a model to production. We’ll talk about challenges you might face in each step, and provide further reading if you want to dive deeper into each one.
This talk will cover the following topics:
- Defining ML in production from first principles
- What types of production exist other than deployment
- The difference between model deployment and pipeline deployment
- Explaining the case we’ll focus on - Deploying a single model to production, which receives data, makes a prediction and returns that prediction to an accessible location.
- “Easy” deployment solutions (Streamlit, Gradio) and their limitations
- Breakdown of stages:
1. Creating code that takes a trained model, receives data, predicts and returns the prediction
2. Wrap that in an API, which can receive the data via some request (HTTP), and returns the prediction in some standard format (JSON)
1. Caveats for authentication
3. Put that API in an environment that enables it to be portable and run across hardware types (e.g. Docker)
4. Provide infrastructure, via the cloud, and run the container to listen for requests
1. Caveats for GPUs
- Further reading
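Stages 1 and 2 above can be sketched in a few lines. The model and field names here are hypothetical stand-ins; in practice the model would be a loaded artifact (e.g. unpickled from disk) and the JSON layer would sit behind an HTTP framework such as FastAPI.

```python
import json

# Stage 1: code that takes a trained model, receives data, and predicts.
# (A trivial linear model stands in for a real trained artifact.)
def predict(model, features):
    return sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]

# Stage 2: wrap the prediction in a request/response layer speaking a
# standard format (JSON). A real deployment would expose this over HTTP.
def handle_request(model, request_body: str) -> str:
    payload = json.loads(request_body)
    prediction = predict(model, payload["features"])
    return json.dumps({"prediction": prediction})

model = {"weights": [0.5, -1.0], "bias": 2.0}
response = handle_request(model, '{"features": [4.0, 1.0]}')
print(response)
```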
Dean Pleban
2022-12-02T09:00:00+00:00
09:00
00:30
Talk Track I
cfp-10-don-t-stop-til-you-get-enough-hypothesis-testing-stop-criterion-with-precision-is-the-goal-
https://global2022.pydata.org//cfp/talk/CW9ZZX/
false
Don't Stop 'til You Get Enough - Hypothesis Testing Stop Criterion with “Precision Is The Goal”
Talk
en
In hypothesis testing the stopping criterion for data collection is a non-trivial question that puzzles many analysts. This is especially true with sequential testing, where demands for quick results may lead to biased ones.
I show how the belief that Bayesian approaches magically resolve this issue is misleading and how to obtain reliable outcomes by focusing on sample precision as a goal.
Hypothesis testing may come off as a dark art. On the one hand, data collection is expensive. On the other, small data sets may not yield enough statistical significance to draw meaningful conclusions. Combining these constraints with stakeholder requirements for quick answers from data makes the task of choosing the sample size stopping criterion a challenging balancing act.
This is especially true if the data is collected in a sequential manner, where a person, or an algorithm needs to determine when to stop collecting data to satisfy the project requirements without introducing confirmation bias.
This talk is targeted to anyone involved in experimentation, technical or managerial, and is interested in improving how they plan an experiment budget and conduct post data collection interpretation. A basic understanding of statistics and hypothesis testing experience are nice-to-haves but not essential as I will outline the basics.
In this talk you will learn why, even though Bayesian approaches are more reliable than Frequentist ones for small data sets, they do not magically solve the problem of confirmation bias. This will be followed by an introduction to John Kruschke's “Precision is the Goal” method, in which determining the experiment's expected precision level in advance yields robust results.
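As a rough, stdlib-only sketch of the "precision as a stopping rule" idea (using the posterior standard deviation as a simple proxy for Kruschke's HDI-width criterion): data collection stops when the Beta posterior over a conversion rate is narrow enough, not when a significance threshold is crossed.

```python
import math
import random

def beta_posterior_sd(successes, failures):
    """Standard deviation of the Beta(successes+1, failures+1)
    posterior over a rate, starting from a uniform prior."""
    a, b = successes + 1, failures + 1
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Collect observations until the posterior is "precise enough".
# (Illustrative; the talk's method targets the width of the 95% HDI.)
random.seed(0)
target_sd = 0.02
true_rate = 0.4
successes = failures = 0
while beta_posterior_sd(successes, failures) > target_sd:
    if random.random() < true_rate:
        successes += 1
    else:
        failures += 1

n = successes + failures
print(n, successes / n)
```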
/media/cfp/submissions/CW9ZZX/Screenshot_2022-07-29_at_13.34.09_WmuKmaF.png
Eyal Kazin
2022-12-02T09:30:00+00:00
09:30
00:30
Talk Track I
cfp-129-supercharging-your-pandas-workflows-with-modin
https://global2022.pydata.org//cfp/talk/3GZMQP/
false
Supercharging your pandas workflows with Modin
Talk
en
Data practitioners are typically forced to choose between tools that are either easy to use (pandas) or highly scalable (Spark, SQL, etc.). Modin, an open source project originally developed by researchers at UC Berkeley, is a highly scalable, drop-in replacement for pandas.
This talk will give an overview of Modin and practical examples on how to use it to effortlessly scale up your pandas workflows.
pandas is one of the most popular data science libraries, with between 5-10M users, used by data scientists to clean, analyze, featurize, explore, transform, and model data. However, pandas breaks down at scale, making it difficult for end users to move beyond small, toy datasets and generalize their insights.
Modin is a highly scalable, drop-in replacement for pandas. The open source project has been downloaded 4 million times and is in use by teams in the Fortune 500 as well as high growth technology companies. Grounded in years of research and development at UC Berkeley’s RISE lab, Modin eliminates the complexity of working directly with distributed systems and lets users continue to use the pandas syntax at massive scale.
This talk will give you an overview of Modin and walk you through practical examples so you can try it yourself. Our demo will explore the use of Modin in conjunction with the popular HuggingFace NLP Transformer library.
To learn more about the project visit https://github.com/modin-project/modin
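Modin's drop-in claim means existing pandas code should run unchanged after swapping the import. A minimal sketch of the documented pattern, with a plain-pandas fallback so it runs even where Modin is not installed:

```python
# The documented Modin usage is to change "import pandas as pd" to
# "import modin.pandas as pd"; everything downstream stays the same.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd  # fallback: identical API, single-threaded

df = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
totals = df.groupby("group")["value"].sum()
print(totals.to_dict())
```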
Alejandro Herrera
2022-12-02T10:00:00+00:00
10:00
00:30
Talk Track I
cfp-90-why-we-do-ml-model-retraining-wrong-and-how-to-do-better
https://global2022.pydata.org//cfp/talk/CBK8RJ/
false
Why we do ML model retraining wrong, and how to do better
Talk
en
Machine learning models degrade with time. You need to update and retrain them regularly. However, the decision on the maintenance approach is often arbitrary, and the models are simply retrained on a schedule or after every new batch. This can lead to suboptimal performance or wasted resources. In this talk, I will discuss how we can do better: from estimating the speed of the model decay in advance to constructing a proper evaluation set.
Once you create a machine learning model and put it into production, the work does not stop. The model quality might degrade in time, and you need to keep an eye on it and retrain or update the models accordingly.
However, data scientists often do not give much thought to this maintenance process. Some models are never updated, while others are updated on an arbitrary schedule or after every new batch of data arrives. Each approach has its pitfalls. You might keep an underperforming model in production without knowing it, leave potential for model improvement on the table, or waste the resources on an unnecessary update.
Based on the experience of deploying and maintaining ML models in production, I will discuss different approaches to model retraining and how we can improve:
* Schedule-based retraining. How to estimate the optimal retraining cadence in advance by performing experiments on the training data.
* Trigger-based retraining. What is wrong with blind model retraining once you detect drift, and how to assess if your new data batch is good enough.
* Model retraining process. How to decide between model retraining and an update, whether you should drop the old data, and how to construct a proper evaluation set.
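To illustrate the trigger-based approach (a generic sketch, not the speaker's specific method): a common drift signal for a feature is the Population Stability Index, with PSI > 0.2 as a frequently cited rule-of-thumb retraining trigger.

```python
import math
import random

def psi(expected, actual, bins=10, lo=0.0, hi=1.0):
    """Population Stability Index between two samples of a bounded
    feature. Higher means the distributions have drifted apart."""
    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / (hi - lo) * bins), bins - 1)
            counts[idx] += 1
        # Smooth to avoid log(0) on empty bins.
        return [(c + 0.5) / (len(sample) + 0.5 * bins) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

random.seed(1)
reference = [random.random() for _ in range(5000)]     # training distribution
stable = [random.random() for _ in range(5000)]        # same distribution
shifted = [random.random() ** 2 for _ in range(5000)]  # drifted batch

print(psi(reference, stable), psi(reference, shifted))
```

The talk's point is precisely that such a trigger alone is not enough; the quality of the new batch matters too.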
Emeli Dral
2022-12-02T10:30:00+00:00
10:30
00:30
Talk Track I
cfp-105-super-search-with-opensearch-and-python
https://global2022.pydata.org//cfp/talk/YQUZFW/
false
Super Search with OpenSearch and Python
Talk
en
OpenSearch is an open source document database with search and aggregation superpowers, based on Elasticsearch. This session covers how to use OpenSearch to perform both simple and advanced searches on semi-structured data such as a product database.
OpenSearch is an open source document database with search and aggregation superpowers, based on Elasticsearch. This session covers how to use OpenSearch to perform both simple and advanced searches on semi-structured data such as a product database. Search is pretty useful inside applications, so we'll also discuss how to connect to OpenSearch from existing Python applications, work with data in the database, and perform search and aggregation queries from Python. This talk is recommended for Python developers whose applications are ready to gain some search superpowers.
Duration: 30 minutes
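For a taste of what such searches look like, OpenSearch queries are expressed as JSON bodies in the standard OpenSearch/Elasticsearch query DSL. Below is a hypothetical product-search body combining a full-text match, a price filter, and a category aggregation (index and field names are invented for illustration); sending it with the Python client would look roughly like `client.search(index="products", body=query)`.

```python
# Hypothetical product search: full-text match on the name, filtered
# by price, with a terms aggregation over categories.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"name": "wireless headphones"}}],
            "filter": [{"range": {"price": {"lte": 100}}}],
        }
    },
    "aggs": {
        "by_category": {"terms": {"field": "category.keyword"}},
    },
    "size": 10,
}
print(query["size"])
```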
/media/cfp/submissions/YQUZFW/python-opensearch-dive-in_0Tjxvhr.png
Laysa Uchoa
2022-12-02T11:00:00+00:00
11:00
00:30
Talk Track I
cfp-127-how-to-maximally-parallelize-the-entire-pandas-api
https://global2022.pydata.org//cfp/talk/XKTAWW/
false
How to maximally parallelize the entire pandas API
Talk
en
pandas has rapidly become one of the most popular tools for data analysis, but is limited by its inability to scale to large datasets. We developed Modin, a scalable, drop-in alternative to pandas, that preserves the dynamic and flexible behavior of pandas dataframes while enhancing the scalability.
This talk will walk you through our team’s research at UC Berkeley, which enabled the development of Modin. We’ll also discuss our latest publication at VLDB, which covers a novel approach to parallelization and metadata management techniques for dataframes.
pandas has rapidly become the tool of choice for most data scientists - its flexibility and compatibility with many other data science libraries enables rapid prototyping and iteration. In production, however, pandas often falls flat due to its inability to scale - pandas is single-threaded and only operates in memory. This introduces a barrier in the data science workflow: data scientists prefer to use pandas to author their data science workflows; however, to run the same workflows at scale, they need to be rewritten to use more traditional data analytic tools, like Spark or SQL. We have been developing Modin, a scalable drop-in replacement for pandas that allows users to operate on data at scale, without having to rewrite their pandas workflows.
However, pandas’ extensive API features around 600 operators, making it difficult to optimize at scale. To address the scaling challenges, we draw inspiration from relational databases to develop a dataframe algebra, which can be composed to express any pandas query. By designing an extensible translation layer from the pandas API to our dataframe algebra, we enable Modin to work on distributed data, as well as optimize queries to reduce latency. This modular architecture, combined with Modin’s decomposition rules for dataframes, and its metadata independence, allow it to deliver performance at scale.
This talk will discuss the architecture of Modin, as well as delve into the key research insights developed at UC Berkeley that enabled its creation. Systems Researchers and Data science practitioners can expect to learn about the core design principles underlying Modin, as well as get a deep dive into Modin as a system.
Rehan Durrani
2022-12-02T11:30:00+00:00
11:30
00:30
Talk Track I
cfp-82-exploring-feature-redundancy-and-synergy-with-facet-2-0-and-why-you-need-it-to-interpret-ml-models-correctly
https://global2022.pydata.org//cfp/talk/M99HFZ/
false
Exploring Feature Redundancy and Synergy with FACET 2.0 - and Why You Need It to Interpret ML Models Correctly
Talk
en
Understanding dependencies between features is crucial in the process of developing and interpreting black-box ML models. Mistreating or neglecting this aspect can lead to incorrect conclusions and, consequentially, sub-optimal or wrong decisions leading to financial losses or other undesired outcomes. Many common approaches to explain ML models – as simple as feature importance or more advanced methods such as SHAP – can yield misleading results if mutual feature dependencies are not taken into account.
In this talk we present FACET 2.0 - a new approach for global feature explanations using a new technique called SHAP vector projection, open-sourced at: https://github.com/BCG-Gamma/facet/.
Common black-box ML models do not provide explicit analysis of feature dependencies. This aspect of model interpretation is often neglected, which can lead to incorrect conclusions about the true contributions of features to individual predictions, or to the model as a whole.
FACET is an open-source ML library developed by BCG Gamma, with v2.0 about to be released with major enhancements. FACET uses a novel algorithm for global explanations of feature dependencies, addressing an important gap in existing approaches for black-box explanation methods. It introduces a measure of "redundancy" (how much information present in a feature is repeated in the other ones) and "synergy" (how much a given feature gains in predictive power when combined with other features).
Moreover, FACET is based on sklearndf, a library enhancing scikit-learn for full support of pandas DataFrames and keeping track of feature names across even complex ML pipelines.
/media/cfp/submissions/M99HFZ/GAMMA_BCGX_Logo_8g47kGq.png
Mateusz Sokół, Jan Ittner
2022-12-02T12:00:00+00:00
12:00
00:30
Talk Track I
cfp-296-machine-learning-frameworks-interoperability
https://global2022.pydata.org//cfp/talk/3PFAEF/
false
Machine Learning Frameworks Interoperability
Talk
en
To develop mature data science, machine learning, and deep learning applications, one must develop a large number of pipeline components, such as data loading, feature extraction, and frequently a multitude of machine learning models.
The complexity of those components frequently requires using a broad range of software components/tools, creating challenges during pipeline integration. We'll discuss zero-copy functionality across several GPU-accelerated and non-GPU-accelerated data science frameworks, including, among others, PyTorch, TensorFlow, Pandas, SciKit Learn, RAPIDS, CuPy, Numba, and JAX. Zero-copy avoids unnecessary data transfers, hence drastically reducing the execution time of your application. We'll also address memory layouts of the associated data objects in various frameworks, the efficient conversion of data objects using zero-copy, as well as using a joint memory pool when mixing frameworks.
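A CPU-only illustration of the zero-copy idea, using NumPy and the Python buffer protocol: the second array is a view over the same memory, so no bytes are moved. GPU-accelerated frameworks exchange data analogously through protocols such as `__cuda_array_interface__` and DLPack.

```python
import numpy as np

# Zero-copy hand-off: build a second array over the original buffer.
source = np.arange(5, dtype=np.float64)
view = np.frombuffer(memoryview(source), dtype=source.dtype)

view_before = view[0]
source[0] = 99.0  # mutate the original buffer...
print(view_before, view[0], np.shares_memory(source, view))
```

Because both objects share one buffer, the mutation through `source` is visible through `view` with no transfer at all.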
Christian Hundt, Miguel Martínez
2022-12-02T12:30:00+00:00
12:30
00:30
Talk Track I
cfp-85-lessons-learned-building-our-own-dashboard-solution-using-open-source-technologies
https://global2022.pydata.org//cfp/talk/DWNRP9/
false
Lessons Learned Building Our Own Dashboard Solution Using Open-Source Technologies
Talk
en
Most organisations have implemented some kind of dashboard to monitor their data, processes, or business. However, many dashboard solutions come with a caveat – licensing costs, a lack of transparency in the workflows, limited creativity, or the inability to connect them to existing infrastructure.
This talk is aimed at Data Scientists, Data Engineers, Data Practitioners and Managers struggling with choosing between a myriad of commercial dashboard solutions and DIY. We present how to create your own dashboard using open-source Python technologies like FastAPI, SQLAlchemy, and Celery and the challenges involved. We look back at the pitfalls and solutions we have worked on over the past 3 years. The goal is not to present our unique solution, but to show how we can combine different Python libraries to implement custom solutions to solve different use cases. Attendees should be familiar with the basic concepts of web infrastructure. Previous knowledge of any libraries is not required. We hope to provide a starting point to build your custom dashboard solution using open-source tooling.
In this presentation, we showcase the difficulties and challenges in creating a dashboard solution based on three real-world examples that offer an alternative to standard commercial dashboard products. We start our talk with an introduction to the problem that we tried to solve with the dashboard. We introduce custom dashboard solutions and offer attendees solutions to avoid our pitfalls. We structure our presentation on three broader topics based on the use case.
First, we discuss web frameworks and why we decided to work with FastAPI. We provide a short overview of different web frameworks and raise some questions that should be considered when starting a new project. Second, we explain how to link possible data sources and what we learned about using different Object-relational mapping (ORM) libraries. Namely, we discuss Orator ORM and SQLAlchemy. Thirdly, we analyze various performance issues and the usage of task queues to overcome performance problems. Finally, we provide a small peek into our dashboard solution and hope to start a small discussion and answer questions.
In minutes 0-5, we introduce the problem that we tried to solve. In minutes 5-10, we present how to find a web framework that matches your requirements. In minutes 10-15, we discuss how to link data sources (relational databases) to your dashboard. In minutes 15-20, we demonstrate how to use task queues to parallelize calculation tasks. In the next 2 minutes, we wrap up the discussion and provide a short overview of the resulting system. The remaining time is used for Q&A and hopefully a discussion on tools to build your own custom interactive reporting open-source solution.
Jan Dix, Zornitsa Manolova, Dominik Jany, Camille Koenders
2022-12-02T13:00:00+00:00
13:00
00:30
Talk Track I
cfp-93-converting-sentence-transformers-models-to-a-single-tensorflow-graph
https://global2022.pydata.org//cfp/talk/JRLBJQ/
false
Converting sentence-transformers models to a single tensorflow graph
Talk
en
Getting predictions from transformer models such as BERT requires two steps: first querying the tokenizer and then feeding its outputs to the deep learning model itself. These two parts of the model are kept in different classes in popular open source libraries like Huggingface Transformers and Sentence-Transformers. This works well within Python, but when one wants to put such a model in production or convert it to more efficient formats like ONNX that may be served by other languages, such as JVM-based ones, it is preferable and simpler (and less risky) to have a single artifact that is directly queried. This talk builds on the popular sentence-transformers library and shows how one can transform a sentence-transformer model into a single tensorflow artifact that can be queried with strings and is ready for serving. At the end of the talk the audience will get a better understanding of the architecture of sentence-transformers and the required steps for converting a sentence-transformer model to a single tensorflow graph. The code is released as a set of notebooks so that the audience can replicate the results.
This is my first PyData talk proposal and your feedback on its elements is most welcome. It is built on my lessons learned when trying to prepare a BERT-type model to be shipped in production. The code of the talk is ready and tested and will be published as notebooks as part of the talk. The code currently lives at my GitHub (https://github.com/balikasg/tf-exporter), the main functionality is in https://github.com/balikasg/tf-exporter/tree/master/src/tf_packager, and it is being developed further. The problem the talk aims to discuss is a common and interesting one without a complete solution yet; see for instance [1][2][3][4].
I will showcase how one can transform a sentence transformer model that consists of its custom tokenizer and the DL model itself to a single tf graph that can be queried directly with strings. This is particularly useful when one puts such a model in production as it reduces the probability of misalignment of the model’s tokenizer and transformer parts and also removes the need for constructing and maintaining two APIs (one for tokenization and another for querying the DL model).
There will be 4 main parts in the presentation that will use ~25 minutes leaving 5 minutes for questions. Throughout the talk, I will use sentence-transformers as the main framework and I will show how models residing in the sentence-transformers huggingface space or are trained (or fine-tuned) with sentence-transformers can be converted to a tf graph ready for serving. Here is the break-down of the talk:
0-5 min: Introduction. In this first part of the talk I will introduce myself and state the main problem. In particular, I will show that getting predictions from a transformer model requires two distinct steps, tokenization and querying the model, and I will discuss the advantages of having a single artifact when putting such a model in production. Note that while a single graph is not strictly required for serving, when serving a model we care a lot about reducing the risk of any failure. I expect this element of the talk to be really interesting for ML engineers in the audience, or folks who try to serve such models, because simplifying the process of querying the model is of utmost importance for them.
5-12 min: I will show what the actual model files and configs are that the sentence-transformers framework persists, which will motivate and explain the conversion steps. Essentially, a sentence-transformers model is a sequence of models. In the most minimal scenario it consists of a tokenizer and a transformer model, whereas in more elaborate model architectures we can have other types of layers (e.g., normalization layers, dense layers, LSTM layers, …) on top of the transformer outputs of the minimal example. I will use three models of increasing complexity as examples: [nq-distilbert-base-v1](https://huggingface.co/sentence-transformers/nq-distilbert-base-v1) (implements the minimal example), [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) (has normalization on top of nq-distilbert) and the [distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) model (has also a dense layer on top of the transformer output). Describing this will let the audience understand some of the architectural choices within sentence-transformers but also motivate the solution.
12-22 min: I will describe the solution, showing how each of the components of the example models can be converted to a tensorflow component and how all these tensorflow components can be tied together as a sequential neural network. The highlights of this part will be how to convert the tokenizer using native tf.text operations and how to convert a pytorch dense layer into a tensorflow dense layer with the same parameters. This is quite interesting because it touches on cross-framework model transformations (tensorflow vs pytorch), which is not trivial. Completing this description will clearly demonstrate the framework: to get a single tensorflow graph for a given model, it suffices to convert each of the components and tie them together in a sequential model.
22-25 min: As a conclusion to this talk I will show a demo where each of the aforementioned models will be converted to a single tf artifact. As a sanity check I will compare the predictions of the sentence-transformer models for example queries with those one gets from the complete graph. A public repository with the conversion notebooks will be shared with the audience.
As a side note, I hope to announce a pip package that generalizes the code of the talk, whose main functionality at that point will be to convert a sentence-transformers model to a single tensorflow graph that includes the tokenizer, the transformer, and any additional layers applied on top of it.
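The cross-framework dense-layer conversion mentioned in the 12-22 min segment mostly comes down to a weight-layout convention: PyTorch's `nn.Linear` stores its weight as `(out_features, in_features)` and computes `x @ W.T + b`, while a Keras `Dense` kernel has shape `(in_features, out_features)`. A minimal numpy sketch of that equivalence (illustrative only, not the talk's actual conversion code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))       # a batch of 4 token embeddings, dim 8

# PyTorch nn.Linear convention: weight shape (out_features, in_features),
# applied as x @ W.T + b.
w_torch = rng.normal(size=(3, 8))
b = rng.normal(size=(3,))
y_torch = x @ w_torch.T + b

# Keras Dense convention: kernel shape (in_features, out_features), applied
# as x @ W + b -- so converting the layer is a transpose of the weight.
w_keras = w_torch.T
y_keras = x @ w_keras + b

assert np.allclose(y_torch, y_keras)   # identical outputs, layouts aside
```
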
[1]https://github.com/kpe/bert-for-tf2/issues/75
[2]https://github.com/huggingface/transformers/issues/13985
[3]https://github.com/microsoft/onnxruntime-extensions/issues/164
[4]https://stackoverflow.com/questions/71035590/how-can-i-combine-a-huggingface-tokenizer-and-a-bert-based-model-in-onnx
Georgios Balikas
2022-12-02T13:30:00+00:00
13:30
00:30
Talk Track I
cfp-32-steering-a-data-science-project
https://global2022.pydata.org//cfp/talk/HDNA9X/
false
Steering a data science project
Talk
en
Starting a new data science project is an exciting time, full of possibilities for exotic models and incredible faraway features. However, this ocean of potentialities is treacherous, and the risks of veering off course are numerous.
This talk aims to provide a checklist to help you set a course for your data science project and keep to it. An industrial project about image pseudo-classification will be used as a working example.
The early stages of data science projects are full of the potentialities and wonders of discovery. Yet, to be able to see the end of a project, you need a clear vision of your final destination, some idea on how to reach it and methods to confirm you are on the right course.
While each project represents its own uncharted sea, there are a lot of tips, tricks and tools to navigate the unknown. I aim to compile a few of them in a checklist (PORTULAN) here and use an industrial image processing project as a practical application (retail products image embedding and comparison).
While this talk is data science focused, a lot of the tips are really project management best practices and apply outside the domain. People with experience of data science in an industry context are most likely to relate to the practical application.
What you can expect from this talk:
- a portulan as takeaway checklist to help steer a data project
- an example application of said checklist
- bad maritime puns
Morgane Mahaud
2022-12-02T14:00:00+00:00
14:00
00:30
Talk Track I
cfp-7-the-pythonic-common-chemical-universe
https://global2022.pydata.org//cfp/talk/SZXLKG/
false
The Pythonic Common Chemical Universe
Talk
en
The virtual chemical universe is expanding rapidly as open-access titan databases (the Enamine Database, 20 billion compounds; the ZINC Database, 2 billion; the PubMed Database, 68 million) and cheminformatic tools
to process, manipulate, and derive new compound structures are being established. We present our open-source knowledge graph, Global-Chem, written in Python to distribute dictionaries of common chemical lists relevant to different sub-communities to the general public, e.g. what is inside food, cannabis, sex products, chemical weapons, narcotics, or medical therapeutics.
To navigate new chemical space, we use our data as a reference index to help us keep track of common patterns of interest and explore new chemicals that could be theoretically real. In our talk, we will present the chemical data, the rules governing the data and its integrity, and how to use our tools to understand the chemical universe with Python.
Selecting chemical compounds requires expertise. Expertise is gained by experience and by studying a dedicated discipline. Dedicated disciplines most often have a set of common functional groups that are relevant to that community, which allows us to focus on compounds that are valuable. We do not need all the compounds, since many of them are not useful or not possible. In our talk, we describe how Global-Chem, an open-source knowledge graph, was developed to make it easier for scientists in both academia and industry to make their compounds of interest readily available to the scientific community in the form of objects that can be accessed directly from Python.
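As a hypothetical sketch of what "chemical lists as Python objects" can look like in practice: the node name and helper below are invented for illustration and are not Global-Chem's actual API, though the SMILES strings themselves are standard.

```python
# Invented sketch (not Global-Chem's real node layout or API) of curated
# name -> SMILES dictionaries exposed as plain Python objects.
common_organic_solvents = {
    "ethanol": "CCO",
    "acetic acid": "CC(=O)O",
    "benzene": "c1ccccc1",
}

def get_smiles(node, name):
    """Look up a compound's SMILES string in a sub-community node."""
    return node[name.lower()]

print(get_smiles(common_organic_solvents, "Benzene"))  # c1ccccc1
```
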
/media/cfp/submissions/SZXLKG/179372564-c286b115-af14-4ad8-a37f-0a216297b6c1_c5ITD3G.png
Suliman Sharif
2022-12-02T15:00:00+00:00
15:00
01:00
Talk Track I
cfp-301-keynote-dj-patil
https://global2022.pydata.org//cfp/talk/KQLTJ7/
false
Keynote - DJ Patil
Keynote
en
DJ Patil is the former U.S. Chief Data Scientist
DJ is a board member for Devoted Health and former CTO. He’s a Senior Fellow at the Harvard Belfer Center, an Advisor to Venrock Partners, and a member of the DoD's Science Board. Most recently, he was Senior Staff and CTO for the Biden-Harris Transition. Dr. Patil was appointed by President Obama to be the first U.S. Chief Data Scientist, where he established nearly 40 Chief Data Officer roles. He was also Chief Scientist, Chief Security Officer, and Head of Analytics and Data Product Teams at LinkedIn, where he co-coined the term ``Data Scientist``.
/media/cfp/submissions/KQLTJ7/Screen_Shot_2022-11-16_at_3.17.46_PM_0tWbfqd.png
DJ Patil
2022-12-02T16:00:00+00:00
16:00
00:30
Talk Track I
cfp-185-testing-big-data-applications-spark-dask-and-ray-
https://global2022.pydata.org//cfp/talk/ZCJSRH/
false
Testing Big Data Applications (Spark, Dask, and Ray)
Talk
en
Data practitioners use distributed computing frameworks such as Spark, Dask, and Ray to work with big data. One of the major pain points of these frameworks is testability. For testing simple code changes, users have to spin up local clusters, which have a high overhead. In some cases, code dependencies force testing against a cluster. Because testing on big data is hard, it becomes easy for practitioners to avoid testing entirely. In this talk, we’ll show best practices for testing big data applications. By using Fugue to decouple logic and execution, we can bring more tests locally and make it easier for data practitioners to test with low overhead.
Distributed computing engines such as Spark, Dask, and Ray allow data practitioners to scale their data processing over a cluster of machines. The problem is that debugging and testing distributed computing code is notoriously hard. This is for both when iterating and for unit testing. Stacktraces can often become cryptic because distributed computing uses futures or async code under the hood.
More importantly, using these frameworks can lock in testing to depend on a cluster. Some libraries such as databricks-connect make it convenient to run code on a Spark cluster, but all testing code ends up running on the cluster as well. On a code level, distributed computing code often is a combination of logic and execution behavior. This makes it hard to unit test logic without testing execution code as well.
Ideally, we want to test as much as possible locally. This speeds up iteration time and decreases computing expenses. Keeping code closer to native Python or Pandas also gives more intuitive tracebacks when errors arise. When we’re production-ready, we can run the full test suite on the cluster.
Fugue is an abstraction layer for distributed computing that ports Python and Pandas code to Spark, Dask, and Ray. By using an abstraction layer, we can code in native Python or Pandas rather than using big data frameworks. Decoupling logic and execution dramatically reduces the overhead to run tests because tests can be run locally on Pandas or Python. Unit tests are easier to write because they focus on business logic on smaller data. When production-ready, Fugue allows users to toggle the execution engine and then run the same code in a cluster seamlessly.
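The decoupling idea can be sketched as follows: business logic is written against plain pandas and unit-tested locally, and only the (commented-out) Fugue call would move it to a cluster. The function and column names are invented for illustration:

```python
import pandas as pd

# Business logic written against plain pandas: trivially unit-testable locally.
def add_discount(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["discounted"] = out["price"] * 0.5
    return out

# Local unit test: a tiny in-memory frame, no cluster, sub-second feedback.
df = pd.DataFrame({"price": [10.0, 20.0]})
result = add_discount(df)
assert result["discounted"].tolist() == [5.0, 10.0]

# To scale out, Fugue can run the same function on Spark/Dask/Ray, roughly:
#   from fugue import transform
#   transform(df, add_discount, schema="*, discounted:double", engine=spark_session)
# (shown for illustration; the exact call depends on your setup)
```
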
Outline:
- Introduction to testing (4 mins)
- Testing code during iterating
- Unit tests
- Unit tests for data
- Why is it hard to test PySpark code? (6 mins)
- There is a lot of boilerplate code
- Testing can require a cluster depending on the setup
- Execution behavior can be baked into functions
- Stacktraces can be hard to read
- Comparisons to Dask and Ray (3 mins)
- What is the ideal setup? (4 mins)
- Test as much as possible locally
- Bring to the cluster when production-ready
- Run the unit test suite without a cluster
- Fugue abstraction layer (4 mins)
- Decoupling of logic and execution
- Fugue transform function (6 mins)
- Scaling functions to Spark/Dask/Ray
- Type hint conversion
- Testing code in native Python and Pandas (6 mins)
- Unit tests become easier to write
- Stack traces for debugging become easier to read
- Running full tests on a cluster (3 mins)
- Simply swap the execution engine for running tests
- Conclusion and questions (4 minutes)
Han Wang
2022-12-02T16:30:00+00:00
16:30
00:30
Talk Track I
cfp-260-let-s-discover-drugs-using-deep-learning
https://global2022.pydata.org//cfp/talk/E7BKUX/
false
Let's Discover Drugs using Deep Learning
Talk
en
This talk will go into how Deep Learning is changing the world of Cheminformatics. We will dive deep into how we can leverage traditional NLP Transformer models to perform a seemingly unrelated task such as Drug Discovery. This talk will give a brief introduction to the field of Cheminformatics and then go into detail as to how, and what kind of, Transformers can be utilized for the task at hand.
The Drug Discovery market is estimated to reach $75 billion by 2025, and the recent pandemic has taught us how crucial it is to be more agile in the process of Drug Discovery. Due to recent advances in Deep Learning, computational Drug Discovery has become possible with great execution speed and scale. The talk is outlined as:
1. Background: This section will go into the relevant background required for understanding cheminformatics problems from a Machine Learning perspective, as well as give an intuition into the importance of the problem.
2. Deep Learning Methodologies: This section will dive into how the problem statements are formulated in a DL setup and how these are effectively tackled using the recent advancements in Graph Neural Networks and primarily Transformers (the focus of the talk).
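As a concrete taste of how such problems are formulated for Transformers, molecules are often represented as SMILES strings and split into tokens before being fed to a model. A simplified regex tokenizer in that spirit (the speaker's actual pipeline may differ):

```python
import re

# Simplified version of the regex commonly used to tokenize SMILES strings
# for transformer models; a real pipeline may use a richer pattern.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnops]|\(|\)|\.|=|#|-|\+|/|\\|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list:
    tokens = SMILES_PATTERN.findall(smiles)
    assert "".join(tokens) == smiles, "pattern dropped characters"
    return tokens

print(tokenize_smiles("CC(=O)O"))   # ['C', 'C', '(', '=', 'O', ')', 'O']
print(tokenize_smiles("c1ccccc1"))  # aromatic benzene ring
```
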
Rahul Baboota
2022-12-02T17:00:00+00:00
17:00
00:30
Talk Track I
cfp-62-mlops-for-the-rest-of-us-a-poor-man-s-guide-to-putting-models-in-production
https://global2022.pydata.org//cfp/talk/RXGYC7/
false
MLOps for the rest of us: A poor man's guide to putting models in production
Talk
en
What if you're a two-person machine learning team deploying models to users? What if you don't have a full-blown team of Data Engineers working with you? What if nobody around you cares about making that nasty production data available in a pristine feature store? What if you don't even have time to build out your entire Machine Learning platform?
There must be a way to still deliver your ML model to users, right? There must be a way to deliver value.
In this session, I'll talk about how small teams address the problem of delivering ML value to users. At a reasonable scale. I'll go over some misconceptions and lessons learned from 4 years working with early-stage startups.
#### Summary & Objective
MLOps doesn't have to be a monster. There is a set of principles/rules/methodologies that one can follow to deliver Machine Learning models to production. And it all comes down to following some standard practices that have been used in Software development for years now.
The main objective of this talk is to give attendees some insights on how to take MLOps into their own hands and get that model to production. This will be done through storytelling (e.g., some MLOps stories from the trenches) and practical examples (from my experience).
By the end of the talk, attendees should have increased confidence that they can take the MLOps problem and solve it. At their own scale.
#### Outline
- Cover slide
- Speaker introduction (2 min)
- What we'll talk about (2 min)
- You're not LinkedIn (4 min)
- The size of large tech companies
- Examples of startups I worked with
- Cost is a concern
- Time to value is a concern
- MLOps (4 min)
- It's all the rage
- But what's the main goal here?
- How to solve small scale ML Problems (15 min)
- Communicating with the business
- "We need an ML model"
- The Data, what if you don't even have it?
- Scraping
- Continuous learning
- Start uber small
- Packaging
- Makefile
- Unit tests
- Continuous integration/Delivery
- Delivering models
- FastAPI
- Docker-based services (e.g., Cloud Run or similar)
- Monitoring models
- Evidently
- Prediction logging
- Making a platform
- Cookiecutter
- Key takeaways
- You can go a long way with 2 or 3 tools
- Make sure your ML Engineers know how to make software
- Keep the main goal in mind
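As a small illustration of the "Prediction logging" bullet above, a stdlib-only sketch of logging each prediction as a JSON line (all names invented; a real setup would write to a file or a log handler rather than an in-memory buffer):

```python
import io
import json
import time

def log_prediction(stream, features, prediction, model_version="v1"):
    """Append one prediction as a JSON line -- enough to replay, audit, or
    feed a monitoring tool later. (All names here are invented.)"""
    record = {
        "ts": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    stream.write(json.dumps(record) + "\n")

# In production the stream would be a file or logging handler; StringIO stands in.
buf = io.StringIO()
log_prediction(buf, {"age": 42, "plan": "pro"}, prediction=0.87)
print(buf.getvalue().strip())
```
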
#### Prior knowledge
Basics of Python and Machine Learning
Duarte Carmo
2022-12-02T17:30:00+00:00
17:30
00:30
Talk Track I
cfp-172-how-to-eliminate-the-i-o-bottleneck-and-continuously-feed-the-gpu-while-training-in-the-cloud
https://global2022.pydata.org//cfp/talk/XNUDTH/
false
How to Eliminate the I/O Bottleneck and Continuously Feed the GPU While Training in the Cloud
Talk
en
Model training is a time-consuming, data-intensive, and resource-hungry phase in machine learning, with much use of storage, CPUs, and GPUs. The data access pattern in training requires frequent I/O of a massive number of small files, such as images and audio files. With the advancement of distributed training in the cloud, it is challenging to maintain the I/O throughput to keep expensive GPUs highly utilized without waiting for access to data. The unique data access patterns and I/O challenges associated with model training compared to traditional data analytics necessitate a change in the architecture of your data platform.
In this talk, Lu Qiu will introduce a new architecture to optimize I/O in the entire data pipeline and maintain the throughput required by the GPU. She will also share how to implement this new architecture in Kubernetes for PyTorch workloads in the public cloud.
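One generic ingredient of keeping GPUs fed is overlapping I/O with compute via prefetching. A toy stdlib sketch of the pattern (not the speaker's architecture, which relies on a caching data platform):

```python
import queue
import threading

def prefetch(load_fn, items, depth=4):
    """Load items in a background thread so the consumer (e.g. a GPU training
    step) finds the next batch already buffered instead of waiting on I/O."""
    q = queue.Queue(maxsize=depth)
    sentinel = object()

    def worker():
        for item in items:
            q.put(load_fn(item))   # stands in for a slow read (S3, disk, ...)
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while (batch := q.get()) is not sentinel:
        yield batch

# Toy usage: "loading" an item just doubles it.
loaded = list(prefetch(lambda i: i * 2, range(5)))
print(loaded)  # [0, 2, 4, 6, 8]
```
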
Lu Qiu
2022-12-02T18:00:00+00:00
18:00
01:00
Talk Track I
cfp-302-keynote-gabriela-de-queiroz
https://global2022.pydata.org//cfp/talk/3VRJLZ/
false
Keynote - Gabriela de Queiroz
Keynote
en
Gabriela de Queiroz is a Principal Cloud Advocate Manager at Microsoft.
At Microsoft, Gabriela leads and manages the Global AI/ML/Data team in Education Advocacy.
Before that, she worked at IBM as a Program Director on Open Source, Data & AI Technologies and then as Chief Data Scientist at IBM, leading AI Strategy and Innovations.
She is the founder of AI Inclusive, a global organization that is helping increase the representation and participation of gender minorities in Artificial Intelligence. She is also the founder of R-Ladies, a worldwide organization for promoting diversity in the R community with more than 200 chapters in 55+ countries.
She has worked in several startups, where she built teams, developed statistical models, and employed a variety of techniques to derive insights and drive data-centric decisions. She likes to mentor and share her knowledge through mentorship programs, tutorials, and talks.
/media/cfp/submissions/3VRJLZ/Screen_Shot_2022-11-16_at_3.17.39_PM_r9bCOEh.png
Gabriela de Queiroz
2022-12-02T19:00:00+00:00
19:00
00:30
Talk Track I
cfp-171-100x-faster-networkx-dispatching-to-graphblas
https://global2022.pydata.org//cfp/talk/KTEUCA/
false
100x Faster NetworkX: Dispatching to GraphBLAS
Talk
en
NetworkX is the most popular graph/network library in Python. It is easy to use, well documented, easy to contribute to, extremely flexible, and extremely slow for large graphs.
An upcoming release begins to fix that last issue by calling fast GraphBLAS implementations instead of the native Python implementation.
If you use NetworkX or have ever written a graph algorithm, this talk will be of interest to you, as it shows how NetworkX is planning a path of pluggable algorithm libraries so users can opt in to faster implementations with minimal code changes.
NetworkX is extremely popular (4M downloads/week on PyPI) and is the usual entry point for Python users beginning their journey of network and graph analysis. The documentation is superb, tutorials exist, the API is stable. It is a fantastic library until it isn’t… and that point usually happens when attempting to analyze large graphs.
Because it is written in pure Python using a dict-of-dicts model, performance is orders of magnitude slower than highly tuned C/C++ libraries. However, while other libraries may be fast, they lack the usability and community of NetworkX.
One of these highly tuned libraries is python-graphblas which uses linear algebra to write elegant, yet highly efficient graph algorithms. A brief overview of GraphBLAS will be given.
This talk will showcase calling the NetworkX API with a GraphBLAS object instead of a NetworkX Graph. This will automatically dispatch to the matching GraphBLAS implementation, resulting in an impressive speedup.
We will also show how authors of other graph libraries can integrate with NetworkX and achieve similar results.
This is an experimental feature. We hope to get feedback from the community to continue making NetworkX even better.
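A pure-Python sketch of the dispatching idea: the algorithm call inspects the graph object for a backend marker and routes to a registered fast implementation. The attribute and registry names here are simplified stand-ins for NetworkX's real dispatch machinery:

```python
# Simplified stand-in for NetworkX's backend dispatching: algorithms check the
# graph object for a plugin marker and route to a registered implementation.
class GraphBLASBackend:
    @staticmethod
    def pagerank(G):
        return {"via": "graphblas"}       # would call the fast implementation

def pure_python_pagerank(G):
    return {"via": "pure-python"}         # the native dict-of-dicts code path

_backends = {"graphblas": GraphBLASBackend}

def pagerank(G):
    backend = getattr(G, "__networkx_plugin__", None)  # simplified marker name
    if backend in _backends:
        return _backends[backend].pagerank(G)
    return pure_python_pagerank(G)

class PlainGraph:
    pass

class GraphBLASGraph:
    __networkx_plugin__ = "graphblas"

print(pagerank(PlainGraph()))       # routed to the native implementation
print(pagerank(GraphBLASGraph()))   # same call, fast backend
```
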
Jim Kitchen, Erik Welch, Mridul Seth
2022-12-02T19:30:00+00:00
19:30
00:30
Talk Track I
cfp-169-vaex-the-perfect-dataframe-library-for-python-data-apps
https://global2022.pydata.org//cfp/talk/DV3ZGE/
false
Vaex: the perfect DataFrame Library for Python data apps
Talk
en
Vaex is an incredibly powerful DataFrame library that allows one to work with datasets much larger than RAM on a single node. It combines memory mapping, lazy evaluations, efficient C++ algorithms, and a variety of other tricks to empower your off-the-shelf laptop and make it crunch through a billion samples in real time.
A common use-case for Vaex is as a backend for data apps, especially if one needs to process, transform, and visualize a larger amount of data in real time. Vaex implements a number of features that have been specifically designed to improve performance of data hungry dashboards or apps, namely:
- caching
- async evaluations
- early stopping of operations
- progress bars
In this talk we will showcase how you can use these features to build efficient dashboards and data apps, regardless of the data app library you prefer using.
Working with datasets comprising millions or billions of samples is an increasingly common task, one that is typically tackled with distributed computing. Nodes in high-performance computing clusters have enough RAM to run intensive and well-tested data analysis workflows. More often than not, however, this is preceded by the scientific process of cleaning, filtering, grouping, and other transformations of the data, through continuous visualizations and correlation analysis. In today’s work environments, many data scientists prefer to do this on their laptops or workstations, so as to use their time more effectively and not rely on a spotty internet connection to access their remote data and computation resources. Modern laptops have sufficiently fast SSD storage, but upgrading RAM is expensive or impossible.
Applying the combined benefits of computational graphs, which are common in neural network libraries, with delayed (a.k.a. lazy) evaluations to a DataFrame library enables efficient memory and CPU usage. Together with memory-mapped storage (Apache Arrow, HDF5) and out-of-core algorithms, we can process considerably larger data sets with fewer resources. As an added bonus, the computational graphs ‘remember’ all operations applied to a DataFrame, meaning that data processing pipelines can be generated automatically.
The computational efficiency of Vaex makes it a particularly good candidate for a backend of data hungry dashboards or data apps. In fact Vaex implements a variety of features that are specifically designed to help build efficient and memory-safe dashboards such as caching, async evaluations, fingerprinting, etc. With such features, one can build data intensive applications and even low-code tools while keeping the infrastructure cost and complexity lower. In this talk, we will present how one can take advantage of these special features of Vaex when building data applications.
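The lazy "virtual column" idea can be illustrated in a few lines of plain Python and numpy (a conceptual toy, not Vaex's API): an expression records operations and evaluates only on demand, so derived columns cost no memory until used.

```python
import numpy as np

# Conceptual toy of a lazy "virtual column": the expression records what to
# compute and touches the data only when evaluated.
class Expr:
    def __init__(self, fn):
        self.fn = fn

    def __add__(self, other):
        return Expr(lambda data: self.fn(data) + other.fn(data))

    def __mul__(self, scalar):
        return Expr(lambda data: self.fn(data) * scalar)

    def evaluate(self, data):
        return self.fn(data)

def col(name):
    return Expr(lambda data: data[name])

data = {"x": np.arange(5), "y": np.ones(5)}
virtual = col("x") * 2 + col("y")     # nothing is computed yet
print(virtual.evaluate(data))         # materializes only on demand
```
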
Jovan Veljanoski, Maarten Breddels
2022-12-02T20:00:00+00:00
20:00
00:30
Talk Track I
cfp-196-scale-data-science-by-pandas-api-on-spark
https://global2022.pydata.org//cfp/talk/U9XTNQ/
false
Scale Data Science by Pandas API on Spark
Talk
en
With Python emerging as the primary language for data science, pandas has grown rapidly to become one of the standard data science libraries. One of the known limitations in pandas is that it does not scale with your data volume linearly due to single-machine processing.
Pandas API on Spark overcomes the limitation, enabling users to work with large datasets by leveraging Apache Spark. In this talk, we will introduce Pandas API on Spark and help you scale your existing data science workloads using that. Furthermore, we will share the cutting-edge features in Pandas API on Spark.
Python data science has exploded over the last few years. pandas is the de facto standard (single-node) DataFrame implementation in Python, while Apache Spark is the de facto standard for big data processing.
Pandas API on Spark is a new module in Apache Spark that implements pandas API on top of Spark SQL. It enables optimization techniques by Spark Catalyst Optimizer, and provides an easy switch between pandas API and existing PySpark features.
In this talk, we will introduce how Pandas API on Spark optimizes single-machine performance and scales well beyond a single machine. Furthermore, we will highlight the latest updates of Pandas API on Spark.
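The key promise is API parity: the snippet below is plain pandas, and conceptually switching `import pandas as pd` for `import pyspark.pandas as ps` (and `pd` for `ps`) runs the same logic distributed on Spark. It is executed here with pandas so no cluster is needed:

```python
import pandas as pd

# Plain pandas; with pandas API on Spark, the same groupby/sum code would run
# distributed after swapping the import to `import pyspark.pandas as ps`.
df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})
result = df.groupby("group")["value"].sum().sort_index()
print(result.to_dict())  # {'a': 3, 'b': 3}
```
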
Xinrong Meng, Takuya Ueshin
2022-12-02T20:30:00+00:00
20:30
01:00
Talk Track I
cfp-303-keynote-quincy-larson
https://global2022.pydata.org//cfp/talk/R7CWEZ/
false
Keynote - Quincy Larson
Keynote
en
Quincy Larson is the Founder of freecodecamp.org.
Quincy Larson is a teacher and school director from Oklahoma. At age 31, he started learning to code using free online courses and attending hackathons. After working as a software engineer, he founded freeCodeCamp.org to help other busy adults also learn to code and transition into tech careers. More than a million people now use freeCodeCamp courses each day, and 10,000s of people have used it to successfully transition into software development careers.
/media/cfp/submissions/R7CWEZ/Screen_Shot_2022-11-16_at_3.14.54_PM_sokJR4I.png
Quincy Larson
2022-12-02T21:30:00+00:00
21:30
00:30
Talk Track I
cfp-146-everything-you-need-to-know-about-transformer-models
https://global2022.pydata.org//cfp/talk/BWU9YV/
false
Everything you need to know about Transformer Models
Talk
en
Transformer models are all around in the deep learning community and this talk will help to better understand why transformers achieve such impressive results. Using various explainability techniques and plain numpy examples, participants will gain an understanding of the attention mechanism, its implementation, and how it all comes together.
Transformer-based models have revolutionized Natural Language Processing and are also increasingly applied in Computer Vision. For example, they achieve impressive results generating images or translating text. Many people probably remember the sometimes strange or funny translations of Google Translate and similar services, which nowadays are highly accurate – thanks to the transformer models working behind the scenes.
The key components of a transformer are so-called attention heads that are often claimed to mimic the way the human brain processes texts and images. Mathematically, attention is not much more than matrix multiplication, but what’s actually going on here? How can layers of these learn to achieve impressive results?
In this talk, we will take a practical approach to understand the inner workings of transformers by implementing a basic example in numpy and visualize how attention works.
This talk is aimed at a general audience. No in-depth knowledge of maths or machine learning is required to follow this talk. Aside from familiarity with the basics of numpy, all you need is your curiosity.
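The numpy implementation promised above centers on scaled dot-product attention, softmax(Q K^T / sqrt(d)) V. A self-contained version (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query tokens, dimension 4
K = rng.normal(size=(5, 4))   # 5 key tokens
V = rng.normal(size=(5, 4))   # one value vector per key

out, w = attention(Q, K, V)
print(out.shape)                        # (3, 4): one mixed value per query
assert np.allclose(w.sum(axis=1), 1.0)  # each query's weights sum to 1
```
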
Mike Rothenhäusler
2022-12-02T22:00:00+00:00
22:00
00:30
Talk Track I
cfp-248-you-don-t-need-a-cluster-for-that-using-embedded-sql-engines-for-plotting-massive-datasets-on-a-laptop
https://global2022.pydata.org//cfp/talk/VQRC77/
false
You don't need a cluster for that: using embedded SQL engines for plotting massive datasets on a laptop
Talk
en
This talk will show you a simple yet effective technique to visualize larger-than-memory datasets on your laptop by leveraging SQLite or DuckDB. No need to spin up a Spark cluster!
Data visualization is an essential skill for every data practitioner. The typical approach for visualizing data is to use pandas for data cleaning and matplotlib (or seaborn) for visualization. However, this approach falls short when we want to visualize large datasets since pandas and matplotlib require you to load the entire dataset into memory.
In such cases, a practitioner might think of using a Spark or Dask cluster; however, this adds a lot of complexity since the cluster needs configuration and maintenance. Furthermore, this might not be possible if you don't have access to such infrastructure.
This talk will show you how to use SQLite (or DuckDB) to plot massive datasets from your laptop efficiently. With this approach, there is no extra infrastructure to maintain, and you'll be able to plot datasets that do not fit into memory.
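The core trick can be shown with the stdlib's sqlite3: bin the data inside the database so only aggregated counts, not the raw rows, ever reach Python and the plotting library. Table and column names are invented for illustration:

```python
import random
import sqlite3

# Bin the values inside SQLite so only per-bin counts -- not ten thousand
# raw rows -- ever reach Python. (Table and column names are invented.)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measurements (value REAL)")
random.seed(0)
con.executemany(
    "INSERT INTO measurements VALUES (?)",
    [(random.uniform(0, 10),) for _ in range(10_000)],
)

bin_width = 1.0
rows = con.execute(
    "SELECT CAST(value / ? AS INT) AS bin, COUNT(*) AS n "
    "FROM measurements GROUP BY bin ORDER BY bin",
    (bin_width,),
).fetchall()

for b, n in rows:                      # tiny result set: one row per bin
    print(f"[{b * bin_width:4.1f}, {(b + 1) * bin_width:4.1f}): {n}")
```

The same query works nearly verbatim in DuckDB, which is typically faster for analytical scans.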
Outline:
[0 - 4 min] Why pandas fails
[4 - 10 min] How SQLite/DuckDB can help us scale data visualization
[10 - 18 min] Use case: plotting histograms
[18 - 26 min] Use case: plotting boxplots
[26 - 28 min] Summary and conclusions
[28 - 30 min] Q&A
Eduardo Blancas
2022-12-02T09:00:00+00:00
09:00
00:30
Talk Track II
cfp-229-responsible-ai-what-why-how-and-future-
https://global2022.pydata.org//cfp/talk/T9HHPK/
false
Responsible AI - What, Why, How and Future!
Talk
en
Mostly, people relate Artificial Intelligence to progress, intelligence, and productivity. But with this come unfair decisions, biases, a human workforce being replaced, and a lack of privacy and security. And to make matters worse, a lot of these problems are specific to AI, which indicates that the rules and regulations in place are inadequate to deal with them. Responsible AI comes into play in this situation. It seeks to resolve these problems and establish accountability for AI systems. In this talk I will cover what Responsible AI is, why it is needed, how it can be implemented, what the various frameworks for Responsible AI are, and what the future holds.
I will cover what Responsible AI is, why it is needed, how it can be implemented, what the various frameworks for Responsible AI are, and what the future holds. I will discuss the three major Responsible AI frameworks: Google's, Microsoft's, and IBM's.
The development of AI is creating new opportunities to improve the lives of people around the world, from business to healthcare to education. It is also raising new questions about the best way to build fairness, interpretability, privacy, and security into these systems.
The first area mentioned is interpretability. When we interpret a model we get an explanation for how it makes predictions. An AI system could reject your application for a mortgage or diagnose you with cancer. A user would likely demand an explanation even if these decisions are correct. Some models are easier to interpret than others making it easier to get explanations. Responsible AI can define how we build interpretable models or when it is okay to use one that is less interpretable.
Related to interpretability is model fairness. It is possible for AI systems to make decisions that discriminate against certain groups of people. This bias comes from bias in the data used to train models. In general, the more interpretable a model the easier it is to ensure fairness and correct any bias. We still need a Responsible AI framework to define how we evaluate fairness and what to do when a model is found to make unfair predictions. This is especially important when using less interpretable models.
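A minimal illustration of the kind of fairness evaluation such frameworks formalize: compare positive-prediction rates across groups, whose spread is the demographic parity gap. Data and names below are invented:

```python
# Compare positive-prediction ("approval") rates across groups: the spread is
# the demographic parity gap. Data and labels are purely illustrative.
def selection_rates(groups, predictions):
    per_group = {}
    for g, p in zip(groups, predictions):
        per_group.setdefault(g, []).append(p)
    return {g: sum(ps) / len(ps) for g, ps in per_group.items()}

groups      = ["a", "a", "a", "b", "b", "b"]
predictions = [1, 1, 0, 1, 0, 0]          # 1 = approved

rates = selection_rates(groups, predictions)
gap = max(rates.values()) - min(rates.values())
print({g: round(r, 2) for g, r in rates.items()}, "gap:", round(gap, 2))
# {'a': 0.67, 'b': 0.33} gap: 0.33
```
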
Safety and security are another concern. These are not new to software development and are addressed by techniques like encryption and software tests. The difference is that, unlike general computer systems, AI systems are not deterministic. When faced with new scenarios they can make unexpected decisions. The systems can even be manipulated into making incorrect decisions. This is particularly concerning when we are dealing with robots: if they make errors, things like self-driving cars can cause injury or death.
The last aspect is privacy and data governance. The quality of data used is important. If there are mistakes in the data used by AI then the system may make incorrect decisions. In general, AI should also not be allowed to use sensitive data.
Ultimately, what this all boils down to is trust. If users do not trust AI, they will not use your services. We won’t trust systems that use information we are uncomfortable sharing, or that we think will make biased decisions. We certainly won’t trust systems we think will cause us physical harm. Explanations for decisions and accountability for those decisions go a long way in building this trust. This need for trust is what is driving self-regulation amongst companies that use AI.
Dr. Sonal Kukreja
2022-12-02T10:00:00+00:00
10:00
00:30
Talk Track II
cfp-297-building-large-scale-localized-language-models-from-data-preparation-to-training-and-deployment-to-production-
https://global2022.pydata.org//cfp/talk/AQLSAY/
false
Building Large-scale, Localized Language Models: From Data Preparation to Training and Deployment to Production.
Talk
en
Recent advances in natural language processing demonstrate the capability of large-scale language models (such as GPT-3) to solve a variety of NLP problems with zero shots, shifting from supervised fine-tuning to prompt engineering/tuning.
However, building large language models raises data preparation, training, and deployment challenges. In addition, while the process is well-established for a few dominant languages, such as English, its execution in localized languages remains limited. We'll give an overview of the end-to-end process for building large-scale language models, discuss the challenges of scaling, and describe some existing solutions for efficient data preparation, distributed training, model optimization, and distributed deployment. We'll use examples in localized languages such as French or Spanish using NVIDIA Nemo Megatron, a framework for training large NLP models optimized for SuperPOD hardware infrastructure.
Miguel Martínez, Meriem Bendris
2022-12-02T10:30:00+00:00
10:30
00:30
Talk Track II
cfp-221-maps-maps-maps-
https://global2022.pydata.org//cfp/talk/GLJX3M/
false
Maps, Maps, Maps!
Talk
en
Python has many different packages that are useful for working with different kinds of geographical data. This presentation will introduce several of these packages and show you how you can get started working with geolocated information and presenting insights on maps.
Python has a rich ecosystem for working with maps and geolocated data. This talk will present several of the most popular packages for working with geographical information and give you a starting point for your own explorations.
You'll be introduced to:
- `folium` to create interactive, JavaScript-driven maps
- `shapely` to describe geometric objects like points, lines, and polygons
- `geopandas` to analyze data tied to geographical objects
- `pyproj` to manipulate geo-coordinates
- `rasterio` to work with gridded geographical data
There are several other packages connected to these that you'll hear about and learn which role they play. The presentation will include a live demonstration of different packages and their capabilities. You'll learn how you can get started working with geo-data in Python and understand the strengths of the different libraries available.
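As a taste of what these libraries enable, here is a minimal, illustrative sketch (not from the talk; the geometry is made up) using `shapely` to build a polygon and test spatial relationships:

```python
# A minimal sketch of the kind of geometry work shapely enables:
# points, polygons, and spatial predicates.
from shapely.geometry import Point, Polygon

# A hypothetical square "region" and two points of interest.
region = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])
inside = Point(1, 1)
outside = Point(10, 10)

print(region.area)                   # 16.0
print(region.contains(inside))       # True
print(region.contains(outside))      # False
print(inside.distance(Point(4, 5)))  # Euclidean distance: 5.0
```

`geopandas` builds on exactly these geometry objects, attaching them as a column of a DataFrame so you can filter and aggregate data spatially.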
Geir Arne Hjelle
2022-12-02T11:30:00+00:00
11:30
00:30
Talk Track II
cfp-140-bert-s-achilles-heel-applying-contrastive-learning-to-fight-anisotropy-in-language-models-
https://global2022.pydata.org//cfp/talk/FULGZZ/
false
BERT's Achilles' heel? Applying contrastive learning to fight anisotropy in language models.
Talk
en
Transformer models became state-of-the-art in natural language processing. Word representations learned by these models offer great flexibility for many types of downstream tasks from classification to summarization. Nonetheless, these representations suffer from certain conditions that impair their effectiveness. Researchers have demonstrated that BERT and GPT embeddings tend to cluster in a narrow cone of the embedding space which leads to unwanted consequences (e.g. spurious similarities between unrelated words). During the talk we’ll introduce SimCSE – a contrastive learning method that helps to regularize the embeddings and reduce the problem of anisotropy. We will demonstrate how SimCSE can be implemented in Python.
**Brief Bullet Point Outline**
• Introduction (1 min)
• A refresher on Transformer model (3 min)
• What is anisotropy? (3 min)
• Contrastive learning – what and why? (5 min)
• Embeddings and SimCSE in Python (13 min)
• Q&A (5 min)
**Prerequisites**
People of all backgrounds and experience levels are invited to participate in the talk. However, to get the most out of the presentation, the following skills are recommended:
• Familiarity with Python & Huggingface library
• Good understanding of basic NLP concepts, in particular embeddings
• Basic understanding of Transformer architecture
• Solid basics of supervised and unsupervised learning
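To make the contrastive idea concrete before the Python demo, the following NumPy sketch (an illustration under assumed data, not the SimCSE reference implementation) computes the in-batch InfoNCE loss that SimCSE optimizes: each embedding's positive is its own second (e.g. dropout-perturbed) view, and the other views in the batch serve as negatives.

```python
# Illustrative InfoNCE loss over two "views" of a batch of embeddings.
import numpy as np

def info_nce_loss(view_a, view_b, temperature=0.05):
    # Normalize so dot products are cosine similarities.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    sims = a @ b.T / temperature                   # (batch, batch) similarities
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    # Row-wise softmax cross-entropy with the diagonal as the target class.
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))
noisy = base + 0.01 * rng.normal(size=(8, 16))  # stand-in for a dropout view
aligned = info_nce_loss(base, noisy)            # near zero: positives dominate
random = info_nce_loss(base, rng.normal(size=(8, 16)))
print(aligned < random)
```

Minimizing this loss pulls the two views of each sentence together while pushing apart unrelated sentences, which spreads the embeddings out and counteracts anisotropy.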
Aleksander Molak
2022-12-02T12:00:00+00:00
12:00
00:30
Talk Track II
cfp-232-how-to-properly-test-ml-models-data
https://global2022.pydata.org//cfp/talk/FBLQAZ/
false
How to Properly Test ML Models & Data
Talk
en
Automatic testing for ML pipelines is hard. Part of the executed code is a model that was dynamically trained on a fresh batch of data, and silent failures are common. Therefore, it’s problematic to use known methodologies such as automating tests for predefined edge cases and tracking code coverage.
In this talk we’ll discuss common pitfalls with ML models, and cover best practices for automatically validating them: What should be tested in these pipelines? How can we verify that they'll behave as we expect once in production?
We’ll demonstrate how to automate tests for these scenarios and introduce a few open-source testing tools that can aid the process.
- Short intro to software tests
- Testing ML
* Challenges
* Types of problems
* How, When, and What should we test for
- Tools and recommendations for incorporating tests into your ML workflows
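As a flavor of the kind of checks discussed, here is a hedged sketch (the model and all names are illustrative stand-ins, not from the talk) of behavioral tests that assert invariants of a model rather than exact predictions:

```python
# Behavioral tests: assert properties any reasonable model should satisfy
# in production, instead of pinning exact outputs.
def predict_default_risk(income, debt):
    """Stand-in for a trained model: returns a risk score in [0, 1]."""
    return max(0.0, min(1.0, debt / (income + debt + 1e-9)))

def test_output_range():
    # Scores must always be valid probabilities, including edge cases.
    for income, debt in [(0, 0), (50_000, 10_000), (1, 1_000_000)]:
        assert 0.0 <= predict_default_risk(income, debt) <= 1.0

def test_monotonicity():
    # Directional expectation: more debt at equal income => no lower risk.
    assert predict_default_risk(50_000, 40_000) >= predict_default_risk(50_000, 10_000)

test_output_range()
test_monotonicity()
print("behavioral checks passed")
```

Because these tests encode expectations rather than memorized outputs, they keep passing when the model is retrained on fresh data, and fail loudly on the silent regressions the talk warns about.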
Shir Chorev
2022-12-02T12:30:00+00:00
12:30
00:30
Talk Track II
cfp-5-industrial-strength-dalle-e-scaling-complex-large-text-image-models
https://global2022.pydata.org//cfp/talk/N3SV3B/
false
Industrial Strength DALLE-E: Scaling Complex Large Text & Image Models
Talk
en
Identifying the right tools for high-performance machine learning can be overwhelming as the ecosystem continues to grow at breakneck speed. This is particularly true when dealing with the increasingly popular large language and image-generation models such as GPT-2, OPT and DALL-E, among others. In this session we will dive into a practical showcase where we productionise the large image-generation model DALL-E, and show some optimizations that can be introduced, as well as considerations as use-cases scale. By the end of this session practitioners will be able to run their own DALL-E-powered applications and integrate them with functionality from other large language models like GPT-2. We will leverage key tools in the Python ecosystem to achieve this, including PyTorch, HuggingFace, FastAPI and MLServer.
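One scaling optimization commonly introduced in such serving setups is dynamic micro-batching. The pure-Python sketch below (an illustration under assumed names, not the session's implementation) shows the core idea: buffer incoming requests briefly so the model runs one batched pass instead of many single-item passes.

```python
# Dynamic micro-batching sketch: drain a request queue up to a batch limit,
# then run a single batched "inference" call.
from queue import Queue, Empty

def batched_inference(prompts):
    """Stand-in for one forward pass of an image-generation model."""
    return [f"image<{p}>" for p in prompts]

def serve(request_queue, max_batch=4, timeout=0.01):
    """Collect up to max_batch requests, then run one batched call."""
    batch = []
    try:
        while len(batch) < max_batch:
            batch.append(request_queue.get(timeout=timeout))
    except Empty:
        pass  # partial batch: serve whatever arrived within the window
    return batched_inference(batch)

q = Queue()
for prompt in ["a red fox", "a blue whale", "a green hill"]:
    q.put(prompt)
print(serve(q))  # one batched call handles all three requests
```

Real serving frameworks implement this (plus backpressure and timeouts) for you; the sketch only illustrates why batching amortizes the fixed cost of each model invocation.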
Alejandro Saucedo
2022-12-02T13:00:00+00:00
13:00
00:30
Talk Track II
cfp-4-metadata-systems-for-end-to-end-data-machine-learning-platforms
https://global2022.pydata.org//cfp/talk/V9NFUV/
false
Metadata Systems for End-to-End Data & Machine Learning Platforms
Talk
en
Organisations have been growingly adopting and integrating a non-trivial number of different frameworks at each stage of their machine learning lifecycle. Although this has helped reduce time-to-value for real-world AI use-cases, it has come at a cost of complexity and interoperability bottlenecks.
# Overview
Each stage in the end-to-end lifecycle involves different stakeholders who make decisions and perform actions that can modify data and/or ML components, with use-case-specific but ever-compounding risks. This results in a growing need to ensure a minimum level of metadata is collected, tracked and managed, which becomes increasingly important given overarching compliance requirements, as well as architectural requirements on lineage, auditability, accountability and reproducibility.
In this session we will dive into the challenges present in the metadata layer of large-scale systems, as well as tooling, best practices and solutions that can be adopted to tackle these challenges. We will discuss the rise of metadata management systems, the challenges they have been able to solve, and critical shortcomings where ecosystem-wide collaboration, starting with tooling-level alignment, will be key to ensure the long-term robustness of these heterogeneous end-to-end platforms.
## Benefits to the ecosystem
In recent years we have experienced how the evolution of the areas of DataOps and MLOps have introduced further complexities that involve concepts such as data-versioning, model-versioning, model registries, ML experiment tracking, ML model deployment, ML model promotion, monitoring, etc. These latter developments have raised new challenges that the ecosystem has been able to tackle through extending existing metadata management tools, as well as the creation of new tools.
This talk aims to further the discussion on existing best practices where metadata schemas, protocols and tooling have been successful in enabling interoperability across multiple systems in these end-to-end platforms. We hope that it brings benefits to the ecosystem that go beyond this session into actionable discussions, collaborations and open-source contributions towards continuing the momentum on improving interoperability across the MLOps ecosystem.
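As a concrete illustration of the kind of record such metadata systems track (all field and artifact names here are assumptions, not any specific tool's schema), consider a minimal lineage structure where every artifact records what produced it and which inputs it was derived from:

```python
# Minimal lineage/provenance record: enough metadata to answer
# "where did this model come from?" for auditability and reproducibility.
from dataclasses import dataclass

@dataclass(frozen=True)
class ArtifactRecord:
    name: str
    version: str
    produced_by: str  # pipeline step or stakeholder action
    inputs: tuple = ()  # upstream (name, version) pairs

def lineage(record, registry):
    """Walk upstream records to reconstruct full provenance."""
    out = [record]
    for key in record.inputs:
        out.extend(lineage(registry[key], registry))
    return out

registry = {
    ("raw_events", "v1"): ArtifactRecord("raw_events", "v1", "ingestion-job"),
    ("features", "v3"): ArtifactRecord("features", "v3", "feature-pipeline",
                                       inputs=(("raw_events", "v1"),)),
    ("churn_model", "v7"): ArtifactRecord("churn_model", "v7", "training-run",
                                          inputs=(("features", "v3"),)),
}
chain = lineage(registry[("churn_model", "v7")], registry)
print([r.name for r in chain])  # ['churn_model', 'features', 'raw_events']
```

Production metadata systems add much more (owners, schemas, run parameters, timestamps), but the ability to traverse this graph is what lineage, auditability and reproducibility requirements ultimately rest on.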
Alejandro Saucedo
2022-12-02T13:30:00+00:00
13:30
00:30
Talk Track II
cfp-192-things-i-learned-running-neural-networks-on-microcontrollers
https://global2022.pydata.org//cfp/talk/Q3SBSF/
false
Things I learned running neural networks on microcontrollers
Talk
en
A beginner's guide to running neural networks on microcontrollers: understanding the training pipeline, deployment, and how to update the deployed model.
Running neural networks on production systems is difficult, but running them on microcontrollers is a different challenge. The choice of microcontroller, the presence of a purpose-built processor, data I/O, model training and inferencing all change when the target deployment scenario shifts from a cloud instance to a power-constrained microcontroller. In this talk, I will walk through how a novice can approach this and get a model running.
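One standard step on this path is post-training quantization. The sketch below (an illustrative per-tensor symmetric int8 scheme, not any framework's implementation) shows why it shrinks a model roughly 4x: float32 weights become int8 values plus one scale factor.

```python
# Post-training int8 quantization sketch: map float weights to [-127, 127]
# with a single per-tensor scale, then reconstruct approximations.
def quantize(weights):
    scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero tensors
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.51, -1.27, 0.0, 0.89]
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q)  # small int8-range values: 1 byte each instead of 4
print(max(abs(a - b) for a, b in zip(weights, restored)) < scale)  # True
```

On a microcontroller this matters twice over: int8 weights fit in scarce flash/RAM, and many MCU-class accelerators only run integer arithmetic efficiently.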
/media/cfp/submissions/Q3SBSF/saradindu_yrEol77.jpg
SARADINDU SENGUPTA
2022-12-02T14:00:00+00:00
14:00
00:30
Talk Track II
cfp-243-extending-awkward-array-into-the-broader-pydata-ecosystem
https://global2022.pydata.org//cfp/talk/GEHBLR/
false
Extending Awkward Array into the broader PyData Ecosystem
Talk
en
The Awkward Array project provides a library for operating on nested,
variable length data structures with NumPy-like idioms. We present two
projects that provide native support for Awkward Arrays in the broader
PyData ecosystem. In dask-awkward we have implemented a new Dask
collection to scale up and distribute workflows with partitioned
Awkward Arrays. In awkward-pandas we have implemented a new Pandas
extension array type, making it easy to use Awkward Arrays in Pandas
workflows and enabling massive acceleration in the processing of
nested data. We will show how these projects plug into PyData and
present some compelling use cases.
Dask provides core collections for scaling up workflows that use NumPy
arrays and Pandas DataFrames. The collection interface defined by Dask
allows for the creation of custom collections. We will describe how we
created a collection to bring support to Awkward Arrays in Dask, and
explain some of the advantages of using a native collection over
alternatives in the Dask ecosystem: for example, we are able to
leverage Dask's high-level task graph layers and implement dedicated
optimizations.
Pandas provides the ExtensionArray interface to create and register
new types of Arrays that Pandas can recognize. We will show, with
example use-cases, how adding a new Awkward ExtensionArray improves
the performance of operating on nested data in Pandas workflows. For
example, Python for-loops in nested Pandas workflows can be sped up by
more than 100x with equivalent Awkward use.
The core purpose of developing dask-awkward and awkward-pandas is to
make nested, JSON-like data a first class type in the PyData ecosystem
that can be analyzed efficiently and at scale.
Doug Davis
2022-12-02T14:30:00+00:00
14:30
00:30
Talk Track II
cfp-252-practical-mlops-for-better-models
https://global2022.pydata.org//cfp/talk/VBG3LX/
false
Practical MLOps for better models
Talk
en
Machine learning operations (MLOps) are often synonymous with large and complex applications, but many MLOps practices help practitioners build better models, regardless of the size. This talk shares best practices for operationalizing a model and practical examples using the open-source MLOps framework `vetiver` to version, share, deploy, and monitor models.
Data scientists understand the pitfalls of building models; concepts such as overfitting have deeply thought out solutions built into data science workflows, but less thought is given to bringing models off a laptop. Adding MLOps practices such as versioning, deploying, and monitoring models avoids the pitfall of having model objects stuck on your personal machine. Building an MLOps strategy can sound daunting for data science teams, but these practices can be used in any size or scale of project (even projects that include multiple languages!) to create robust and reproducible models to be shared with others.
Listeners need no previous MLOps knowledge, but should have basic understanding of a data science or machine learning lifecycle. By the end of this talk, people will understand what the term MLOps entails and how to use MLOps to build better models. Listeners will also walk away with practical knowledge on how to use the open-source MLOps framework `vetiver` to version, share, deploy, and monitor models in Python, R, or both!
Isabel Zimmerman
2022-12-02T16:00:00+00:00
16:00
00:30
Talk Track II
cfp-166-on-creating-behavioral-profiles-of-your-customer-from-event-stream-data-introduction-to-cleora-the-open-source-tool-for-real-time-multimodal-modeling-
https://global2022.pydata.org//cfp/talk/7DUJCZ/
false
On creating behavioral profiles of your customer from event stream data – introduction to Cleora, the open-source tool for real time multimodal modeling.
Talk
en
We want to present Cleora – an open-source tool for creating a compact representation of your client's behavior. Cleora uses graph theory to transform streams of event data into embeddings. These are suitable as input for training models like churn, propensity and recommender systems. This talk is useful for anyone who wishes to learn how to work with client event data and how to model client behavior.
The objective of this talk is to introduce the audience to the open-source tool named Cleora (https://github.com/Synerise/cleora), which enables processing and embedding of big event data streams like client purchases, clicks on the webpage or card transactions, to name a few.
In many situations, data scientists struggle with creating a good representation of clients for two reasons:
* information comes from many different sources that cannot be easily combined, e.g. static attributes of the client and their clicks and purchases in the app
* it is very resource-intensive to process such big data streams, which makes effective work cumbersome
We observed that to tackle this issue, data scientists usually transform per-client events into an aggregated tabular form. This way they effectively lose a lot of latent information.
We ourselves, as a team of data scientists, struggled with this challenge, and this is when Cleora was invented. We proved that it is a good way of representing the behavior of a single client by reaching the podium in several competitions, such as SIGIR and KDD Cup, using exactly this solution.
Cleora embedding is effectively a compact profile of your client. It serves as an input to the neural network model – we can think of Cleora as an embedding model for events. During the talk, we will show how to use these embeddings for predicting churn among customers, modeling propensity for certain products and building recommender systems in a real time manner.
* This talk will be interesting for all machine learning engineers who work with data streams and model user behavior.
* It will be a hands-on talk with practical examples; we will also present the underlying technology, so you can expect a very gentle introduction to graphs.
* After the talk you will be able to create your own compact user embeddings and use them as input to machine learning models.
* Prior experience with Python and building machine learning models is expected.
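To illustrate the underlying idea without Cleora's actual algorithm (all names below are made up), the sketch shows how a stream of (client, item) events defines the bipartite graph that an embedding tool consumes, instead of a lossy per-client tabular aggregate:

```python
# Event stream -> bipartite client-item graph: the raw structure that
# graph-based embedding methods operate on.
from collections import defaultdict

events = [("alice", "laptop"), ("alice", "mouse"),
          ("bob", "mouse"), ("bob", "keyboard")]

client_to_items = defaultdict(set)
for client, item in events:
    client_to_items[client].add(item)

def shared_items(a, b):
    """Simple behavioral similarity: items both clients interacted with."""
    return client_to_items[a] & client_to_items[b]

print(shared_items("alice", "bob"))  # {'mouse'}
```

An aggregated table ("alice: 2 purchases") would discard exactly this co-interaction structure, which is the latent information the talk argues gets lost.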
Dominika Basaj, Barbara Rychalska
2022-12-02T16:30:00+00:00
16:30
00:30
Talk Track II
cfp-48-improving-production-workflows-for-scikit-learn-models-with-skops
https://global2022.pydata.org//cfp/talk/WQQQQY/
false
Improving production workflows for scikit-learn models with skops
Talk
en
Production workflows in machine learning have their own requirements compared to DevOps. In this talk, I will present a new library we are developing called "skops" that is built to improve production workflows for scikit-learn models.
Taking scikit-learn models to production or hosting them openly has its own challenges, like reproducibility, safety and more. We are trying to tackle these problems with our new open-source library skops. skops provides easy APIs to host your models, automatically create widgets for inference, and create model cards that document the model for versioning. In this talk, I will walk you through how you can use skops for versioning models.
Merve Noyan
2022-12-02T17:00:00+00:00
17:00
00:30
Talk Track II
cfp-239-machine-learning-in-the-warehouse-with-python
https://global2022.pydata.org//cfp/talk/HPPMEJ/
false
Machine Learning in the Warehouse with Python
Talk
en
Moving data in and out of a warehouse is both tedious and time-consuming. In this talk, we will demonstrate a new approach using the Snowpark Python library. Snowpark for Python is a new interface for Snowflake warehouses with Pythonic access that enables querying DataFrames without having to use SQL strings, using open-source packages, and running your model without moving your data out of the warehouse. We will discuss the framework and showcase how data scientists can design and train a model end-to-end, upload it to a warehouse and append new predictions using notebooks.
**Objective:** If you are a data scientist who already stores data in a warehouse, this talk will teach and demonstrate how to run ML models with the new Snowpark Python library. If you are new to warehouse data storage, the demonstration walks through integrating a Snowflake database with a Python notebook.
**(10-15 mins) Snowpark Overview:** We will run through the process of transforming data, training a model, and running the model while keeping all the data in one place. The Snowpark library **provides an intuitive API for querying and processing data in a data pipeline**.
**(15 mins) ML Model Demonstration:** The audience will be able to open the notebook and run the code themselves and leave with a more seamless ML workflow utilizing a pipeline in Python.
**Thesis:** Snowpark speeds up Python-based workflows with seamless access to open source packages and package manager via Anaconda Integration without having to move data.
This talk is for data scientists who have familiarity with data warehouses. A background in writing ML models in Python is recommended, but not necessary, as we will be going over the process from start to finish and providing all the code.
/media/cfp/submissions/HPPMEJ/python_warehouse_6VsITGc.png
Allan Campopiano
2022-12-02T17:30:00+00:00
17:30
00:30
Talk Track II
cfp-92-better-python-coding-with-prefect-blocks
https://global2022.pydata.org//cfp/talk/HFGGPM/
false
Better Python Coding with Prefect Blocks
Talk
en
Everyone who codes can save time by reusing configuration — whether for logging in to cloud providers or databases, spinning up Docker containers, or sending notifications. The Prefect open source library provides you with blocks - sharable, reusable, and secure configuration with code. Blocks can be created and edited through the Prefect UI or Python code, allowing for easier collaboration with team members of all skill levels.
You will learn how you can use the Prefect open source library to quickly create reusable blocks. Blocks give you superpowers for secure collaboration. This talk will be fun and informative. There will be code, concepts, and emojis. Software engineers, data engineers, and data scientists who access APIs, move data, or work with a range of technical and non-technical stakeholders will benefit by attending.
/media/cfp/submissions/HFGGPM/duck-car_WPe8mrK.jpg
Jeff Hale
2022-12-02T19:00:00+00:00
19:00
00:30
Talk Track II
cfp-228-data-pipelines-workflows-orchestrating-data-with-dagster
https://global2022.pydata.org//cfp/talk/UVFTLD/
false
Data pipelines != workflows: orchestrating data with Dagster
Talk
en
Data pipelines consist of graphs of computations that produce and consume data assets like tables and ML models.
Data practitioners often use workflow engines like Airflow to define and manage their data pipelines. But these tools are an odd fit - they schedule tasks, but miss that tasks are built to produce and maintain data assets. They struggle to represent dependencies that are more complex than “run X after Y finishes” and lose the trail on data lineage.
Dagster is an open-source framework and orchestrator built to help data practitioners develop, test, and run data pipelines. It takes a declarative approach to data orchestration that starts with defining data assets that are supposed to exist and the upstream data assets that they’re derived from.
Attendees of this session will learn how to develop and maintain data pipelines in a way that makes their datasets and ML models dramatically easier to trust and evolve.
Data pipelines consist of graphs of computations that produce and consume data assets like tables and ML models.
Data practitioners often use workflow engines like Airflow to define and manage their data pipelines. But these tools are an odd fit - they schedule tasks, but don’t understand that tasks are built to produce and maintain data assets. They struggle to represent dependencies that are more complex than “run X after Y finishes” and lose the trail on data lineage. They manage production workflows, but make it hard to work with pipelines in local development, unit tests, CI, code review, and debugging.
Dagster is an open-source framework and orchestrator built to help data practitioners develop, test, and run data pipelines. It takes a declarative approach to data orchestration that starts with defining data assets that are supposed to exist and the upstream data assets that they’re derived from. It lets the git repo become the source of truth on data, so pushing data changes feels as safe as pushing code changes. It supports an organization-wide data asset lineage graph, that can be subsetted for scheduled or ad-hoc execution. It’s built to facilitate data pipelines in local development, unit testing, CI, code review, staging environments, and debugging.
Attendees of this session will learn how to develop and maintain data pipelines in a way that makes their datasets and ML models dramatically easier to trust and evolve.
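The declarative idea can be sketched in plain Python (this is an illustration, not Dagster's API): declare each asset together with its upstream assets, and let the orchestrator derive the execution order from the graph instead of hand-wiring "run X after Y".

```python
# Declarative asset graph: each asset names its upstreams; execution order
# is derived by topological sort, so upstreams always materialize first.
assets = {
    "raw_events": [],
    "cleaned_events": ["raw_events"],
    "features": ["cleaned_events"],
    "churn_model": ["features"],
}

def materialization_order(assets):
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for upstream in assets[name]:
            visit(upstream)  # materialize dependencies first
        order.append(name)
    for name in assets:
        visit(name)
    return order

print(materialization_order(assets))
# ['raw_events', 'cleaned_events', 'features', 'churn_model']
```

Because the graph is data rather than an imperative schedule, the same declaration supports lineage queries, subsetting for ad-hoc runs, and testing, which is the contrast with task-first workflow engines that the talk draws.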
Sandy Ryza
2022-12-02T19:30:00+00:00
19:30
00:30
Talk Track II
cfp-208-daft-the-distributed-python-dataframe-for-complex-data-images-video-documents-and-more-
https://global2022.pydata.org//cfp/talk/P3J9JA/
false
Daft: the Distributed Python Dataframe for "Complex Data" (images, video, documents and more)
Talk
en
Daft is an open-sourced distributed dataframe library built for "Complex Data" (data that doesn't usually fit in a SQL table such as images, videos, documents etc).
**Experiment Locally, Scale Up in the Cloud**
Daft grows with you and is built to run just as efficiently/seamlessly in a notebook on your laptop or on a Ray cluster consisting of thousands of machines with GPUs.
**Pythonic**
Daft lets you have tables of any Python object such as images/audio/documents/genomic files. This makes it really easy to process your Complex Data alongside all your regular tabular data. Daft is dynamically typed and built for fast iteration, experimentation and productionization.
**Blazing Fast**
Daft is built for distributed computing and fully utilizes all of your machine's or cluster's resources. It uses modern technologies such as Apache Arrow, Parquet and Iceberg for optimizing data serialization and transport.
Daft (https://www.getdaft.io) is an open-sourced dataframe framework:
1. Pythonic and built for "Complex Data" such as images, video and unstructured documents. Columns of the dataframe can be of any arbitrary Python type such as Numpy vectors, PIL Images or any user-defined type! Daft exposes an easy functional interface for loading, querying and processing this data.
2. Built for both interactive experimentation and distributed computing. Daft is built for a smooth local development experience in a REPL/notebook environment with a dynamic type system and intelligent caching. When running large workloads that require more computing power, it scales up seamlessly to thousands of machines on a cluster using Ray.
3. Built for Machine Learning workloads - Daft is perfect for performing data curation for ML training, or scaling up large scale ML inference. It integrates natively with the Ray and PyTorch ecosystem for training input data, efficiently transporting your data into ML training jobs.
/media/cfp/submissions/P3J9JA/daft_illustration_ZwaQnbh.png
Jay Chia, Sammy Sidhu
2022-12-02T20:00:00+00:00
20:00
00:30
Talk Track II
cfp-55-testing-pandas-shoots-leaves-and-garbage-
https://global2022.pydata.org//cfp/talk/CZBP8K/
false
Testing Pandas: Shoots, leaves, and garbage!
Talk
en
"It works on my machine"... those dreaded words.
"I'm not a developer, I don't know how to test"... arghhh.
"Let QA test it"....
No more excuses. Learn how to debug and test Pandas code.
How do you structure Pandas code? How do you debug it? How do you test it?
In this talk, we will use real-world data to explore best practices for writing Pandas code, debugging it, managing data integrity, using pytest, and generating tests with Hypothesis.
No more excuses.
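As a small taste of the approach (the data and function here are made up, not from the talk), wrapping a Pandas cleaning step in a function makes it testable with pytest-style assertions:

```python
# Structure Pandas code as functions, then assert data-integrity properties
# instead of eyeballing the output.
import pandas as pd

def clean(df):
    """Drop rows with missing prices and coerce quantity to int."""
    out = df.dropna(subset=["price"]).copy()
    out["quantity"] = out["quantity"].astype(int)
    return out

raw = pd.DataFrame({"price": [9.99, None, 4.50],
                    "quantity": ["2", "1", "3"]})
cleaned = clean(raw)

assert len(cleaned) == 2  # the None-price row is gone
assert pd.api.types.is_integer_dtype(cleaned["quantity"])  # type enforced
assert (cleaned["price"] > 0).all()  # basic integrity check
print(cleaned)
```

The same function is then easy to feed with Hypothesis-generated DataFrames, which is how the garbage inputs in the title get covered.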
/media/cfp/submissions/CZBP8K/inkscape_Zwqpq1xv30_ahWhoTe.png
Matt Harrison
2022-12-02T21:30:00+00:00
21:30
01:00
Talk Track II
cfp-307-lightning-talks
https://global2022.pydata.org//cfp/talk/CYWTDZ/
false
Lightning Talks
Lightning Talks
en
<b>Lightning Talks are short 5-10 minute sessions presented by community members on a variety of interesting topics.</b>
<b>Order of Presentations</b>
1. Computational Report Generator by Archit Khosla
2. Toolkit for Transcribing Audio in Low Confidence Settings by Josh Seltzer
3. Makefiles for Automated Replacement of Modified Plots in Reports and Presentations by Cameron Devine PhD
4. Sailing through AIS data using movingpandas by Ray Bell
5. Bobby: An Open Source Template for Strong Linting on Python Projects by Aidan Russell
Archit Khosla, Josh Seltzer, Cameron Devine PhD, Ray Bell, Aidan Russell
2022-12-02T08:00:00+00:00
08:00
01:30
Workshop/Tutorial I
cfp-189-level-up-you-jupyter-notebooks-with-vs-code
https://global2022.pydata.org//cfp/talk/WYCBXN/
false
Level up your Jupyter Notebooks with VS Code
Tutorial
en
Visual Studio Code is one of the most popular editors in the Python and data science communities, and the extension ecosystem makes it easy for users to customize their workspace for the tools and frameworks they need.
Jupyter notebooks are one such popular tool, and there are some really great features for working in notebooks that can reduce context switching, enable multi-tool workflows, and utilize powerful Python IDE features in notebooks.
This tutorial is geared toward all Jupyter Notebook users who either have an interest in or are regularly using VS Code.
Participants will learn how to use some of the best VS Code features for Jupyter Notebooks, as well as a bunch of other tips and tricks to run, visualize and share your notebooks in VS Code.
Some familiarity with Jupyter Notebooks is required, but experience with VS Code is not necessary.
Materials and sample notebooks for the tutorial will be hosted on GitHub, which participants will be able to launch in their browser in the VS Code editor with GitHub Codespaces with no local setup.
Participants who have VS Code installed locally will also be encouraged to open one of their own notebooks and try out the features as we go along.
I have been using both Jupyter Notebooks and VS Code for almost 10 years in a variety of academic and industrial contexts, and love sharing what I have learned along the way.
This tutorial is a collection of the most interesting tools/tips that I have learned or frequently demo to other notebook users in the scientific Python community <3
2 min: Introduction to the topic and speaker
4 min: Introducing a sample notebook project using pandas, matplotlib, and scikit-learn
12 min: Starting up a project in VS Code
5 min: Notebook UI
5 min: Configuring your kernel/where your code is run (local and remote)
2 min: Git is easy(er) with built-in source control in VS Code
20 min: Editing notebooks
8 min: Intellisense code hints/tips
4 min: Keyboard shortcut configuration and code snippet templates
8 min: Documentation + Linting extensions for Python
20 min: Executing Notebooks
6 min: Debugging code in notebooks
6 min: Testing in notebooks
3 min: Interactive widgets in notebooks
3 min: Extensions to help connect to cloud notebook kernels
20 min: Sharing Notebooks
5 min: Comparing notebook changes
5 min: Saving/exporting images and notebooks
10 min: Live Share for group programming + Codespaces
10 min: What's coming next for Notebooks in VS Code
2 min: Wrap-up and where to learn more
Sarah Kaiser (She/Her)
2022-12-02T10:00:00+00:00
10:00
02:00
Workshop/Tutorial I
cfp-147-working-session-for-the-bayesian-python-ecosystem
https://global2022.pydata.org//cfp/talk/BNUAL8/
true
Working session for the Bayesian Python Ecosystem
Workshops
en
There is a rich ecosystem of libraries for Bayesian analysis in Python, and carrying out a Bayesian workflow, from model creation through sampling and model checking to presenting results, requires using multiple libraries at the same time.
This working session aims to bring together practitioners to discuss and address interoperability issues within the ecosystem. Attendees should expect a hands-on get together where they will meet other Bayesian practitioners with whom to discuss the issues faced and contribute to open source libraries with issues, pull requests and discussions.
The audience of this workshop are Bayesian practitioners or people interested in Bayesian analysis who are already comfortable with the core libraries of the scientific python ecosystem.
The workshop will start by assessing which libraries the attendees use and explaining the support we can provide for each of them, i.e. advice on how and where it is best to let maintainers know about interoperability issues, strategies for fixing existing issues, or support in submitting PRs.
We will then have an unstructured time/brainstorming session for attendees to discuss interoperability issues and choose what to work on. The goal is for attendees to form groups of 2-3 people and work together on 1-2 issues they agree on. We expect these initial activities to take 30-40 minutes, the rest of the time will be dedicated to group working.
We will have a few maintainers from multiple libraries in the ecosystem who will be available to provide support during the rest of the working session. Attendees will be able to discuss design ideas and limitations, get help with the contributing process (which may differ slightly between libraries), and get reviews on proof-of-concept implementations to overcome the interoperability issues (potentially from maintainers of multiple libraries).
Oriol Abril Pla
2022-12-02T13:00:00+00:00
13:00
02:00
Workshop/Tutorial I
cfp-217-real-world-perspectives-to-avoid-the-worst-mistakes-using-machine-learning-in-science
https://global2022.pydata.org//cfp/talk/Y9VFDD/
false
Real-world Perspectives to Avoid the Worst Mistakes using Machine Learning in Science
Workshops
en
Numerous scientific disciplines have noticed a reproducibility crisis of published results. While this important topic was being addressed, the danger of non-reproducible and unsustainable research artefacts arose from using machine learning in science. The worst of this has been avoided by better education of reviewers, who nowadays have the skills to spot insufficient validation practices. However, there is more potential to further ease the review process, improve collaboration and make results and models available to fellow scientists. This workshop will teach practical lessons that scientists can directly apply to elevate the quality of ML applications in science.
It seems we avoided the worst signs of the reproducibility crisis when applying machine learning in science, thanks to better education for reviewers, easier access to tools, and a better understanding of zero-knowledge models.
However, there is much more potential for ML in science. The real world comes with many pitfalls: applying machine learning is very promising, but verifying scientific results is complex. Nevertheless, many open-source contributors in the field have worked hard to develop practices and resources to ease this process.
We discuss pitfalls and solutions in model evaluation, where the choice of appropriate metrics and adequate splits of the data is important. We discuss benchmarks, testing, and machine learning reproducibility, where we go into detail on pipelines. Pipelines are a great showcase for avoiding the main reproducibility pitfalls, as well as a tool to bridge the gap between ML experts and domain scientists. Interaction with domain scientists, involving existing knowledge, and communication are a constant undercurrent in producing trustworthy, validated, and reliable machine learning solutions.
Overall, this workshop relies on existing high-quality resources like the Turing Way, more applied tutorials like Jesper Dramsch’s Euroscipy tutorial on ML reproducibility, and professional tools like the Ersilia Hub. We utilize real-world examples from different scientific disciplines, e.g. weather and biomedicine.
In this workshop, we present a series of talks from invited speakers who are experts in the application of data science and machine learning to real-world applications. Each talk will be followed by an interactive session to turn the theory into practical examples the participants can directly implement to improve their own research. Finally, we close with a discussion that invites active participation and engagement with the speakers as a group.
## Schedule
| **Time** | **Topic** | **Speaker** |
|---|---|---|
| 5 min | Opening of workshop | Jesper Dramsch |
| 20 min | Why and how to make ML reproducible? | Jesper Dramsch |
| 25 min | Evaluating Machine Learning Models | Valerio Maggio |
| 10 min | ML for scientific insight | Mike Walmsley |
| 10 min | Break & Chat | |
| 10 min | Testing in Machine Learning | Goku Mohandas |
| 25 min | Integrating ML in experimental pipelines | Gemma Turon |
| 10 min | Discussion & Audience Questions | All Speakers |
| 5 min | Closing | Jesper Dramsch|
## Speakers
- Jesper Dramsch, [ECMWF](https://www.ecmwf.int) [[🌐](https://dramsch.net) [💻](https://github.com/JesperDramsch) [👔](https://www.linkedin.com/in/mlds/)]
- Valerio Maggio, [Anaconda Inc.](https://www.anaconda.com/) [[💻](https://github.com/leriomaggio) [👔](https://www.linkedin.com/in/valeriomaggio/)]
- Goku Mohandas, [madewithml.com](https://madewithml.com) [[🌐](https://madewithml.com/) [💻](https://github.com/GokuMohandas) [👔](https://www.linkedin.com/in/goku/)]
- Gemma Turon, [Ersilia](https://www.ersilia.io/) [[💻](https://github.com/GemmaTuron) [👔](https://www.linkedin.com/in/gemma-turon/)]
- Mike Walmsley, University of Manchester [[🌐](https://walmsley.dev/) [💻](https://github.com/mwalmsley) [👔](https://www.linkedin.com/in/m1kewalmsley/)]
## Talks
### Overview Talk: Why and how to make ML reproducible? (Jesper Dramsch)
The overview talk serves to set the scene and present different areas where researchers can increase the quality of their research artefacts that use ML. These increases in quality are achieved by using existing solutions to minimize the impact these methods have on researcher productivity.
This talk loosely covers the topics Jesper discussed in their Euroscipy tutorial which will be used for the interactive session here:
[https://github.com/JesperDramsch/euroscipy-2022-ml-for-science-reproducibility-tutorial](https://github.com/JesperDramsch/euroscipy-2022-ml-for-science-reproducibility-tutorial)
Topics covered:
1. Why make it reproducible?
2. Model Evaluation
3. Benchmarking
4. Model Sharing
5. Testing ML Code
6. Interpretability
7. Ablation Studies
These topics are used as examples of “easy wins” researchers can implement to disproportionately improve the quality of their research output with minimal additional work using existing libraries and reusable code snippets.
### Evaluating Machine Learning Models (Valerio Maggio)
In this talk, we will introduce the main features of a **Machine Learning (ML) Experiment**. In the first part, we will dive into understanding the benefits and pitfalls of common evaluation metrics (e.g. accuracy vs F1 score), whilst the second part will be mainly focused on designing reproducible and (statistically) robust evaluation pipelines.
The main lessons learnt and takeaway messages from the talk will be showcased in an interactive tutorial.
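To make the accuracy-versus-F1 pitfall concrete, here is a small illustration (all numbers invented, not from the talk) of how a majority-class baseline can look excellent on accuracy while being useless by F1:

```python
# Imbalanced toy data: 95% negative, 5% positive.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a baseline that always predicts the majority class

# Accuracy: fraction of correct predictions
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# F1 for the positive class, computed from scratch
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy)  # 0.95 -- looks great
print(f1)        # 0.0  -- the model never finds a positive case
```

With scikit-learn, `sklearn.metrics.accuracy_score` and `f1_score` compute the same quantities; the point is that on imbalanced data the two metrics can disagree sharply.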
### ML for scientific insight (Mike Walmsley)
Building ML models is easy; answering science questions with them is hard. This short talk will introduce common issues in applying ML, illustrated with real failures from astronomy and healthcare - including some by the speaker. We hope sharing the lessons learned from these failures will help participants build useful models in their own field.
### Testing in Machine Learning (Goku Mohandas)
- What testing ML is and how it differs from testing deterministic code
- Why it is important to test ML artifacts (data + models)
- What testing data and testing models look like (with quick code snippets so people can see it in practice)
- Concluding thoughts on how testing relates to monitoring and continual learning
### Integrating ML in experimental pipelines (Gemma Turon)
This talk will focus on the implementation of ML models in actual experimental pipelines. We will review strategies for sharing pre-trained models that can be readily adopted by non-expert users, and how to bridge the gap between dry-lab and wet-lab researchers, with case studies in the field of biomedicine. The interactive tutorial will exploit one such pretrained open-source model hub repository, the Ersilia Model Hub.
Additionally, here are some papers for further reading for the interested:
## Resources
* Model evaluation: [https://arxiv.org/abs/1811.12808](https://arxiv.org/abs/1811.12808)
* General Reproducibility: [https://arxiv.org/abs/2003.12206](https://arxiv.org/abs/2003.12206)
* ML Pitfalls: [https://arxiv.org/abs/2108.02497](https://arxiv.org/abs/2108.02497)
/media/cfp/submissions/Y9VFDD/Ampersands-ML-workshop_5GOHMsT.png
Jesper Dramsch, Valerio Maggio, Gemma Turon, Mike Walmsley, Goku Mohandas
2022-12-02T16:00:00+00:00
16:00
02:00
Workshop/Tutorial I
cfp-198-data-annotation-for-humans-creating-and-refining-annotation-guidelines-from-a-ux-perspective
https://global2022.pydata.org//cfp/talk/BHZWDC/
false
Data annotation for humans: Creating and refining annotation guidelines from a UX perspective
Workshops
en
In this workshop, attendees will learn how to create data annotation guidelines from a user experience (UX) perspective.
Creating annotation guidelines from a UX perspective means imbuing them with usability, resulting in a better experience for annotators, and more effective and productive annotation campaigns. With Python being at the forefront of Machine Learning and data science, we believe that the Python community will benefit from learning more about the design of data annotation guidelines and why they are essential for creating great machine learning applications.
**What is this workshop about?**
Data annotation, or the process of adding structured information to raw data, is the invisible star player of many machine learning tasks such as sentiment analysis, named-entity recognition, and question answering, among others. If annotations are low-quality, they may be costly to redo as manual work can be expensive and time-consuming. Annotation guidelines, in turn, are technical documents meant to be used by annotators to define the task at hand, help resolve challenging annotation cases, and in general, direct annotation efforts across annotators for consistency and reliability. These documents are an essential component of many machine-learning projects, and there is evidence that they can indirectly impact annotation outcomes.
Data annotation can be complex and labor-intensive. Annotators rely heavily on annotation guidelines as they spend tens or even hundreds of hours in front of a computer labeling data. Despite their importance, annotation guidelines are not always designed with usability in mind; they often lack adequate structure, are difficult to navigate, and their wording can be ambiguous or confusing. As we will see, these shortcomings can directly impact annotation quality and create unnecessary stress and frustration for both annotators and annotation managers.
In this workshop, attendees will learn the basics of human-centered design and how we can apply UX research techniques to create robust and easier-to-use annotation guidelines. The workshop will include one theoretical and two practical components. The theoretical component will deal with the basic outline of language annotation guidelines, show examples of dry or difficult-to-use annotation guidelines, and introduce UX concepts and techniques that can help us rethink how to design better annotation guidelines.
The practical components will deal with the hands-on creation and testing of annotation guidelines. Attendees will experience both sides of the annotation task: the managerial side and the annotators’ side. For data annotation, we will provide attendees with remote access to Prodigy for the workshop duration. Prodigy is a commercial Python-based tool for data annotation. However, none of the content in this workshop depends on a particular tool, and attendees should be able to apply their newly acquired skills and best practices with whichever tool they choose to work with in the future.
After this workshop, attendees can expect a working knowledge of several UX research techniques and practical experience in creating and improving annotation guidelines.
**Curriculum**
Introduction:
- Topic #1: Why is data annotation important? (Theoretical)
- Topic #2: What is it like to be an annotator? (Hands-on with data annotation tool)
- Expected duration: 15 minutes
Part one: Annotation guidelines and UX design (Theoretical with group activities)
- Expected duration: 25 minutes
Part two: Defining data annotation tasks. A practical perspective. (Hands-on)
- Expected duration: 15 minutes
1-hour mark
- 5-minute recess + 5-minute QA
Part three: Annotation guidelines re-design. Put your new UX knowledge into practice. (Hands-on)
- Expected duration: 20 minutes
Part four: Iterative reliability testing: using annotator agreement to finalize your guidelines
- Expected duration: 15 minutes
Final thoughts and QA
- Expected duration: 10 minutes
**Requirements**
- No computer programming knowledge or language annotation experience is required.
- Attendees experienced in data annotation will still find benefit from learning how to apply UX principles to the creation of annotation guidelines.
- Workshop materials will be distributed during the workshop and made available publicly via the "resources" after the workshop.
**About us**
The facilitators are computational linguists, machine learning engineers, and NLP practitioners with experience creating annotation guidelines for academia and industry.
Damian Romero
2022-12-02T19:00:00+00:00
19:00
01:30
Workshop/Tutorial I
cfp-59-simulations-in-python-discrete-event-simulation-with-simpy
https://global2022.pydata.org//cfp/talk/U7ZHRW/
false
Simulations in Python: Discrete Event Simulation with SimPy
Tutorial
en
Add to your machine learning arsenal with an introduction to simulation in Python using SimPy! Simulations are increasingly important in machine learning, with applications that include simulating the spread of COVID-19 to make decisions about public policy, vaccination and shutdowns.
You can use simulation to answer questions like: can you increase profits by adding more tables or staff to your restaurant? You can also use simulation to create data for modeling when it's hard or impossible to get (e.g. simulate purchases in response to promotions on certain products to see if they increase sales).
To benefit from this talk, you'll need to know a small amount of Python, specifically how to write functions and simple classes. No previous knowledge of simulation needed! If you know about simulation in another language and want to see a SimPy example, you can also benefit from this talk. You'll get a Jupyter notebook with a simple but fully worked out example to follow along with and to study on your own time after the conference.
Discrete event simulation (DES) allows you to study the behavior of a process or system over time. Simulations are used to study the effects of process changes (e.g. what happens to wait times if you increase/decrease the number of call center agents working at a given time) and to create data for modeling when it's hard or impossible to get (e.g. simulate purchases in response to promotions on certain products to see if they increase sales).
In this tutorial, you'll be quickly and efficiently introduced to the basics of simulation through a simple but fully worked out example in SimPy, a popular package for DES in Python. You'll learn about event handling and the priority queue. You'll be able to run a simulation and record metrics like wait times to enable the improvement of real-life processes.
To get the most out of the talk, you should be comfortable with writing basic code in a Jupyter notebook environment. This includes knowing how to write functions and (basic) classes. If you're not super comfortable with writing classes, you can still follow along with most of the simulation logic and can then study the worked out example on your own later.
We'll spend about 20 minutes getting up to speed with discrete event simulation using a mocked up non-code example. Here we'll introduce concepts like entities, resources, events, states and queues.
We'll then spend 60 minutes working through an already written SimPy example. We'll introduce SimPy concepts such as the priority queue and event handling. Being already familiar with Python concepts like functions, classes and instantiating objects from user-defined classes will make the code easier to follow (instantiating objects just means creating a new object of the type of your user-defined class). You'll get a link to a Jupyter notebook with the code that you can either download from GitHub or run on Google Colab.
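The priority queue and event handling at the heart of discrete event simulation can be sketched, independently of SimPy, in a few lines of standard-library Python (a toy illustration of the mechanism, not SimPy's actual implementation):

```python
import heapq

# A discrete event simulation is driven by a priority queue of
# (time, event) pairs, always processing the earliest event next.
events = []
heapq.heappush(events, (2, "customer A arrives"))
heapq.heappush(events, (5, "customer A leaves"))
heapq.heappush(events, (3, "customer B arrives"))

log = []
clock = 0
while events:
    clock, event = heapq.heappop(events)  # jump the clock to the next event
    log.append((clock, event))

print(log)
# [(2, 'customer A arrives'), (3, 'customer B arrives'), (5, 'customer A leaves')]
```

SimPy wraps this machinery: its `Environment` maintains the event queue, and processes written as generator functions yield events (like timeouts or resource requests) onto it.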
Lara Kattan
2022-12-02T10:00:00+00:00
10:00
01:30
Workshop/Tutorial II
cfp-70-data-visualisation-with-seaborn
https://global2022.pydata.org//cfp/talk/AMGU99/
false
Data visualisation with Seaborn
Tutorial
en
Want to create beautiful and complex visualisations of your data with concise code? Look no further than Seaborn, Python’s fantastic plotting library which builds on the hugely popular Matplotlib package. This hands-on tutorial will provide you with all the necessary tools to communicate your data insights with Seaborn.
Over the last few decades, a plethora of Python packages have been developed to tackle a range of data visualisation problems. This tutorial will provide a hands-on introduction to Seaborn, a fantastic open-source plotting library that builds on the Matplotlib package. Seaborn allows complex data visualisations to be created simply and easily, whilst also improving on the default look and feel of Matplotlib figures.
Topics covered:
- Overview of Seaborn’s data visualisation functions (kernel density estimation, bivariate distributions, regression models, etc)
- Creating multi-plot grids with Seaborn
- Customising Seaborn plots (time permitting)
By the end of this session, you will be able to:
- use a range of techniques for communicating your data insights
- create beautiful graphics with concise code
- compose complex visualisations as well as standard displays such as scatter plots, histograms, boxplots and more.
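As a taste of the concise code this session is about, here is a minimal sketch (the column names and numbers are invented for illustration) of a themed plot and a multi-plot grid in Seaborn:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import pandas as pd
import seaborn as sns

# A small hand-made dataset standing in for your own data
df = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue", "Wed", "Wed"],
    "sales": [10, 12, 9, 14, 11, 15],
    "store": ["A", "B", "A", "B", "A", "B"],
})

sns.set_theme()  # Seaborn's improved default look and feel

# One call each: an aggregated bar chart, and a per-store multi-plot grid
ax = sns.barplot(data=df, x="day", y="sales")
grid = sns.relplot(data=df, x="day", y="sales", col="store")
grid.savefig("sales_by_store.png")
```

The `relplot` call returns a `FacetGrid`, which is also the entry point for the multi-plot customisation covered in the tutorial.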
No previous experience in Seaborn is necessary for this tutorial. However, basic familiarity with Pandas DataFrames and plotting with Matplotlib would be useful.
This tutorial will be run using an online environment with all the dependencies and libraries pre-installed.
The tutorial materials and slides are available at https://github.com/jumpingrivers/2022-pydata-global-seaborn.
To enable a prompt start, please follow the link to the welcome page (found on the first slide) and enter your email address and the master password (also on the first slide). This will generate a personal username and password for you to access the online training environment. Don't worry if you lose your username/password. Re-submitting your email will generate the same login details each time.
/media/cfp/submissions/AMGU99/seaborn_logo_vd4GJCH.png
Myles Mitchell, Parisa Gregg
2022-12-02T12:00:00+00:00
12:00
01:30
Workshop/Tutorial II
cfp-27-ipyvizzu-story-a-new-open-source-charting-tool-to-build-create-and-share-animated-data-stories-with-python-in-jupyter
https://global2022.pydata.org//cfp/talk/ZKRAQY/
false
ipyvizzu-story - a new, open-source charting tool to build, create and share animated data stories with Python in Jupyter
Tutorial
en
Sharing and explaining the results of your analysis can be a lot easier and much more fun when you can create an animated story of the charts containing your insights. [ipyvizzu-story](https://github.com/vizzuhq/ipyvizzu-story) - a new open-source presentation tool for Jupyter & Databricks notebooks and similar platforms - enables just that using a simple Python interface.
In this workshop, one of the creators of ipyvizzu-story introduces this tool and helps the audience take the first steps in utilizing the power of animation in data storytelling. After the workshop, participants will be able to build and present animated data stories independently.
After a brief introduction to the technology, workshop participants will see a step-by-step walkthrough of creating an interactive, animated data story within Jupyter.
Outline of the workshop:
- Introduction to Vizzu, the open-source C++/JavaScript lib behind ipyvizzu
- Key concepts when using a generic chart building and morphing engine.
- The four key parts of the animate method - the central element of ipyvizzu-story
- Adding data with and without using pandas
- Chart configuration options
- Styling
- Animation options
- Building a story (slides, steps)
- Q&A
/media/cfp/submissions/ZKRAQY/ipyvizzu_open-source_lib_for_animated_data_stories_sm_6TfPPqE.gif
Peter Vidos
2022-12-02T13:30:00+00:00
13:30
01:30
Workshop/Tutorial II
cfp-263-building-a-machine-learning-platform-with-oss-in-90-min
https://global2022.pydata.org//cfp/talk/FLQLXY/
false
Building a Machine Learning Platform with OSS in 90 min
Tutorial
en
Have you ever wondered what it takes to build a production-grade machine learning platform? With so many OSS tools and frameworks, it can get overwhelming to make everything work together. In this workshop we will build a production-grade model training, model serving, and model monitoring platform on AWS EKS. Nothing will be local. These ideas can help ML engineers, applied data scientists, and researchers extend them further and develop a holistic picture of building an ML platform on OSS.
Building a machine learning platform takes many months and years of effort from multiple engineers. There are several components of an ML platform, e.g. an environment to explore and train models, reproducible pipelines, a store for model artifacts, deployment of models on scalable infrastructure such as Kubernetes, monitoring of deployed models, automated retraining, etc. I have been building some of these components as an ML platform engineer for over 3 years. I am passionate about learning and using OSS to build ML platform components on Kubernetes and sharing them with the community. While 90 minutes will never be enough to build each and every component, in this talk we will show engineers and scientists how to build some of the critical pieces using OSS. We will walk through a use case of setting up a training pipeline, experiment tracking, a model registry, and a model deployer on AWS EKS. We will be solely using OSS frameworks running on Kubernetes on AWS. It will be mostly a live demo, and the code and instructions will be shared as a GitHub repo and instruction document.
Anindya Saha
2022-12-02T16:00:00+00:00
16:00
02:00
Workshop/Tutorial II
cfp-40-too-much-data-when-big-data-starts-to-become-a-bad-idea
https://global2022.pydata.org//cfp/talk/GEULJ9/
false
Too much data? When big data starts to become a bad idea
Workshops
en
Nowadays we know the social media and tech giants are harvesting tons of data from their users, and most of us agree that the capability of these companies to deliver their suggestions and customization for you is driven by big data.
However, this brings a question: Is more data always better? Does more data equal a more accurate model? When do you need big data and when does it start becoming a bad idea? Let's find out in this panel session.
Nowadays we know the social media and tech giants are harvesting tons of data from their users, and most of us agree that the capability of these companies to deliver their suggestions and customization for you is driven by big data.
However, this brings a question: Is more data always better? Does more data equal a more accurate model? When do you need big data and when does it start becoming a bad idea?
While big data seems like a golden ticket to solving all problems, it also requires more resources to manage and make use of. Also, more data does not mean better data quality; it may be the opposite: the more data there is, the harder it is to maintain good data quality.
Being able to judge how much data we want and knowing when to stop seems like a thing that many of us overlook. In this panel session, we will invite leaders in the industry to discuss whether or not we should always aim for more data and if not, when to stop. We will also talk about how to judge how much data we need and what to do if we have too much or too little data.
Panelists:
- Katrina Riehl, PhD - Head of Streamlit Data Team @ Snowflake, President of Board of Directors @ NumFOCUS
- Jesper Sören Dramsch
- Alexander Hendorf
- John Sandall - CEO & Principal Data Scientist @ Coefficient
Cheuk Ting Ho, Jesper Dramsch, Alexander CS Hendorf, Katrina Riehl, John Sandall
2022-12-02T19:00:00+00:00
19:00
01:30
Workshop/Tutorial II
cfp-213-missing-data-in-the-age-of-machine-learning
https://global2022.pydata.org//cfp/talk/8DKYRH/
false
Missing Data in the Age of Machine Learning
Tutorial
en
Machine learning algorithms, especially artificial neural networks, are not tolerant of missing data. Many practitioners simply remove records with missing fields without any consideration for the potential statistical bias that might be introduced. The field of imputation has matured, with imputations not only predicting missing values but also reflecting the uncertainty in the prediction. Traditional statistical estimators make full use of the benefits offered by advanced imputation techniques. This tutorial illustrates techniques and architectures that can incorporate advanced imputation techniques into machine learning pipelines, including artificial neural networks.
* The tutorial illustrates how artificial neural networks can be used as a missing data imputer. The description of these neural network imputers shows how bias introduced by the treatment of missing values can affect a neural network model. The tutorial then illustrates how imputation techniques can be incorporated into a machine learning pipeline. Finally, several architectures and pipeline techniques are demonstrated, showing how to incorporate the statistical uncertainty present in imputed data, from techniques such as multiple imputation, into the training and operation of a machine learning model.
* This tutorial is intended for participants with an intermediate level understanding of Python and basic understanding of Pandas. Some knowledge of SciKit-Learn, Keras and Tensorflow is useful but not required. The majority of the tutorial uses Jupyter notebooks, so participants should be comfortable in that environment.
**Curriculum outline is as follows:**
* Neural Network Autoencoder Imputers
* Participants will apply denoising autoencoders to simple imputation tasks.
* Autoencoder techniques require an initial imputation, various initial imputations are applied allowing the participant to see biasing effects of imputation on a neural network model.
* Building Neural Network Pipelines with Imputers
* Participants will build machine learning pipelines that address missing data. Simple example pipelines are built using scikit-learn’s pipelines.
* These pipelines accommodate the three phases of machine learning usage: training, testing, and production.
* Incorporation of multiple imputation
* Multiple imputations capture uncertainty in predicted missing values, by providing a number of possible imputed values for a given missing value. Several methods of incorporating multiple imputed values are demonstrated in the tutorial.
* Methods demonstrated include using data augmentation and ensembles
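Outside the neural-network setting, the "imputer inside a pipeline" idea from the curriculum can be sketched with scikit-learn (a minimal illustration with toy data, not the tutorial's actual notebooks):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data (hypothetical values) with missing entries marked as np.nan
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [8.0, 9.0]])
y = np.array([0, 0, 1, 1])

# Fitting the imputer inside the pipeline means the same trained
# imputation is applied consistently at training, testing and production.
pipe = Pipeline([
    ("impute", IterativeImputer(random_state=0)),
    ("model", LogisticRegression()),
])
pipe.fit(X, y)
print(pipe.predict(X))
```

For the multiple-imputation part, one option is `IterativeImputer(sample_posterior=True)` with several random states, which draws multiple plausible values per missing entry for use in data augmentation or ensembles.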
Haw-minn Lu, Haoyin Xu
2022-12-02T22:30:00+00:00
22:30
01:00
Community Events & Sponsor Sessions
cfp-311-pydata-pub-quiz
https://global2022.pydata.org//cfp/talk/GU9AJ7/
true
PyData Pub Quiz
Talk
en
Join us for the traditional PyData Pub Quiz, hosted by quizmasters James Powell and Cameron Riddell. The event is open to everyone and will be located in Gather.
Come as a team or find one when you arrive, it’s up to you. Each table is a Gather private space so you can discuss answers with your team. Show off your knowledge and learn something new.
Quizmasters James Powell, Cameron Riddell
2022-12-03T08:00:00+00:00
08:00
00:30
Talk Track I
cfp-257-bon-voyage-leading-machine-learning-research-journeys-with-happy-into-production-endings
https://global2022.pydata.org//cfp/talk/GMPTUX/
false
Bon Voyage! Leading machine learning research journeys with happy (into-production) endings
Talk
en
Why is the process of transforming research into a “real world” product so full of question marks? We often know where the research journey starts but have uncertainty about how and WHEN it ends.
In this talk, I will share my own experience leading algorithmic teams through the cycle from research into production of live-streaming AI products. I will also share how to balance agile incremental delivery with the giant leaps forward that require longer research, and how understanding the minimum viable product (MVP) way of thinking can help not only managers but every developer. Learn to outline an MVP for new AI capabilities and move forward with production in mind, while always raising the quality standards. At the end of this session, you will get the boost you need to take the data-driven experimental mindset to the next level, spiced with methodologies you can adapt to development as well as research.
This talk will cover the pain points in machine learning productization and ways to address them.
Topaz will share insights from her journey leading AI algorithmic groups: how to keep raising your ML quality standards, from the data-driven aspects of ML through agile development and the minimum-viable-product (MVP) approach that enables it. This talk will cover how understanding the “MVP” way of thinking can help not only managers but every developer.
Topaz Gilad
2022-12-03T08:30:00+00:00
08:30
00:30
Talk Track I
cfp-168-building-an-ml-application-platform-from-the-ground-up
https://global2022.pydata.org//cfp/talk/GQXZ93/
false
Building an ML Application Platform from the Ground Up
Talk
en
The value of an ML model is not realized until it is deployed and served in production. Building an ML application is more challenging than building a traditional application due to the added complexities from models and data in addition to the application code. Using web serving frameworks (e.g. FastAPI) can work for simple cases but falls short on performance and efficiency. Alternatively, using pre-packaged model servers (e.g. Triton Inference Server) can be ideal for low-latency serving and resource utilization but lacks flexibility in defining custom logic and dependencies. BentoML abstracts these complexities by creating separate runtimes for IO-intensive preprocessing logic and compute-intensive model inference logic. Simultaneously, BentoML offers an intuitive and flexible Python-first SDK for defining custom preprocessing logic, orchestrating multi-model inference, and integrating with other frameworks in the MLOps ecosystem.
BentoML is an open source ML application platform that simplifies model packaging and model management, optimizes model serving workloads to run at production scale, and accelerates the creation, deployment, and monitoring of prediction services. BentoML has an active community of Data Scientists, Engineers, and ML Practitioners around the world with over 1600 members. After studying hundreds of model serving use cases in our community, we would like to share our learning and considerations that went into building and evolving BentoML. The talk begins with simple use cases with considerations of choosing programming languages and frameworks, and expands into complex requirements like performance, resource utilization, multi-model orchestration, monitoring and feedback cycles. The talk will also discuss how BentoML solves the above challenges with real world examples.
/media/cfp/submissions/GQXZ93/pydata_bentoml_UHStDN4.png
Sean Sheng
2022-12-03T09:30:00+00:00
09:30
00:30
Talk Track I
cfp-64-ml-model-traceability-and-reproducibility-by-design
https://global2022.pydata.org//cfp/talk/QVCU3E/
false
ML Model Traceability and Reproducibility by Design
Talk
en
Model traceability and reproducibility are crucial steps when deploying machine learning models. Model traceability allows us to know which version of the model generated which prediction. Model reproducibility ensures that we can roll back to the previous versions of the model anytime we want.
We, as ML engineers, designed reusable workflows which enable data scientists to follow these two principles by design.
We would like to present our reusable workflows, which can be imported and used in every data science project repository to deploy ML models into production following MLOps principles. This heavily depends on the tech stack we have in our organization. We mainly focus on traceability and reproducibility, where we connect the GitHub commit hash, Databricks run_id, and MLflow run_id to each prediction generated by the model, at each API request. In that way, we ensure the following:
For any given ML model, it is possible to look up unambiguously:
- Corresponding code/commit on Git
- Infrastructure used for training and serving
- Environment used for training and serving
- ML model artifacts
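As a minimal sketch of the idea (all identifiers, names, and values here are hypothetical, not the speakers' actual implementation), each prediction can be returned together with the lookup keys listed above:

```python
import json
from dataclasses import dataclass, asdict

# Every prediction carries the identifiers needed to trace it back
# to the code, the training run, and the logged model.
@dataclass
class TracedPrediction:
    prediction: float
    git_commit: str         # commit hash of the training code
    databricks_run_id: str  # training job run
    mlflow_run_id: str      # experiment-tracking run that logged the model

def predict_with_trace(x, model_fn, metadata):
    """Wrap a model call so its output is always paired with trace metadata."""
    return TracedPrediction(prediction=model_fn(x), **metadata)

meta = {
    "git_commit": "3f9c2ab",
    "databricks_run_id": "run-42",
    "mlflow_run_id": "mlf-0017",
}
traced = predict_with_trace(2.0, lambda v: v * 1.5, meta)
print(json.dumps(asdict(traced)))  # what the API could return per request
```

In practice the metadata would be baked into the model artifact at training time (e.g. as MLflow tags) and read back by the serving layer rather than hard-coded.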
Basak Eskili, Maria Vechtomova
2022-12-03T10:00:00+00:00
10:00
00:30
Talk Track I
cfp-119-implementation-and-analysis-of-deep-learning-models-for-codeswitched-speech-classification
https://global2022.pydata.org//cfp/talk/KC893G/
false
Implementation and analysis of deep learning models for codeswitched speech classification
Talk
en
Automatic Speech Recognition (ASR) is used in many devices to identify bilingual speech data. Bilingual speech, or in more scientific terms code-switched speech, mixes one or more languages within a single utterance. In this presentation, learn about different deep learning techniques that can be used to classify such speech utterances. If you are a beginner in this field and don't know where to start, join me to explore this use case and learn something new!
Motivation: Code-switching occurs when a speaker alternates between two or more languages, or language varieties, in the context of a single conversation or situation. In Automatic Speech Recognition (ASR), which powers virtual assistants like Alexa and Siri, code-switching is an important challenge due to globalization. Recent research in multilingual ASR shows potential improvement over monolingual systems. Though all of this looks good in news feeds and tech newsletters, it is important to dive deep and understand how it works at ground level, by implementing such use cases from the basics through personal projects.
Problem Statement:
Automatic Speech Recognition (ASR) is widely used in mobile and personal devices. In countries like India, the content from a provider can be in many languages (Hindi, Tamil, Gujarati, etc.). Indian speakers tend to code-switch (CS), i.e. change languages, while speaking most of the time. The goal here is to discuss two deep learning techniques, using a Convolutional Neural Network and a Recurrent Neural Network, to classify between English, Hindi, and code-switched speech at the utterance level.
Results/Conclusion:
Attendees will gain an understanding of the two main deep learning approaches used for code-switched speech classification. This session will help them first understand the problem at hand and then dive deep into solutions, gaining wider visibility into the topic.
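As a rough, numpy-only illustration of the setup (not the talk's actual CNN/RNN models), an utterance classifier maps a waveform to a fixed feature vector and then to probabilities over the three classes. The features and weights here are toy stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
CLASSES = ["english", "hindi", "code-switched"]

def utterance_features(waveform: np.ndarray, frame: int = 160) -> np.ndarray:
    """Toy front end: framewise log-energies, pooled to a fixed vector.
    Real systems would use MFCCs or filterbanks feeding a CNN/RNN."""
    n = len(waveform) // frame
    frames = waveform[: n * frame].reshape(n, frame)
    log_energy = np.log(np.mean(frames ** 2, axis=1) + 1e-8)
    return np.array([log_energy.mean(), log_energy.std()])

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy linear layer standing in for the deep model's classification head.
W = rng.normal(size=(len(CLASSES), 2))
b = np.zeros(len(CLASSES))

waveform = rng.normal(size=16000)  # 1 second of audio at 16 kHz
probs = softmax(W @ utterance_features(waveform) + b)
predicted = CLASSES[int(np.argmax(probs))]
print(predicted, probs.round(3))
```

The deep models discussed in the talk replace the hand-built feature extractor and linear head with learned convolutional or recurrent layers, but the utterance-level classification framing is the same.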
/media/cfp/submissions/KC893G/Yashasvi_Misra_VV6TUda.jpeg
Yashasvi Misra
2022-12-03T11:00:00+00:00
11:00
00:30
Talk Track I
cfp-149-is-it-possible-to-have-entities-within-entities-within-entities-
https://global2022.pydata.org//cfp/talk/WFCX9M/
false
Is it possible to have entities within entities within entities?
Talk
en
Named entity recognition models might not be able to handle a wide variety of spans, but Spancat certainly can! Within our open-source library for NLP, spaCy, we've created a NER model to handle overlapping and arbitrary text spans. Dive into named entity recognition, its limitations, and how we've solved them with a solution-focused talk and practical applications.
The standard approach to named entity extraction becomes problematic when dealing with a wide variety of spans, like long phrases, non-named entities, and overlapping annotations. Whereas named entities normally have clear boundaries and syntactic properties, spans can be completely arbitrary, posing a problem for some entity extraction applications.
I'll start by talking about NER models, how and why they're used, and some of their limitations. Then I'll introduce Spancat, our solution to the problem of arbitrary and overlapping spans that we've implemented in spaCy, our open-source NLP library for machine learning. You'll leave with an understanding of what named entity recognition is, how a span-labeling model works, and a real-world application to these complex problems.
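A tiny, framework-free illustration of the limitation Spancat addresses: per-token BIO tags can represent only one entity analysis at a time, while a list of spans can hold overlapping annotations. The example entities below are invented:

```python
# Token-level BIO tagging assigns exactly one label per token, so two
# overlapping entities cannot both be encoded in a single tag sequence.
tokens = ["Bank", "of", "America", "Tower"]

# Two gold spans that overlap: (start, end, label), end exclusive.
spans = [(0, 3, "ORG"), (0, 4, "FACILITY")]

def spans_to_bio(spans, n_tokens):
    """Naive BIO encoding; later spans overwrite earlier ones on conflict."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

bio = spans_to_bio(spans, len(tokens))
print(bio)  # ['B-FACILITY', 'I-FACILITY', 'I-FACILITY', 'I-FACILITY']
# The ORG span was clobbered: BIO can hold only one of the two analyses.
# A span-categorization model instead scores candidate spans directly,
# so both (0, 3, 'ORG') and (0, 4, 'FACILITY') can be predicted at once.
```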
/media/cfp/submissions/WFCX9M/spancat_NsP6BEw.png
Victoria Slocum
2022-12-03T11:30:00+00:00
11:30
00:30
Talk Track I
cfp-187-mixing-art-with-python-an-introduction-to-style-transfer
https://global2022.pydata.org//cfp/talk/EN3CWF/
false
Mixing art with Python: an introduction to Style Transfer
Talk
en
What would a sunset painted by van Gogh look like? And the front of your house? This is entirely possible with Deep Learning. The Neural Style Transfer technique composes one image in the style of another, recreating its content in a new style while preserving that content.
In this lecture, the concepts of Deep Learning and neural networks will be introduced, along with a step-by-step guide to carrying out style transfer.
Art is a difficult concept to define. What is art for some may not be art for others, and the concept changes over time, with something that was once not considered art becoming art.
If we define art as a set of techniques used to express something, can we extend this activity to machines? Can a work made by an algorithm be considered art?
Style transfer is a field that brings these questions to the fore, as it allows the creation of works from the style of another (hence its name). The results can be sensational, giving wings to the imagination.
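At the heart of the technique is the Gram matrix of convolutional features, which captures style (correlations between feature channels) while discarding spatial layout (content). A minimal numpy sketch, with random arrays standing in for real network activations:

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Style representation used in neural style transfer: correlations
    between feature channels, with the spatial layout averaged out.
    `features` has shape (channels, height, width)."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)

rng = np.random.default_rng(0)
style_features = rng.normal(size=(4, 8, 8))      # stand-in for a conv layer
generated_features = rng.normal(size=(4, 8, 8))  # features of the image being optimized

# Style loss: how far the generated image's channel correlations are from
# the style image's. Optimizing the pixels to reduce this, while a separate
# content loss preserves layout, is the core of the style transfer method.
style_loss = np.mean(
    (gram_matrix(style_features) - gram_matrix(generated_features)) ** 2
)
print(gram_matrix(style_features).shape)  # (4, 4)
```

In a real pipeline the features come from several layers of a pretrained CNN (VGG in the original formulation) rather than random arrays.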
Isac Moura Gomes
2022-12-03T12:00:00+00:00
12:00
00:30
Talk Track I
cfp-212-a-practical-approach-to-unlock-value-from-data-and-analytics
https://global2022.pydata.org//cfp/talk/ADH33C/
false
A Practical Approach To Unlock Value From Data and Analytics
Talk
en
Abstract
There are many stories about Data Science hires that end up working in silos, buried in ad hoc business requests. According to Gartner, only 20% of analytic insights will deliver business outcomes in 2022. And a large number of Machine Learning models never make it to production. On top of that, work satisfaction among data professionals is staggeringly low; for instance, 97% of data engineers reported feeling burnt out in a 2021 Wakefield Research survey. Furthermore, despite living in the era of information, many business executives are making decisions based on guesswork because they lack timely access to relevant data. This talk covers why many data initiatives fail and, more importantly, how to prevent it. I lay out a number of practical approaches, based on work experience, that will help you unlock the potential of data and analytics, from how to build the case and gain buy-in to promoting a fact-based decision-making culture. This talk is for you if you are a business leader sponsoring data initiatives, if you work in data applications, or if you would benefit from enhanced analytics.
There is no doubt that unlocking value from data and analytics has a direct impact on business performance metrics, including profitability and customer satisfaction. Furthermore, in an uncertain world, the ability of businesses to harness their data potential is more crucial than ever. Nevertheless, delivering successful data-driven initiatives is not as straightforward as it may appear. This talk will focus on actions you can take to extract value from data and analytics.
Maria Feria
2022-12-03T12:30:00+00:00
12:30
00:30
Talk Track I
cfp-154-what-if-causal-reasoning-meets-bayesian-inference
https://global2022.pydata.org//cfp/talk/FQBSP8/
false
What-if? Causal reasoning meets Bayesian Inference
Talk
en
We learn about the world from data, drawing on a broad array of statistical and inferential tools. The problem is that causal reasoning is needed to answer many of our questions, but few data scientists have this in their skill set. This talk will give a high-level introduction to aspects of causal reasoning and how it is complemented by Bayesian inference. A worked example will be given of how to answer what-if questions.
## Core objectives:
- Make the case that causal reasoning is required to answer many important questions in research and business.
- Flesh out how causal reasoning and Bayesian inference complement each other.
- Convey how some what-if questions can be answered using Synthetic Control methods.
- Illustrate how to use Synthetic Control methods in practice with a worked example with Python code snippets (using PyMC) and empirical results.
- Introduce the new Python package [CausalPy](https://github.com/pymc-labs/CausalPy).
The talk will be a high-level overview, with very few (if any) equations. Rather, I focus on conveying the intuition and practical steps to answer what-if questions through concrete examples. I will provide references for those wishing to flesh out their understanding after the talk. This talk is aimed at a broad audience - anyone wanting to learn about the causal structure of the world, whether for fun or profit. Knowledge of causal inference is not assumed, but a beginner to intermediate knowledge of data science would be beneficial. Some familiarity with Bayesian methods would be beneficial, but is not required.
## Talk structure:
- I will provide an overview of ‘what-if?’ questions including: “What would have happened to this patient if they had taken the drug rather than the placebo?” or “How much did an advertising campaign drive the change in user sign-ups?”
- Establish why we cannot solve our problems with traditional statistical and data science methods in the absence of causal reasoning.
- Describe how causal reasoning questions are complemented by the Bayesian approach, namely quantifying our uncertainty, and a focus on parameter estimation instead of hypothesis testing with p-values.
- One main example will focus on how to approach the question “How did Brexit causally affect the United Kingdom’s GDP despite this not being a randomized experiment?” I will intuitively explain how the Synthetic Control method works (by creating a synthetic United Kingdom as a weighted sum of other countries unaffected by Brexit) and how we can implement this, with PyMC code snippets.
- I will summarize by: a) outlining the bounds of Synthetic Control and when other approaches are called for, b) highlighting available Python and R packages (CausalImpact, tfcausalimpact, GeoLift, and a PyMC-based solution), and c) providing further reading and learning resources.
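To make the Synthetic Control idea concrete, here is a minimal numpy sketch that fits donor weights by ordinary least squares on the pre-intervention period; the talk's PyMC implementation would instead place priors on the weights and return a posterior with uncertainty. All series here are simulated:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pre-intervention GDP-like series: 6 donor countries x 20 quarters
# (random walks standing in for real macroeconomic data).
donors_pre = rng.normal(size=(6, 20)).cumsum(axis=1)
true_w = np.array([0.5, 0.3, 0.2, 0.0, 0.0, 0.0])
treated_pre = true_w @ donors_pre + rng.normal(scale=0.05, size=20)

# Fit weights so the weighted donor pool tracks the treated unit
# before the intervention.
w, *_ = np.linalg.lstsq(donors_pre.T, treated_pre, rcond=None)

# Post-intervention, the weighted donor pool is the counterfactual
# ("synthetic" treated unit); observed minus synthetic estimates the
# causal effect of the intervention.
donors_post = rng.normal(size=(6, 8)).cumsum(axis=1)
synthetic_post = w @ donors_post
print(np.round(w, 2))
```

The Bayesian version adds what least squares cannot: a credible interval on the counterfactual, and hence on the estimated effect.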
## References
- Cunningham, S. (2021). Causal Inference: The Mixtape. Yale University Press.
- Huntington-Klein, N. (2021). The effect: An introduction to research design and causality. Chapman and Hall/CRC.
- Facure, M (2021) Causal Inference for The Brave and True, https://github.com/matheusfacure/python-causality-handbook
## GitHub repository
A supporting GitHub repository, with notebooks, can be found at [drbenvincent/pydata-global-2022](https://github.com/drbenvincent/pydata-global-2022).
Benjamin Vincent
2022-12-03T13:30:00+00:00
13:30
00:30
Talk Track I
cfp-235-probabilistic-demand-forecasting-at-scale
https://global2022.pydata.org//cfp/talk/EKQXW9/
false
Probabilistic demand forecasting at scale
Talk
en
It’s common to hear about demand forecasting in the e-commerce ecosystem. Indeed, it plays a pivotal role in logistics and inventory applications. However, due to uncertainty impacting demand and the stochastic nature of most downstream applications, the need for probabilistic demand forecasting emerges. Moreover, for the most realistic use cases, you’ll have to forecast for thousands if not hundreds of thousands of time series. The problem we will explore together is: how can we get probabilistic forecasts that embrace uncertainty and scale?
The talk is light-hearted, contains few math formulas, and is aimed at forecasting practitioners! If you are new to the topic of forecasting, you'll be able to follow! We take the time to pose the problems and develop deeper from there.
It’s common to hear about demand forecasting in the e-commerce ecosystem. Indeed, it plays a pivotal role in logistics and inventory applications. However, due to uncertainty impacting demand and the stochastic nature of most downstream applications, the need for probabilistic demand forecasting emerges. Moreover, for the most realistic use cases, you’ll have to forecast for thousands if not hundreds of thousands of time series. The problem we will explore together is: how can we get probabilistic forecasts that embrace uncertainty and scale?
The talk will cover:
- The problem of demand forecasting in the context of e-commerce: the need for demand forecasting, the importance of clearly defining what actually “demand” means, the curse of lost sales, and the factor of uncertainty.
- How can we capture uncertainty impacting demand? What types of data points and feature engineering are considered?
- A review of deterministic forecasting and its limitations in the context of e-commerce demand: we will discuss what we mean by “deterministic”, what models to consider, and why, despite being traditionally used in the industry, it does not serve the purpose of demand forecasting.
- How to embrace uncertainty with probabilistic forecasting: how can a given model architecture shift towards the probabilistic realm? What metrics and training techniques are predominant at the moment?
- An example of training and inference pipelines from a real-world industry application that scale to hundreds of thousands of time series.
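One of the standard training objectives behind probabilistic forecasts of this kind is the quantile ("pinball") loss, which turns a point forecaster into a quantile forecaster. A minimal numpy illustration (toy demand values; whether this exact metric is used in the speaker's pipelines is an assumption):

```python
import numpy as np

def pinball_loss(y_true: np.ndarray, y_pred: np.ndarray, q: float) -> float:
    """Quantile ('pinball') loss. Minimizing it over y_pred makes the
    forecast the q-th quantile of demand rather than its mean."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

demand = np.array([3.0, 5.0, 2.0, 8.0, 4.0])

# At a high quantile like q=0.9, under-forecasting is penalized far more
# than over-forecasting, so the optimal forecast sits above most observed
# demand -- useful when stock-outs are costlier than excess inventory.
low = pinball_loss(demand, np.full(5, 2.0), q=0.9)   # forecast too low
high = pinball_loss(demand, np.full(5, 8.0), q=0.9)  # forecast high
print(low, high)  # the high forecast scores better at q=0.9
```

Training one model per quantile (or a single model emitting several quantiles) yields the prediction intervals that downstream inventory decisions can consume.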
Hagop Dippel
2022-12-03T14:00:00+00:00
14:00
00:30
Talk Track I
cfp-25-scalable-feature-engineering-with-hamilton
https://global2022.pydata.org//cfp/talk/93N8LR/
false
Scalable Feature Engineering with Hamilton
Talk
en
In this talk we present Hamilton, a novel open-source framework for developing and maintaining scalable feature engineering dataflows. Hamilton was initially built to solve the problem of managing a codebase of transforms on pandas dataframes, enabling a data science team to scale the capabilities they offer with the complexity of their business. Since then, it has grown into a general-purpose tool for writing and maintaining dataflows in python. We introduce the framework, discuss its motivations and initial successes at Stitch Fix, and share recent extensions that seamlessly integrate it with distributed compute offerings, such as Dask, Ray, and Spark.
At Stitch Fix, a data science team’s feature generation process was causing them iteration & operational frustrations in delivering time-series forecasts for the business. In this talk I’ll present Hamilton, a novel open source Python framework that solved their pain points by changing their working paradigm.
Specifically, Hamilton enables a simpler approach for data science & data engineering teams to create, maintain, execute, and scale both the human and computational sides of feature/data transforms.
At a high level, we will cover:
- What Hamilton is and why it was created
- How to use it for feature engineering
- The software engineering best practices Hamilton prescribes that make pipelines more sustainable
- How Hamilton enables out-of-the-box scaling with common distributed compute frameworks
At a low level, you will learn:
- How a data science team at Stitch Fix scaled their team and code base with Hamilton to enable documentation-friendly, unit-testable code
- What Hamilton is and how the declarative paradigm it prescribes offers advantages over more traditional approaches
- How you can easily add runtime data quality checks to ensure the robustness of your pipeline
- How the Ray/Dask/Spark integrations with Hamilton work and how they can help you scale your data
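Hamilton's declarative paradigm can be conveyed without the framework itself: each function's name is the output it produces, and its parameter names are the inputs it depends on. The toy resolver below imitates that wiring (the real library adds type checking, visualization, and the distributed-compute integrations mentioned above); the function names here are invented for illustration:

```python
import inspect

# Each function *is* a node in the dataflow: its name is the output it
# produces and its parameter names are the inputs it depends on.

def spend_mean(spend: float) -> float:
    return spend  # trivial for a scalar; imagine a column mean

def spend_zero_mean(spend: float, spend_mean: float) -> float:
    return spend - spend_mean

def spend_per_signup(spend: float, signups: float) -> float:
    return spend / signups

def execute(funcs: dict, inputs: dict, target: str):
    """Resolve `target` by recursively resolving each parameter by name."""
    if target in inputs:
        return inputs[target]
    fn = funcs[target]
    kwargs = {
        p: execute(funcs, inputs, p)
        for p in inspect.signature(fn).parameters
    }
    return fn(**kwargs)

funcs = {f.__name__: f for f in (spend_mean, spend_zero_mean, spend_per_signup)}
result = execute(funcs, {"spend": 100.0, "signups": 25.0}, "spend_per_signup")
print(result)  # 4.0
```

Because dependencies are declared through names and type hints rather than imperative glue code, every transform is independently unit-testable and self-documenting, which is the design choice the talk explores.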
Elijah ben Izzy, Stefan Krawczyk
2022-12-03T14:30:00+00:00
14:30
00:30
Talk Track I
cfp-178-media-mix-modeling-how-to-measure-the-effectiveness-of-advertising-in-python
https://global2022.pydata.org//cfp/talk/T8WU9K/
false
Media Mix Modeling: How to Measure the Effectiveness of Advertising in Python
Talk
en
Media Mix Modeling, also called Marketing Mix Modeling (MMM), is a technique that helps advertisers to quantify the impact of several marketing investments on sales.
If a company advertises in multiple media (TV, digital ads, magazines, etc.), how can we measure the effectiveness and make future budget allocation decisions? Traditionally, regression modeling has been used, but obtaining actionable insights with that approach has been challenging.
Recently, many researchers and data scientists have tackled this problem using Bayesian statistical approaches. For example, Google has published multiple papers about this topic.
In this talk, I will show the key concepts of a Bayesian approach to MMM, its implementation using Python, and practical tips.
*Agenda*
- Introduction
- What is Media Mix Modeling?
- Data Preparation
- Modeling: Bayesian approach with Carryover & Shape Effect proposed by [Google](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46001.pdf)
- Demo with LightweightMMM
- Insights and Actions
- Summary
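As a taste of the modeling step above, the geometric adstock (carryover) transform from the Google paper can be sketched in a few lines of numpy; LightweightMMM ships its own, more general implementation, so this is purely illustrative:

```python
import numpy as np

def geometric_adstock(spend: np.ndarray, decay: float) -> np.ndarray:
    """Carryover effect: this period's media impact includes a geometrically
    decayed share of earlier spend, so advertising keeps working after the
    money is spent."""
    out = np.zeros_like(spend, dtype=float)
    carry = 0.0
    for t, s in enumerate(spend):
        carry = s + decay * carry
        out[t] = carry
    return out

# Toy weekly TV spend: one burst, silence, then a smaller burst.
tv_spend = np.array([100.0, 0.0, 0.0, 50.0])
print(geometric_adstock(tv_spend, decay=0.5))  # [100.   50.   25.   62.5]
```

In the full Bayesian model, the decay rate (and the shape/saturation parameters) are unknowns with priors, estimated jointly with the media coefficients.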
*Key Takeaways*
- You will understand the key concepts and approaches of Media Mix Modeling.
- You will learn how to build Bayesian models using Python for media spend optimization.
*Target Audience*
- Data analysts and data scientists who are interested in marketing and advertising.
- Data analysts, data scientists, data engineers, software developers, or other IT specialists who want to collaborate with marketing teams more effectively.
- Marketers or executives who want to improve media spending efficiency.
The following knowledge is preferred to get the most out of this talk :
- A basic understanding of Python
*Slide*
https://docs.google.com/presentation/d/1pPra3eLJ9-lYwwvx8_Ivj_sj3V2gmE75cb13comV9pc/edit?usp=sharing
*Demo Code*
https://github.com/takechanman1228/mmm_pydata_global_2022/blob/main/simple_end_to_end_demo_pydataglobal.ipynb
*Reference*
- [Jin, Y., Wang, Y., Sun, Y., Chan, D., & Koehler, J. (2017). Bayesian Methods for Media Mix Modeling with Carryover and Shape Effects. Google Inc.](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46001.pdf)
- [Chan, D., & Perry, M. (2017). Challenges and Opportunities in Media Mix Modeling.](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45998.pdf)
- [LightweightMMM](https://github.com/google/lightweight_mmm) : A lightweight Bayesian Marketing Mix Modeling (MMM) library (Python)
- [Robyn](https://github.com/facebookexperimental/Robyn) : An experimental, automated and open-sourced Marketing Mix Modeling (MMM) package from Facebook Marketing Science (R)
- [sibylhe/mmm_stan](https://github.com/sibylhe/mmm_stan) : Python/STAN Implementation of Multiplicative Marketing Mix Model
Hajime Takeda
2022-12-03T15:00:00+00:00
15:00
00:30
Talk Track I
cfp-120-a-dive-into-time-series-for-the-energy-sector
https://global2022.pydata.org//cfp/talk/9ANXPG/
false
A dive into time series for the energy sector
Talk
en
The energy sector has gained great attention in 2022 due to the current global energy crisis. Understanding which technologies and techniques are suitable for this sector is crucial to guarantee an effective transition to a future with cleaner and more efficient energy sources. This talk aims to educate tech professionals interested in the applications of machine learning in the energy sector, especially when it comes to time series analysis and forecasting. The audience is expected to have a basic understanding of data science and machine learning, and will be introduced to the concepts of time series, as well as the most common techniques used in the sector.
The session is planned to have the following parts:
- Introduction to the topic of energies
- Introduction to time series
- Explanation of the need for time series techniques in the energy domain
- Introduction to use cases of time series applied to the energy domain:
A. Public engagement: example of social media behavior in relation to the energy transition with NLP
B. Safety & Monitoring: examples of computer vision and geospatial analysis over time
C. Profiling consumers: example of time series analysis and unsupervised learning
D. Forecasting Supply and Demand of Energy: examples of energy consumption and production forecast and highlights on relevant techniques such as feature engineering
All examples will be explained at a high level, while still mentioning important concepts and relevant libraries for the audience to try out later on.
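As a small preview of the feature engineering highlighted in use case D, lag and rolling-window features for an hourly consumption series can be built with pandas. The data and feature choices below are toy illustrations:

```python
import pandas as pd

# Hourly energy consumption series (toy values).
consumption = pd.Series(
    [10.0, 12.0, 11.0, 13.0, 15.0, 14.0],
    index=pd.date_range("2022-01-01", periods=6, freq="h"),
)

# Typical features for demand/production forecasting:
features = pd.DataFrame({
    "lag_1": consumption.shift(1),                    # previous hour's load
    "rolling_mean_3": consumption.rolling(3).mean(),  # short-term level
    "hour": consumption.index.hour,                   # daily seasonality
})
print(features.dropna())
```

Real pipelines add calendar features (weekday, holidays), weather covariates, and longer lags matched to the seasonality of the series.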
Rosana de Oliveira Gomes
2022-12-03T15:30:00+00:00
15:30
00:30
Talk Track I
cfp-18-speed-up-python-data-processing-with-vectorization
https://global2022.pydata.org//cfp/talk/GTV9MP/
false
Speed up Python data processing with vectorization
Talk
en
You need to quickly process a large amount of data—but running Python code is slow.
Libraries like NumPy and Pandas bridge this performance gap using a technique called vectorization.
In order to take full advantage of these libraries to speed up your code, it's helpful to understand what vectorization means and when and how it works.
In this talk you'll learn what vectorization means (there are three different definitions!), how it speeds up your code, and how to apply it to your code.
You need to quickly process a large amount of data—but running Python code is slow.
To help bridge this performance gap, the scientific and data science Python communities have built libraries like NumPy and Pandas that speed up computation using a technique called vectorization: batch APIs with fast native processing.
In order to take full advantage of these libraries to speed up your code, it's helpful to understand what vectorization means and when and how it works.
That way you can make sure you're using the fast path and avoiding code patterns that slow down your code.
In this talk you'll learn:
* The three definitions of vectorization: API design, native batch processing, and SIMD.
* How vectorization allows your code to run multiple orders of magnitude faster.
* How to identify both vectorized code, and code that will run slowly by breaking vectorization.
* How to turn slow code into fast vectorized code.
The talk presumes some minimal experience with NumPy or Pandas, and most of the examples will involve NumPy.
However, the same principles apply to Pandas as well, and more broadly to many other data processing libraries as well as databases.
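A minimal before/after of the idea: the same doubling computed element by element in Python versus as one vectorized NumPy call:

```python
import time
import numpy as np

def double_loop(values):
    # Each element dispatches through the Python interpreter.
    out = []
    for v in values:
        out.append(v * 2)
    return out

def double_vectorized(values: np.ndarray) -> np.ndarray:
    # One batch call; the loop runs in native code (and may use SIMD).
    return values * 2

data = np.arange(1_000_000)

start = time.perf_counter()
loop_result = double_loop(data)
t_loop = time.perf_counter() - start

start = time.perf_counter()
vec_result = double_vectorized(data)
t_vec = time.perf_counter() - start

assert (vec_result == np.array(loop_result)).all()
print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.4f}s")

# Breaking vectorization -- e.g. np.array([f(v) for v in data]) or a
# DataFrame.apply with a Python function -- reintroduces per-element
# interpreter overhead and loses the speedup.
```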
Itamar Turner-Trauring
2022-12-03T16:30:00+00:00
16:30
00:30
Talk Track I
cfp-214-developing-battery-materials-with-python
https://global2022.pydata.org//cfp/talk/7W3ZE8/
false
Developing Battery Materials with Python
Talk
en
The electrochemical battery is one of the most important technologies for a renewable future. In this beginner-friendly talk, we will walk through how fundamental quantum mechanics and data science inform how we fine-tune battery materials for higher performance. We will also show how we used these techniques to computationally model a lithium-oxygen battery in Python.
To design better batteries we need to understand the stuff – or matter – they are made out of. But how does one gather understanding of matter? To put it simply, by understanding its electronic structure at the fundamental quantum level. In this talk we will talk about the field of computational materials design for batteries from scratch (no knowledge of physics required!).
This talk will cover:
– Description of the electronic problem – the fundamental building block we must understand
– Brief introduction of density functional theory (DFT) – the model we use to learn fundamental quantum mechanical phenomena that are relevant for battery materials design
– How Python and data science can be used to create DFT programs
– Toy example of modelling a battery
– How machine learning is changing DFT and battery research
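As a toy version of "the electronic problem", the code below solves the 1-D particle-in-a-box Schrödinger equation by finite differences and compares against the analytic energies. Real DFT codes solve a vastly richer eigenvalue problem, but the structure (discretize a Hamiltonian, diagonalize it) is recognizably the same:

```python
import numpy as np

# Single particle in a 1-D box of length L, units with hbar = m = 1.
# The Schrödinger equation -(1/2) psi'' = E psi becomes a matrix
# eigenvalue problem once the second derivative is discretized.
n, L = 200, 1.0
dx = L / (n + 1)

# Finite-difference kinetic-energy operator with hard walls at both ends.
main = np.full(n, 2.0)
off = np.full(n - 1, -1.0)
H = (np.diag(main) + np.diag(off, 1) + np.diag(off, -1)) / (2.0 * dx**2)

energies = np.linalg.eigvalsh(H)[:3]          # three lowest levels
exact = np.array([1, 4, 9]) * np.pi**2 / (2 * L**2)  # E_n = n^2 pi^2 / 2
print(energies.round(2), exact.round(2))
```

Making the grid finer drives the numerical energies toward the analytic values, the same convergence game (with far more sophisticated bases) played in production DFT codes.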
Gabriel Birnbaum
2022-12-03T17:00:00+00:00
17:00
00:30
Talk Track I
cfp-271-reproducible-publications-with-python-and-quarto
https://global2022.pydata.org//cfp/talk/XGJUPX/
false
Reproducible Publications with Python and Quarto
Talk
en
Quarto is an open-source scientific and technical publishing system that builds on standard markdown with features essential for scientific communication. The system has support for reproducible embedded computations, equations, citations, crossrefs, figure panels, callouts, advanced layout, and more. In this talk we'll explore the use of Quarto with Python, describing both integration with IPython/Jupyter and the Quarto VS Code extension.
Quarto is an open-source scientific and technical publishing system built on Pandoc.
Users can author Jupyter notebooks or plain-text markdown documents with code in Python, R, Julia, or Observable. Quarto can publish high-quality articles, reports, presentations, websites, blogs, and books in HTML, PDF, MS Word, ePub, Reveal.js, and more.
Tom Mock
2022-12-03T17:30:00+00:00
17:30
00:30
Talk Track I
cfp-256-classification-through-regression-unlock-the-true-potential-of-your-labels
https://global2022.pydata.org//cfp/talk/77DGQX/
false
Classification Through Regression: Unlock the True Potential of Your Labels
Talk
en
"Is a lion closer to being a giraffe or an elephant?"
It is not a question anyone asks.
So why address that classification problem the same way you would classification of age groups or medical condition severity?
The talk will walk you through a review of regression-based approaches for what may seem like classification problems. Unlock the true potential of your labels!
In the words of David Mumford: "The world is continuous, but the mind is discrete."
We often define categories when breaking down a real-world problem into an ML-based solution. However, real target values may be continuous or at least ordered. This is something to consider and even leverage in the design of your ML model. Are you facing what seems like a classification problem? Take a moment to understand the hidden relations between your “classes”.
Topaz will share practical tips derived from her own experience leading AI algorithmic groups and will cover an overview of approaches that may boost your classifiers!
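A minimal sketch of the regression-then-threshold pattern for ordered classes (label names and rounding rule are illustrative, not the speaker's specific recipe):

```python
import numpy as np

SEVERITY = ["mild", "moderate", "severe"]  # ordered classes

# A regressor trained on the ordinal index can exploit the ordering:
# predicting 1.8 for a "severe" (index 2) case is a smaller mistake than
# predicting 0.1, a distinction plain cross-entropy cannot express.
def to_class(regression_output: np.ndarray) -> np.ndarray:
    """Round continuous predictions back onto the ordered label set."""
    idx = np.clip(np.rint(regression_output), 0, len(SEVERITY) - 1).astype(int)
    return np.array(SEVERITY)[idx]

preds = np.array([-0.3, 0.6, 1.2, 2.7])  # raw regressor outputs
print(to_class(preds))  # ['mild' 'moderate' 'moderate' 'severe']
```

More refined variants learn the thresholds or use dedicated ordinal-regression losses, but even this simple rounding lets the model's errors respect the class ordering.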
Topaz Gilad
2022-12-03T18:00:00+00:00
18:00
01:00
Talk Track I
cfp-310-keynote-pia-mancini
https://global2022.pydata.org//cfp/talk/KGX8RD/
false
Keynote - Pia Mancini
Keynote
en
Pia is the co-founder and CEO of Open Collective.
Pia is a Democracy activist, open source sustainer, co-founder & CEO at Open Collective, a platform that enables communities around the world to raise and spend funds in full transparency. Last year, collectives raised USD 37M, effectively unlocking access to impact funds around the world. She is also co-founder and President of The Open Source Collective, a non profit that provides a financial and admin home for +3000 open source projects around the globe granting them access to project directed funding. Pia is also co-founder and Chair of Democracy Earth Foundation, a Y Combinator backed non profit dedicated to developing technology for democracy around the world.
Pia Mancini
2022-12-03T19:00:00+00:00
19:00
00:30
Talk Track I
cfp-38-i-hate-writing-tests-that-s-why-i-use-hypothesis
https://global2022.pydata.org//cfp/talk/MHY89K/
false
I hate writing tests, that's why I use Hypothesis
Talk
en
Ok, I lied, I still write tests. But instead of the example-based tests that we normally write, have you heard of property-based testing? With Hypothesis, instead of thinking about what data I should test with, the library will generate test data, including boundary cases, for you.
In this talk, we will explore what is property-based testing and why it can do a lot of heavy lifting in writing tests for us. As a contributor, I will introduce Hypothesis, a Python library that can help perform property-based tests with ease.
At the start of the talk, we will understand the power of property-based tests: what they are, how they differ from what we “normally do” (testing by example), and why they are useful in testing our code. This will be followed by demonstrations using Hypothesis. With a few examples, we will get a glimpse of how to create strategies, recipes for describing the test data you want to generate.
After that, we will also explore the Ghostwriter in Hypothesis, which will actually write the tests for you.
This talk is for Pythonistas who are new to property-based testing and who find it hard to decide which parameters to test with. It may give them a new approach to writing tests, one that will be more efficient in some cases.
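To convey the idea without installing anything, here is property-based testing in miniature using only the standard library: assert a round-trip property over many random inputs. Hypothesis does this far better, with principled data generation, automatic edge cases, and shrinking of failing examples to minimal reproducers:

```python
import random

# Code under test: a simple run-length encoder and decoder.
def run_length_encode(s: str) -> list:
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def decode(pairs: list) -> str:
    return "".join(ch * n for ch, n in pairs)

# The property: decoding an encoding returns the original string,
# for *any* input -- not just the examples we happened to think of.
rng = random.Random(0)
for _ in range(200):
    s = "".join(rng.choice("ab") for _ in range(rng.randrange(0, 10)))
    assert decode(run_length_encode(s)) == s
print("200 random cases passed")
```

With Hypothesis, the generation loop collapses to a decorator such as `@given(st.text())`, and failing inputs are automatically shrunk to the smallest counterexample.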
Cheuk Ting Ho
2022-12-03T19:30:00+00:00
19:30
00:30
Talk Track I
cfp-66-bayesian-optimization-fundamentals-implementation-and-practice
https://global2022.pydata.org//cfp/talk/UTM78E/
false
Bayesian Optimization: Fundamentals, Implementation, and Practice
Talk
en
How can we make smart decisions when optimizing a black-box function?
Expensive black-box optimization refers to situations where we need to maximize/minimize some input–output process, but we cannot look inside and see how the output is determined by the input.
Making the problem more challenging is the cost of evaluating the function in terms of money, time, or other safety-critical conditions, limiting the size of the data set we can collect.
Black-box optimization can be found in many tasks such as hyperparameter tuning in machine learning, product recommendation, process optimization in physics, or scientific and drug discovery.
Bayesian optimization (BayesOpt) sets out to solve this black-box optimization problem by combining probabilistic machine learning (ML) and decision theory.
This technique gives us a way to intelligently design queries to the function to be optimized while balancing between exploration (looking at regions without observed data) and exploitation (zeroing in on good-performance regions).
While BayesOpt has proven effective at many real-world black-box optimization tasks, many ML practitioners still shy away from it, believing that they need a highly technical background to understand and use BayesOpt.
This talk aims to dispel that message and offers a friendly introduction to BayesOpt, including its fundamentals, how to get it running in Python, and common practices.
Data scientists and ML practitioners who are interested in hyperparameter tuning, A/B testing, or more generally experimentation and decision making will benefit from this talk.
While most background knowledge necessary to follow the talk will be covered, the audience should be familiar with common concepts in ML such as training data, predictive models, multivariate normal distributions, etc.
Optimization of expensive black-box functions is ubiquitous in machine learning and science.
It refers to the problem where we aim to optimize a function (any input–output process) f(x), but we don't know the formula for f and can only observe y = f(x) at the locations x we specify.
Evaluating y = f(x) may also cost a lot of time and money, constraining the number of times we can evaluate f(x).
This problem of expensive black-box optimization is found in many fields such as hyperparameter tuning in machine learning, product recommendation, process optimization in physics, or scientific and drug discovery.
How can we intelligently select the locations x to evaluate the function f at, so that we can identify the point that maximizes the function as quickly as possible?
BayesOpt tackles this question using machine learning and Bayesian probability.
This talk first presents the motivation and fundamentals behind Bayesian optimization (BayesOpt) in an accessible manner.
We discuss Gaussian processes (GPs), the machine learning model used in BayesOpt, and decision-making policies that help us select function evaluations for the goal of optimization.
With the fundamentals covered, we then move on to implementing BayesOpt in practice using the state-of-the-art Python libraries, including GPyTorch for GP modeling and BoTorch for implementing BayesOpt policies.
Finally, we cover special cases in BayesOpt common in the real world, such as when function evaluations may be made in batches (batch optimization), when evaluations have variable costs depending on the input (cost-aware optimization), or when we need to balance between multiple objectives at the same time (multi-objective optimization).
Overall, this talk explains BayesOpt in a friendly way and gets you up and running with the best BayesOpt tools in Python.
# Intended audience
Data scientists and ML practitioners who are interested in hyperparameter tuning, A/B testing, or more generally experimentation and decision making will benefit from this talk.
While most background knowledge necessary to follow the talk will be covered, the audience should be familiar with common concepts in ML such as training data, predictive models, multivariate normal distributions, etc.
By the end of the talk, the audience will:
1. Understand the motivation behind Bayesian optimization (BayesOpt) as an optimization technique.
2. Know the main components of a BayesOpt procedure, including a predictive model (a Gaussian process) and a decision-making policy.
3. Gain practical insights into how to implement BayesOpt in Python.
4. See the various scenarios in which special forms of BayesOpt are found in the real world.
# Detailed outline
**Motivation (3 minutes)**
- Expensive, black-box optimization problems are present in many applications such as hyperparameter tuning, product recommendation, and drug discovery.
- Naïve strategies such as random search and grid search may waste valuable resources inspecting low-performance regions in the search space.
- Bayesian optimization (BayesOpt) provides a method of leveraging machine learning and Bayesian decision theory to automate the search for the global optimum.
**Introducing Bayesian optimization (8 minutes)**
- BayesOpt comprises two main components: a predictive model, commonly a Gaussian process (GP), and a decision-making algorithm called a _policy_.
The BayesOpt policy uses the GP to inform its decisions, while the GP is continually updated by the data collected by the policy, forming a virtuous cycle of optimization.
- Predictions made by a GP come in the form of multivariate normal distributions, which allow us to not only predict but also quantify our _uncertainty_ about those predictions.
This uncertainty quantification is invaluable in this problem of decision-making under uncertainty, where the cost of taking an action is high.
- A BayesOpt policy guides us towards regions in the search space that can help us find the function optimizer more quickly.
There are different BayesOpt policies, each designed with a different motivation.
We will discuss a wide range of policies ranging from improvement-based policies, policies from the multi-armed bandit problem, to policies that leverage information theory.
**Implementing Bayesian optimization in Python (8 minutes)**
- BayesOpt can be implemented in Python using a cohesive ecosystem of PyTorch for tensor manipulation, GPyTorch for implementing GPs, and BoTorch for implementing BayesOpt policies.
- GPyTorch makes implementing GPs straightforward and painless.
With GPyTorch, we can flexibly customize the components of a GP, scale GPs to large data sets, and even combine a GP with a neural network.
- BoTorch offers modular implementations of popular BayesOpt policies.
We will see that once we have defined a BayesOpt optimization loop, swapping different policies in and out is easy to do.
- Overall, the three libraries allow us to implement BayesOpt in a streamlined manner and get BayesOpt up and running in no time.
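GPyTorch and BoTorch condense this loop into a few calls; as a library-free illustration of the structure the talk covers (the toy objective, candidate grid, kernel length scale, noise jitter, and UCB coefficient are all invented for the example, not from the talk), a sketch:

```python
import math

def rbf(a, b, ls=1.0):
    # RBF (squared-exponential) kernel, the default covariance in most GP libraries.
    return math.exp(-0.5 * ((a - b) / ls) ** 2)

def solve(A, b):
    # Gaussian elimination with partial pivoting (fine for tiny systems).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(xs, ys, x, noise=1e-6):
    # GP posterior mean/std at x given observations (xs, ys), zero prior mean.
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x) for a in xs]
    alpha = solve(K, ys)                     # K^-1 y
    v = solve(K, k)                          # K^-1 k
    mean = sum(ki * ai for ki, ai in zip(k, alpha))
    var = max(rbf(x, x) - sum(ki * vi for ki, vi in zip(k, v)), 0.0)
    return mean, math.sqrt(var)

def f(x):                                    # stand-in "expensive black box"
    return -(x - 2.0) ** 2

grid = [i / 10 for i in range(0, 41)]        # candidate locations in [0, 4]
xs, ys = [0.0, 4.0], [f(0.0), f(4.0)]        # two initial observations
for _ in range(6):                           # the BayesOpt loop, UCB policy
    def ucb(x):
        m, s = gp_posterior(xs, ys, x)
        return m + 2.0 * s                   # explore-exploit trade-off
    x_next = max((x for x in grid if x not in xs), key=ucb)
    xs.append(x_next)
    ys.append(f(x_next))

best = max(zip(ys, xs))                      # best (value, location) found
print(best)
```

The loop homes in on the true optimizer at x = 2 with only a handful of evaluations; in practice the hand-rolled linear algebra and grid search are exactly what GPyTorch and BoTorch replace.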
**Bayesian optimization in the real world (8 minutes)**
- The sequential nature of standard BayesOpt (choose one data point, observe its value, and repeat) is not applicable to all real-world scenarios.
We explore special variants of BayesOpt that are common in the real world.
- Batch BayesOpt is the setting where multiple function evaluations can be made at the same time.
We will discuss strategies of extending single-query BayesOpt policies to the batch setting.
- Different function evaluations can incur different querying costs.
We will explore ways of incorporating querying costs into decision-making and develop cost-sensitive policies.
- Many real-life scenarios require optimizing more than one objective at the same time, making up multi-objective optimization problems.
We will see how BayesOpt tackles this setting by trading off the multiple objective functions at the same time.
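In the multi-objective setting, the target is no longer a single optimum but the Pareto front of non-dominated trade-offs. As a small, library-free illustration (the two objectives and the toy evaluation values are invented for the example):

```python
def is_dominated(p, others):
    # p is dominated if some other point is at least as good in every
    # objective and strictly different (maximization in all objectives).
    return any(all(o >= v for o, v in zip(q, p)) and q != p for q in others)

def pareto_front(points):
    # Keep only the non-dominated points.
    return [p for p in points if not is_dominated(p, points)]

# Toy evaluations of two competing objectives, e.g. (accuracy, -latency).
points = [(0.90, -120), (0.85, -60), (0.92, -300), (0.80, -50), (0.84, -70)]
front = pareto_front(points)
print(front)  # (0.84, -70) is dominated by (0.85, -60) and drops out
```

Multi-objective BayesOpt policies steer evaluations toward expanding this front rather than toward a single maximizer.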
**Q&A (3 minutes)**
Quan Nguyen
2022-12-03T20:00:00+00:00
20:00
00:30
Talk Track I
cfp-137-deep-into-the-tweet
https://global2022.pydata.org//cfp/talk/YU8VEJ/
false
Deep Into the Tweet
Talk
en
Let’s scratch the surface of the Twitter metadata together and then go below it with tweepy. Want to find out if the tweets you follow are trying to persuade you to do things? Have the feeling that advocates for some issues use certain emotions to push you in certain directions? Now you can find out.
Delve deep into the tweet with me: I’ll take you on a journey through mining the Twitter API, how the Twitter metadata is structured and how to reach a specific field, and analyzing text, emojis, emotions, and more. An introduction to a variety of Python packages. “I show you how deep the rabbit hole goes.” (The Matrix, Morpheus) Whether or not you go down is your choice. A practical lecture; don’t worry if you don’t follow all the code lines - access to the lecture notebooks is promised.
Dina Bavli
2022-12-03T12:00:00+00:00
12:00
01:30
Talk Track II
cfp-308-lightning-talks
https://global2022.pydata.org//cfp/talk/AWYSLG/
false
Lightning Talks
Lightning Talk
en
<b>Lightning Talks are short 5-10 minute sessions presented by community members on a variety of interesting topics.</b>
<b>Order of Presentations</b>
1. Utilizing Word Embeddings and Gradient Boosting to identify, analyze, prevent and predict machine errors in the Computer Aided Manufacturing Industry by Aadit Kapoor
2. Think Inside the Box(es): Excel-Hosted Dashboards With Python Graphics by Ted Conway
3. Introducing Tilburg Science Hub: An Open Source Platform for Computational Social Science by Shrabastee Banerjee, Roshini Sudhaharan
4. Tour of UnionML: An Open Source framework for Building Machine Learning Microservices by Shivay Lamba
5. Robyn: An asynchronous Python web framework with a Rust runtime by Shivay Lamba
6. Bessel's Correction: Effects of (n-1) as the denominator in Standard deviation by SARADINDU SENGUPTA
7. Python for the Unsolvable: Machine Learning Applications in Chaos Theory by Srivatsa Kundurthy
8. Dask Powering lower end PC by Kefentse Mothusi
9. How to format strings for logging in Python by Lutz Ostkamp
10. Use pandas in tidy style by Srikanth
Aadit Kapoor, Ted Conway, Roshini Sudhaharan, Shrabastee Banerjee, Shivay Lamba, SARADINDU SENGUPTA, Srivatsa Kundurthy, Kefentse Mothusi, Lutz Ostkamp, Srikanth
2022-12-03T15:00:00+00:00
15:00
00:30
Talk Track II
cfp-317-navigating-career-adjustments-in-times-of-uncertainty
https://global2022.pydata.org//cfp/talk/M3KSXT/
false
Navigating Career Adjustments in Times of Uncertainty
Talk
en
Throughout the COVID pandemic, we’ve experienced extremes brought on by economic downturns and uncertainty across industries—to this day, we are feeling these effects around the globe. In fact, statistics show that many professionals have changed careers following the waves of layoffs that have recently occurred—but how? How can we best prepare for this type of situation, and how easy or difficult is it to change careers? If these questions have been on your mind, join this session to learn about several global industry trends, ways to adapt to career changes, and how to grow your tech skills and leverage certain platforms to support your learning process.
Throughout the COVID pandemic, we’ve experienced extremes brought on by economic downturns and uncertainty across industries—to this day, we are feeling these effects around the globe. In fact, statistics show that many professionals have changed careers following the waves of layoffs that have recently occurred—but how? How can we best prepare for this type of situation, and how easy or difficult is it to change careers? If these questions have been on your mind, join this session to learn about several global industry trends, ways to adapt to career changes, and how to grow your tech skills and leverage certain platforms to support your learning process.
Jose Mesa
2022-12-03T15:30:00+00:00
15:30
00:30
Talk Track II
cfp-276-discover-inspirational-insights-in-motivational-sports-speeches-using-speech-to-text
https://global2022.pydata.org//cfp/talk/7VWGNV/
false
Discover Inspirational Insights in Motivational Sports Speeches Using Speech-to-Text
Talk
en
Inspirational sports speeches have motivated and reinvigorated folks for years. Whether you’re a developer or an athlete, they’ve withstood the journey because even the smartest, the bravest, and the most resilient need some encouragement on occasion.
During our time together, we’ll use Python and a speech-to-text provider to transcribe sports podcasts that contain inspirational speeches. We’ll discover insights from the transcripts to determine which ones might give you a boost of energy or rally your team.
We’ll discover common topics of each sports podcast episode and measure how they leave us feeling: victorious or perhaps overcoming the agony of defeat. We’ll also investigate if there are any similarities and differences in the sports speeches and what makes a great motivational speech that moves people to action.
By the end, you’ll have a better understanding of using speech recognition in real-world scenarios and using features of Machine Learning with Python to derive insights.
This talk is for developers of all levels, including beginners.
Inspirational sports speeches have motivated and reinvigorated folks for years. Whether you’re a developer or an athlete, they’ve withstood the journey because even the smartest, the bravest, and the most resilient need some encouragement on occasion.
During our time together, we’ll use Python and a speech-to-text provider to transcribe sports podcasts that contain inspirational speeches. We’ll discover insights from the transcripts to determine which ones might give you a boost of energy or rally your team.
We’ll discover common topics of each sports podcast episode and measure how they leave us feeling: victorious or perhaps overcoming the agony of defeat. We’ll also investigate if there are any similarities and differences in the sports speeches and what makes a great motivational speech that moves people to action.
By the end, you’ll have a better understanding of using speech recognition in real-world scenarios and using features of Machine Learning with Python to derive insights.
This talk is for developers of all levels, including beginners.
Tonya Sims
2022-12-03T16:00:00+00:00
16:00
00:30
Talk Track II
cfp-315-critical-cv-nlp-data-errors-and-how-to-fix-them-with-galileo
https://global2022.pydata.org//cfp/talk/83CCLU/
false
Critical CV/NLP Data Errors and How to Fix Them with Galileo
Talk
en
Bad data is likely the largest factor limiting your model's performance. We'll talk about common data errors and how you can fix them today using Galileo. Although the majority of examples used will be in CV and NLP, the same insights apply to other modalities!
Bad data is likely the largest factor limiting your model's performance. We'll talk about common data errors and how you can fix them today using Galileo. Although the majority of examples used will be in CV and NLP, the same insights apply to other modalities!
Nikita Demir
2022-12-03T16:30:00+00:00
16:30
00:30
Talk Track II
cfp-230-imf-data-discovery-and-collection
https://global2022.pydata.org//cfp/talk/WMKJ8F/
false
IMF Data Discovery and Collection
Talk
en
The International Monetary Fund (IMF) provides a huge variety of economic datasets from different countries. We have explored the Python API for data extraction from the IMF, which allows users (primarily economists or financial analysts) to access the data. The structure of the underlying JSON datasets is quite complex for an unprepared user. In the talk, we will demonstrate the API workflow and go over the issues that led us to design a new, easier-to-use API, which is currently in development. This is joint work with Dr. Sou-Cheng Choi (Illinois Institute of Technology and SAS Institute Inc.).
The talk is primarily directed at data analysts and economists interested in utilizing IMF's macroeconomic data.
The International Monetary Fund (IMF) is an international organization that provides financial assistance and advice to member countries. Out of 195 countries in the world, 190 are members of the IMF. Apart from advising services, the IMF collects large amounts of data on various economic indices from its member countries. The data can be accessed using a web interface or a Python API. In the summer of 2022, I worked on an internship with Prof. Sou-Cheng Choi at the Illinois Institute of Technology. While using the Python API, we realized that it is not exactly intuitive for first-time users, for whom it could easily take more than a few days to figure out the right way to download a target economic time series. The main issue is that the data is stored in layers of datasets called series, with each series containing multiple dimensions. For example, to find a country's Consumer Price Index (CPI), one would need to first discover the correct names of the containing series and dimension, followed by a text search of well-selected keywords. We will demonstrate the API so that more people can access the data. Currently, we are designing a new approach to pulling data from the IMF, which we believe will be more intuitive, especially for beginning or non-technical users such as data scientists. This presentation should be of interest to anyone who has ever had to work with international economic data. To demonstrate the API, we will be using Python and Jupyter Notebooks.
Link to the description of the API: https://datahelp.imf.org/knowledgebase/articles/1968408-how-to-use-the-api-python-and-r
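The series-then-dimensions discovery flow described above can be sketched without touching the network. The catalog below is a simplified, hypothetical stand-in for the layered IMF structure (the series id `IFS` and indicator codes are used for illustration only; the real metadata is richer and fetched via the API):

```python
# Hypothetical, simplified stand-in for the layered IMF catalog:
# series are discovered by id/name, and each series exposes dimensions.
CATALOG = {
    "IFS": {
        "name": "International Financial Statistics",
        "dimensions": {
            "FREQ": ["A", "Q", "M"],
            "REF_AREA": ["US", "DE", "JP"],
            "INDICATOR": ["PCPI_IX", "NGDP_SA_XDC"],
        },
    },
}

def find_series(keyword):
    # The text-search discovery step described in the abstract.
    kw = keyword.lower()
    return [sid for sid, s in CATALOG.items()
            if kw in sid.lower() or kw in s["name"].lower()]

def build_key(series, **dims):
    # Assemble the dotted dimension key a data request would use,
    # validating each value against the series metadata first.
    meta = CATALOG[series]["dimensions"]
    parts = []
    for dim in ("FREQ", "REF_AREA", "INDICATOR"):
        value = dims[dim]
        if value not in meta[dim]:
            raise ValueError(f"{value!r} is not a valid {dim} for {series}")
        parts.append(value)
    return f"{series}/{'.'.join(parts)}"

print(find_series("financial"))
print(build_key("IFS", FREQ="M", REF_AREA="US", INDICATOR="PCPI_IX"))
```

An easier-to-use API essentially hides this two-step dance (discover the series and dimensions, then assemble a key) behind a single, friendlier call.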
Irina Klein
2022-12-03T17:00:00+00:00
17:00
00:30
Talk Track II
cfp-126-modern-analytics-in-the-cloud-a-case-for-fraud-detection
https://global2022.pydata.org//cfp/talk/8UMQRV/
false
Modern Analytics in the Cloud - A case for fraud detection
Talk
en
There’s a growing interest from small and large companies alike to move their data and their analytical pipelines into the Cloud as it adds large cost and operational benefits to businesses. Despite this, it can be unclear and sometimes confusing to know how cloud services can be used to replicate your existing analytical solutions in the Cloud or even how services can fit together to build new solutions.
The goal of this talk is to help answer these two questions. First by explaining what modern analytics look like in cloud environments and then by presenting a live use case for building an end-to-end analytical solution in the context of fraud detection for E-commerce businesses.
This talk assumes familiarity with some areas, such as the Hadoop ecosystem and commonly used tools like Airflow, Kafka, and Spark (an overall idea will be more than sufficient), as well as some experience with building and deploying machine learning models (some MLOps experience). Therefore, the target audience would be data scientists/engineers with 4-5 years of experience working in analytics, and/or architects who are looking to move their analytics solutions to the Cloud but are still unsure how everything fits together.
At the end of the talk, the audience will have a clear understanding of how modern analytics can be performed in the cloud and what a typical modern data architecture looks like. In the context of AWS, the audience will also have an understanding of the AWS analytics service offerings and what services can be used for/tailored to their needs. Finally, the audience will gain a clearer idea of how they can leverage ML capabilities to build a full pipeline in the cloud while cutting their development time by half.
The proposed outline for the talk will follow the description below:
The evolution of analytics from the 90s to current day (2-3 mins)
Modern analytics in the Cloud - what’s available (4-5 mins)
How analytics is done in the Cloud - tools to help manage the cloud solutions (5 mins)
Case study - Fraud Detection for Ecommerce (2-3 mins)
Refresher concepts (3 mins)
Breaking down the architecture (6-7 mins)
Scaling and improving the solution (5-6 mins)
The goal of the first half of the talk is to provide the audience with a solid understanding of what analytics looks like in the Cloud (specifically AWS). We'll go over the analytical services available and the use case for each one to give you a better idea of how they can be used. The goal of the second half of the talk is to walk through a real case study: building a fraud detection model to detect fraudulent transactions for an E-commerce business. The architecture built will be explained and additional improvements to it will be discussed.
Marwa Ahmed
2022-12-03T17:30:00+00:00
17:30
00:30
Talk Track II
cfp-65-huggingface-ray-air-integration-a-python-developer-s-guide-to-scaling-transformers
https://global2022.pydata.org//cfp/talk/NZUYLM/
false
HuggingFace + Ray AIR Integration: A Python developer’s guide to scaling Transformers
Talk
en
Hugging Face Transformers is a popular open-source project with cutting edge Machine Learning (ML), but meeting the computational requirements for advanced models it provides often requires scaling beyond a single machine. In this session, we explore the integration between Hugging Face and Ray AI Runtime (AIR), allowing users to scale their model training and data loading seamlessly. We will dive deep into the implementation and API and explore how we can use Ray AIR to create an end-to-end Hugging Face workflow, from data ingest through fine-tuning and HPO to inference and serving.
Hugging Face Transformers is a popular open-source project with cutting edge Machine Learning (ML), but meeting the computational requirements for advanced models it provides often requires scaling beyond a single machine. In this session, we explore the integration between Hugging Face and Ray AI Runtime (AIR), allowing users to scale their model training and data loading seamlessly. We will dive deep into the implementation and API and explore how we can use Ray AIR to create an end-to-end Hugging Face workflow, from data ingest through fine-tuning and HPO to inference and serving.
The computational and memory requirements for fine-tuning and training these models can be significant. To deal with this issue, the Ray team has developed a Hugging Face integration for Ray AI Runtime (AIR), allowing Transformers model training to be easily parallelized across multiple CPUs or GPUs in a Ray Cluster, saving time and money, all the while allowing to take advantage of the rich Ray ML ecosystem thanks to common API.
In this session, we explore the integration between Hugging Face and Ray AIR, allowing users to scale their model training and data loading seamlessly. We will dive deep into the implementation and API and explore how we can use Ray AIR to create an end-to-end Hugging Face workflow, from data ingest through fine-tuning and HPO to inference and serving.
Key Takeaways:
Python developers and machine learning engineers can use Transformers and scale their language models.
Get exposed to Ray AIR’s Python APIs for end-to-end Hugging Face and ML workflow.
Understand how Ray AIR, built atop Ray, can scale your Python-based ML workloads.
Antoni Baum
2022-12-03T19:00:00+00:00
19:00
01:30
Talk Track II
cfp-309-lightning-talks
https://global2022.pydata.org//cfp/talk/DKQF3R/
false
Lightning Talks
Lightning Talk
en
<b>Lightning Talks are short 5-10 minute sessions presented by community members on a variety of interesting topics.</b>
<b>Order of Presentations</b>
1. Life, Death, and Shopping by Dina Bavli
2. Everybody Is an Influencer– Which Influencer Are You? by Dina Bavli
3. Geometry for Social Good by Colleen M. Farrelly
4. Build a web app in Python with Dash by David Chapuis
Colleen M. Farrelly, David Chapuis, Dina Bavli
2022-12-03T11:00:00+00:00
11:00
02:00
Workshop/Tutorial I
cfp-261-workflows-deep-dive-from-data-engineering-to-machine-learning
https://global2022.pydata.org//cfp/talk/RNVDBT/
false
Workflows Deep Dive: From Data Engineering to Machine Learning
Workshops
en
Programmers, regardless of their level of experience, enjoy solving increasingly complex challenges within their domains of expertise, and one of the main reasons they can spend more time working on different challenges is because of the workflows they put in place around their projects. Data Engineers build pipelines to make sure the company's data is in optimal condition for Analysts to answer business critical questions, for Data Scientists to automate the selection, engineering, and analysis of distinct features before training models, and for machine learning engineers to know where to get data from, or send it to, for the APIs they build. On the other hand, developers automate the infrastructures of software products to reduce the time to market of new features. These groups of data professionals and engineers are not too foreign to each other as they all speak the same language, Python. That said, the goal of this workshop is to dive deep into different workflow patterns for building pipelines for data and machine learning projects. In other words, this workshop bridges the gap between building one-off projects and building automated and reusable pipelines, all while creating an environment that welcomes both newcomers and experts to either the data and machine learning fields or the engineering one.
## Description
In this 2-hour workshop, we'll cover 3 major workflow recipes for data engineering, data analytics and machine learning. While instructions for the workshop will be live, the materials we'll use all throughout will be provided prior to the session in the form of a GitHub repository.
Each section of the workshop will last about 35 minutes, and the topics covered include building an ETL and ELT pipeline, a dashboard, and a machine learning pipeline that takes clean data in, transforms it, and culminates in a local API pointing to a machine learning model.
By the end of the workshop, participants will be able to speak some data engineering to their data analyst colleagues, and some analytics to their machine learning team (slang words included). In addition, they will walk away with different workflow orchestration templates that they can adapt to other projects.
## Audience
The target audience for this session includes analysts of all levels, developers, data scientists and engineers wanting to learn workflow creation and orchestration best practices to increase their productivity with Python, and as programmers in general.
## Format
The tutorial has a 10-minute setup section, three major lessons of ~35 minutes each, and one 7-minute break. In addition, each of the major sections contains some allotted time for exercises that are designed to help solidify the content taught throughout the workshop.
## Prerequisites (P) and Good To Have's (GTH)
- **(P)** Attendees for this tutorial are expected to be familiar with Python (1 year of coding experience would be great).
- **(P)** Participants should be comfortable with loops, functions, list comprehensions, and if-else statements.
- **(GTH)** While it is not necessary to have any knowledge of data- and ML-related libraries, some experience with pandas, NumPy, matplotlib, Metaflow, Dagster, or scikit-learn would be very beneficial throughout this tutorial.
- **(P)** Participants should have at least 5 GB of free space on their computers.
- **(GTH)** While it is not required to have experience with integrated development environments like VS Code or Jupyter Lab, having either of the two, plus Anaconda, installed would be very beneficial for the session.
## Outline
Total time budgeted (including breaks) - 2 hours
1. **Introduction and Setup (~10 minutes)**
- Getting the environment set up. Participants can choose between VS Code or Jupyter Lab and those experiencing difficulties throughout the session will also have the option to walk through the workshop using an isolated environment in Binder.
- Flash instructors intro.
- Motivation for the workshop.
- Workflow Orchestration.
- Quick breakdown of the session.
2. **Recipe 1: Automating your data cleaning pipelines (~35 minutes)**
- Intro to the datasets.
- ETL pipelines with pandas and Dagster.
- Exercise (5-min).
3. **7-minute break**
4. **Recipe 2: Automating Analytical Tools (~35 minutes)**
- Creating a transformation and loading pipeline.
- Creating a dashboard.
- Moving data into a dashboard.
- Exercise (5-min).
5. **Recipe 3: Automating a Machine Learning Pipeline (~35 minutes)**
- Introduction to Metaflow and the dataset.
- Creating, saving, and scheduling flows.
- Exercise (7-min).
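The extract-transform-load shape that the recipes above build out with pandas, Dagster, and Metaflow can be sketched with nothing but the standard library. All names and data below are invented for illustration; each function is the kind of step an orchestrator would treat as a pipeline node:

```python
def extract():
    # Stand-in for reading from an API, database, or file.
    return [{"name": " Ada ", "score": "91"}, {"name": "Grace", "score": "88"}]

def transform(rows):
    # Clean strings and cast types; in a real pipeline each such step
    # becomes a separately schedulable, retryable node.
    return [{"name": r["name"].strip(), "score": int(r["score"])} for r in rows]

def load(rows, sink):
    # Stand-in for writing to a warehouse table; returns rows written.
    sink.extend(rows)
    return len(rows)

warehouse = []
n = load(transform(extract()), warehouse)
print(n, warehouse[0])
```

What an orchestrator like Dagster or Metaflow adds on top of this shape is scheduling, retries, observability, and data passing between nodes, which is precisely what the workshop's recipes demonstrate.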
Ramon Perez
2022-12-03T13:00:00+00:00
13:00
01:30
Workshop/Tutorial I
cfp-203-visually-inspecting-data-profiles-for-data-distribution-shifts
https://global2022.pydata.org//cfp/talk/H3BMVC/
false
Visually Inspecting Data Profiles for Data Distribution Shifts
Tutorial
en
The real world is a constant source of ever-changing and non-stationary data. That ultimately means that even the best ML models will eventually go stale. Data distribution shifts, in all of their forms, are one of the major post-production concerns for any ML/data practitioner. As organizations are increasingly relying on ML to improve performance as intended outside of the lab, the need for efficient debugging and troubleshooting tools in the ML operations world also increases. That becomes especially challenging when taking into consideration common requirements in the production environment, such as scalability, privacy, security, and real-time concerns.
In this talk, Data Scientist Felipe Adachi will discuss different types of data distribution shifts and how these issues can affect your ML application. Furthermore, the speaker will discuss the challenges of enabling distribution shift detection in data in a lightweight and scalable manner by calculating approximate statistics for drift measurements. Finally, the speaker will walk through steps that data scientists and ML engineers can take in order to surface data distribution shift issues in a practical manner, such as visually inspecting histograms, applying statistical tests and ensuring quality with data validation checks.
Requirements: Access to Google Colab Environment
Additional Material: https://colab.research.google.com/drive/1xOcAq8NwPazmQFhXVEvzRxXw5LiFqvfj?usp=sharing
Tutorial Outline
Session 1 - Data Distribution Shift (25min + 5min Q&A)
In this session, we’ll introduce the concept of data distribution shifts, and exactly why this is a problem for ML practitioners. We will cover different types of distribution shifts and how to measure them with popular statistical packages.
This is a theoretical session with hands-on examples.
Session 2 - Facing the Real World (10min + 5min Q&A)
In the real world, we might not always have data readily available as we would like. In this session, we’ll cover several challenges presented by the real world, and how we can leverage data logging with the whylogs package to help us overcome these challenges.
This is a theoretical session with hands-on examples.
Session 3 - Inspecting and Comparing Distributions with whylogs (15min + 10min Q&A)
In this session, we will explore the whylogs’ Visualizer Module and its capabilities, using the Wine Quality dataset as a use-case to demonstrate distribution shifts. We will first generate statistical summaries with whylogs and then visualize the profiles with the visualization module.
This is a Hands-on Notebook Session.
Session 4 - Data Validation (15min + 5min Q&A)
As discussed in previous sessions, data validation plays a critical role in detecting changes in your data. In this session, we will introduce the concept of constraints - ways to express your expectations from your data - and how to apply them to ensure the quality of your data.
This is a Hands-on notebook session.
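The drift measurements mentioned above compare binned statistical summaries of two samples. As a library-free illustration of one common such measurement, the Population Stability Index (this is a generic sketch, not the whylogs API; the bin edges and the 0.1/0.25 rules of thumb are conventional but illustrative):

```python
import math
import random

def psi(expected, actual, edges):
    # Population Stability Index between two samples over shared bins.
    def fractions(sample):
        counts = [0] * (len(edges) - 1)
        for x in sample:
            for i in range(len(counts)):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        total = max(sum(counts), 1)
        return [max(c / total, 1e-6) for c in counts]  # avoid log(0)
    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

random.seed(0)
baseline = [random.gauss(0.0, 1.0) for _ in range(5000)]   # training-time data
same     = [random.gauss(0.0, 1.0) for _ in range(5000)]   # no drift
shifted  = [random.gauss(1.0, 1.0) for _ in range(5000)]   # mean has drifted
edges = [-10, -2, -1, 0, 1, 2, 10]

print(round(psi(baseline, same, edges), 3))     # small: distributions agree
print(round(psi(baseline, shifted, edges), 3))  # large: drift detected
```

The key point for production use is that only the binned summaries (the profiles) need to be stored and compared, never the raw data, which is what makes this approach lightweight and privacy-friendly.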
Felipe de Pontes Adachi
2022-12-03T15:00:00+00:00
15:00
01:30
Workshop/Tutorial I
cfp-179-production-grade-machine-learning-with-flyte
https://global2022.pydata.org//cfp/talk/3SDFWF/
false
Production-grade Machine Learning with Flyte
Tutorial
en
MLOps encapsulates the discipline of – and infrastructure that supports – building and maintaining machine learning models in production. This tutorial highlights five challenges in carrying this out effectively: scalability, data quality, reproducibility, recoverability, and auditability. As a data science and machine learning practitioner, you’ll learn how Flyte, an open source data- and machine-learning-aware orchestration tool, is designed to overcome these challenges, and you'll get your hands dirty using Flyte to build ML pipelines with increasing complexity and scale!
As the discipline of machine learning Operations (MLOps) matures, it’s becoming clear that, in practice, building ML models poses additional challenges compared to the traditional software development lifecycle. This tutorial will focus on five challenges in the context of ML model development: scalability, data quality, reproducibility, recoverability, and auditability. Using Flyte, a data- and machine-learning-aware open source orchestration tool, we’ll see how to address these challenges and abstract them out to give you a broader understanding of how to surmount them.
First I’ll define and describe what these five challenges mean in the context of ML model development. Then I’ll dive into the ways in which Flyte provides solutions to them, taking you through the reasoning behind Flyte’s data-centric and ML-aware design. We'll cover:
- **Flyte tasks and workflows**: the building blocks for expressing execution graphs
- **Dynamic workflows**: for defining execution graphs at runtime
- **Map tasks**: Scale embarrassingly parallel workflows
- **Plugins**: Extend Flyte's core functionality
- **Type System**: See the benefits of static type safety
- **DataFrame Types**: Validate dataframe-like objects at runtime
- **Reproducibility:** Containerize and harden your execution graph
- **Caching**: Don't waste precious compute re-running nodes
- **Recovering Executions**: Build fault-tolerant pipelines
- **Checkpointing**: Checkpoint progress within a node
- **Flyte Decks**: Create rich static reports associated with your data/model artifacts
Attendees will learn how Flyte distributes and scales computation, enforces static and runtime type safety, leverages Docker to provide strong reproducibility guarantees, implements caching and checkpointing to recover from failed model training runs, and ships with built-in data lineage tracking for full data pipeline auditability.
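The caching behavior described above keys a node's output on its inputs plus a cache version, so unchanged nodes are never re-run. The decorator below is a library-free sketch of that idea only; the names are invented and this is not Flyte's actual API:

```python
import functools

_CACHE = {}

def cached_task(version):
    # Re-use a node's output when (task, version, inputs) has been seen
    # before; bumping `version` invalidates old entries, mimicking the
    # cache-versioning idea described above.
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args):
            key = (fn.__name__, version, args)
            if key not in _CACHE:
                _CACHE[key] = fn(*args)
            return _CACHE[key]
        return inner
    return wrap

calls = {"n": 0}

@cached_task(version="1.0")
def train(seed):
    calls["n"] += 1          # stands in for an expensive training run
    return {"seed": seed, "acc": 0.9}

train(42)
train(42)                    # cache hit: the body runs only once
print(calls["n"])            # prints 1
```

A real orchestrator persists this cache across pipeline runs and machines, which is what lets a failed run recover without repeating its expensive upstream nodes.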
Tutorial Repo: https://github.com/flyteorg/flyte-conference-talks/tree/main/pydata-global-2022
Niels Bantilan