PyData Global 2022
Counterfactual explanations (CFE) are methods that
explain a machine learning model by giving an alternate class prediction
of a data point with some minimal changes in its features.
In this talk, we describe a counterfactual (CF)
generation method based on particle swarm optimization (PSO) and how we can gain greater control over the proximity and sparsity properties
of the generated CFs.
This talk is about the approach we've taken at Apache Airflow to manage our dependencies at the scale of a project that is the most popular data orchestrator in the world, consists of ~80 independent packages, and has more than 650 dependencies in total (and how we did not lose our sanity).
When the goal of your study is to analyze and forecast volatility, ARCH/GARCH models come into the picture to solve these complicated time series problems.
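As a hedged illustration (not taken from the talk), fitting a GARCH(1, 1) model with the arch package might look like the sketch below; the file and column names are made up.

    import pandas as pd
    from arch import arch_model

    # Daily percentage returns computed from an illustrative price file.
    prices = pd.read_csv("prices.csv", index_col="date", parse_dates=True)["close"]
    returns = prices.pct_change().dropna() * 100

    model = arch_model(returns, vol="Garch", p=1, q=1)  # conditional-variance specification
    result = model.fit(disp="off")                      # maximum-likelihood estimation
    print(result.summary())

    forecast = result.forecast(horizon=5)               # 5-step-ahead variance forecast
    print(forecast.variance.iloc[-1])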
For enterprises to adopt and embrace AI in their transformational journey, it is imperative to build Trustworthy AI, so that the AI products and solutions that are built, delivered, and acquired are responsible enough to drive trust and wider adoption. We look at AI trust as a function of four key constructs: Reliability, Safety, Transparency, and Responsibility and Accountability. These core constructs are the pillars of driving AI trust in our products and solutions. In this talk, I will explain how to enable each core construct and articulate how they can be measured in some real-world use cases.
Have you ever trained an awesome model just to have it break in production because of a null value? At its core, a feature store needs to provide reliable features to data scientists to build and productionize models. So how can we avoid garbage-in, garbage-out situations? Great Expectations is the most popular library for data validation, so the two are a natural fit. In this talk we will touch briefly upon different Python data validation libraries such as Pydantic and Pandera, but then dive deeper into Great Expectations’ concepts and how you can leverage them in the feature pipelines powering a feature store.
The pandas library is one of the key factors that enabled the growth of Python in the data science industry and continues to help data scientists thrive almost 15 years after its creation. Because of this success, several open-source projects now claim to improve pandas in various ways, whether by bringing it to a distributed computing setting (Dask), accelerating its performance with minimal changes (Modin), or offering a slightly different API that solves some of its shortcomings (Polars).
In this talk we will dive into Polars, a new dataframe library backed by Arrow and Rust that offers an expressive API for dataframe manipulation with excellent performance.
If you are a seasoned pandas user willing to explore alternatives, or a beginner wondering what all the fuss about these new dataframe libraries is, this talk is for you!
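For readers who have never seen it, a minimal sketch of the Polars expression API follows; the file and column names are illustrative, not from the talk.

    import polars as pl

    df = pl.read_csv("trips.csv")
    result = (
        df.filter(pl.col("distance_km") > 0)
          .groupby("passenger_count")
          .agg([
              pl.col("fare").mean().alias("mean_fare"),
              pl.col("distance_km").sum().alias("total_km"),
          ])
          .sort("passenger_count")
    )
    print(result)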
One of the key questions in modern data science and machine learning, for businesses and practitioners alike, is how do you move machine learning projects from prototype and experiment to production as a repeatable process. In this workshop, we present a hands-on introduction to the landscape of production-grade tools, techniques, and workflows that bridge the gap between laptop data science and production ML workflows. Participants will learn how to take common machine learning models, such as those from scikit-learn, XGBoost, and Keras, and productionize them using Metaflow.
We’ll present a high-level overview of the 8 layers of the ML stack: data, compute, versioning, orchestration, software architecture, model operations, feature engineering, and model development. We’ll present a schematic as to which layers data scientists need to be thinking about and working with, and then introduce attendees to the tooling and workflow landscape. In doing so, we’ll present a widely applicable stack that provides the best possible user experience for data scientists, allowing them to focus on parts they like (modeling using their favorite off-the-shelf libraries) while providing robust built-in solutions for the foundational infrastructure.
You can find the companion repository for the workshop here: https://github.com/outerbounds/full-stack-ML-metaflow-tutorial.
Inequality joins are less frequent than equality joins, but they are useful in temporal analytics and even in some conventional applications. Pyjanitor fills this gap in pandas with an efficient implementation.
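A hedged sketch of what such a non-equi join can look like with pyjanitor's conditional_join; the DataFrames and column names are made up for illustration.

    import pandas as pd
    import janitor  # noqa: F401  (registers DataFrame.conditional_join)

    events = pd.DataFrame({"id": [1, 2, 3], "timestamp": [5, 10, 20]})
    windows = pd.DataFrame({"start": [0, 8], "end": [7, 15], "label": ["A", "B"]})

    # Keep every event whose timestamp falls inside a window: start <= timestamp <= end.
    matched = events.conditional_join(
        windows,
        ("timestamp", "start", ">="),
        ("timestamp", "end", "<="),
        how="inner",
    )
    print(matched)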
Are you fascinated by the real-life images or text produced by deep generative models but cannot interpret their underlying data generation process or see how they can be applied to other problems? I will talk about generative simulations built using knowledge of the problem domain that can produce realistic data in a variety of scenarios. This talk will be a Bayesian thinking exercise cum data science case study of product star rating timeseries from an online marketplace (like Amazon.com) – I will show how we use recent advances in likelihood-free Bayesian inference together with a detailed simulation of an online marketplace to directly infer factors involved in how customers purchase and rate products.
Data Centric AI is about iterating on data instead of models to improve machine learning predictions. Why is this trend relevant now? Is this yet another hype in data science? Or has something really changed? And most of all -- how is this relevant to you?
It’s crunchy! It’s sweet! Maybe it is the presence of the nuts or their absence. There are various features that make you favor a particular cereal. Now surely, if we modeled the consumer ratings for cereals, some features would be considered more important than others. After all, feature engineering is one of the most critical steps in modeling. But after the model is up and running, what if we tweak the features just to see how much meddling can affect the preference? This process is called post-hoc feature attribution and it seeks to interpret the model behavior. In this talk, let us spoon through the interpretability of ML models.
This session will discuss scaling your PyTorch models on TPUs. We’ll also cover an overview of ML accelerators and distributed training strategies. We’ll cover training on TPUs from beginning to end, including setting them up, TPU architecture, frequently faced issues, and debugging techniques. You’ll learn about the experience of using the PyTorch XLA library and explore best practices for getting started with training large-scale models on TPUs.
Build data pipelines using Trino and dbt, combining heterogeneous data sources without having to copy everything into a single system. Manage access to your data products using modern and flexible security principles from authentication methods to fine-grained access control. Run and monitor your data pipelines using Dagster.
Hello wait you talk see to can’t all my in!
Sounds weird, right?! Detecting abnormal sequences is a common problem.
Join my talk to see how this problem involves BERT, Word2vec, and autoencoders, and learn how you can apply it to information security as well.
Meaningful probabilistic models do not only produce a “best guess” for the target, but also convey their uncertainty, i.e., a belief in how the target is distributed around the predicted estimate. Business evaluation metrics such as mean absolute error, a priori, neglect that unavoidable uncertainty. This talk discusses why and how to account for uncertainty when evaluating models using traditional business metrics, using python standard tooling. The resulting uncertainty-aware model rating satisfies the requirements of statisticians because it accounts for the probabilistic process that generates the target. It should please practitioners because it is based on established business metrics. It appeases executives because it allows concrete quantitative goals and non-defensive judgements.
sktime is a widely used scikit-learn compatible library for learning with time series. sktime is easily extensible by anyone, and interoperable with the pydata/numfocus stack. sktime has a rich framework for building pipelines across multiple learning tasks that it supports, including forecasting, time series classification, regression, clustering. This tutorial explains basic and advanced sktime pipeline constructs, and introduces in detail the time series transformer which is the main component in all types of pipelines. It is a continuation of the sktime introductory tutorial at pydata global 2021.
Numerous tools generate "explanations" for the outputs of machine-learning models and similarly complex AI systems. However, such “explanations” are prone to misinterpretation and often fail to enable data scientists or end-users to assess and scrutinize “an AI.” We share best practices for implementing “explanations” that their human recipients understand.
This talk will show you how to build papermill plugins. As motivating examples, we'll describe how to customize papermill for notebook debugging and profiling.
Data is everywhere. It is through analysis and visualization that we are able to turn data into information that can be used to drive better decision making. Out-of-the-box tools will allow you to create a chart, but if you want people to take action, your numbers need to tell a compelling story. Learn how elements of storytelling can be applied to data visualization.
In this talk, I’d be talking about Zarr, an open-source data format for storing chunked, compressed N-dimensional arrays. This talk presents a systematic approach to understanding and implementing Zarr by showing how it works, the need for using it, and a hands-on session at the end. Zarr is based on an open technical specification, making implementations across several languages possible. I’d be mainly talking about Zarr’s Python implementation and would show how it beautifully interoperates with the existing libraries in the PyData stack.
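A minimal sketch of creating and reading a chunked, compressed Zarr array (path, shape, and chunking are illustrative):

    import numpy as np
    import zarr

    # A 10,000 x 10,000 float32 array stored on disk in 1,000 x 1,000 chunks.
    z = zarr.open("example.zarr", mode="w", shape=(10_000, 10_000),
                  chunks=(1_000, 1_000), dtype="f4")
    z[:1_000, :1_000] = np.random.random((1_000, 1_000))  # writes a single chunk

    z2 = zarr.open("example.zarr", mode="r")  # reopen lazily; only touched chunks are read
    print(z2.shape, z2.chunks, z2[0, :5])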
Correlation does not imply causation. It turns out, however, that with some simple ingenious tricks one can unveil causal relationships within standard observational data, without having to resort to expensive randomised controlled trials. Learn how to make the most of your data, avoid misinterpretation pitfalls and draw more meaningful conclusions by adding causal inference to your toolbox.
The talk includes a presentation of Crowd-Kit, an open-source computational quality control library, followed by a demonstration.
Crowdsourced annotations in most cases require post-processing due to their heterogeneous nature; raw data contains errors, is biased, and is non-trivial to combine. Crowd-Kit provides various methods, such as aggregation, uncertainty, and agreement measures, which can be used as helping tools for getting an interpretable result out of data labeled with the help of crowdsourcing.
Getting your team to choose good projects, reliably derisk them, research ideas, productionise the solutions and create positive change in an organisation is hard. Really hard.
I'll present patterns that work for these 5 critical project stages. This guidance is based on 15 years of experience writing AI and DS solutions and 5 years giving both strategic guidance and training on how to get to success.
You'll come away from the session with new techniques to help your team deliver successfully and increase their confidence in the roadmap, new thoughts on how to diagnose your model's quality, and new ideas for making a positive difference in your organisation.
Ada is the Founder of She Code Africa (SCA).
In today’s digital age, we use machine learning (ML) and artificial intelligence (AI) to solve problems and improve productivity and efficiency. Yet, there’s risk in delegating decision-making power to algorithmically based systems: their workings are often opaque, turning them into uninterpretable “black boxes.”
Bytewax is an open-source, Python-native framework and distributed processing engine for processing data streams that makes it easy to build everything from pipelines for anonymizing data to more sophisticated systems for fraud detection, personalization, and more. For this tutorial, we will cover how you can use Bytewax and the Python library River to build an online machine learning system that will detect anomalies in IoT data from streaming systems like Kafka and Redpanda. This tutorial is for data scientists, data engineers, and machine learning engineers interested in machine learning and streaming data. At the end of the tutorial session you will know how to:
- run a streaming platform like Kafka or Redpanda in a Docker container,
- develop a Bytewax dataflow, and
- run a River anomaly detection algorithm to detect anomalous data (a minimal sketch of this step is shown below).
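As a hedged preview of that step (the feature dictionary and threshold are illustrative, and in the tutorial the records arrive from the Bytewax dataflow rather than a plain list):

    from river import anomaly

    detector = anomaly.HalfSpaceTrees(seed=42)

    stream = [{"temperature": 21.0}, {"temperature": 21.3}, {"temperature": 98.7}]
    for x in stream:
        score = detector.score_one(x)  # anomaly score for the incoming record
        detector.learn_one(x)          # update the model online
        if score > 0.7:                # illustrative threshold
            print("possible anomaly:", x, round(score, 3))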
The tutorial material will be available via a GitHub Repo and the content will be covered in roughly the timeline shown below.
- 0-10min - Introduction to stream processing and online machine learning
- 10-30min - Setup streaming system and prepare the data
- 30-60min - Write the Bytewax dataflow and anomaly detector code
- 60-90min - Tune the anomaly detector and run the Bytewax dataflow successfully.
This tutorial is a hands-on introduction to Bayesian Decision Analysis (BDA), which is a framework for using probability to guide decision-making under uncertainty. I start with Bayes's Theorem, which is the foundation of Bayesian statistics, and work toward the Bayesian bandit strategy, which is used for A/B testing, medical tests, and related applications. For each step, I provide a Jupyter notebook where you can run Python code and work on exercises. In addition to the bandit strategy, I summarize two other applications of BDA, optimal bidding and deriving a decision rule. Finally, I suggest resources you can use to learn more.
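As a taste of where the tutorial ends up, here is a minimal Thompson-sampling sketch of the Bayesian bandit strategy, with Beta posteriors over each arm's conversion rate (the observed counts are made up):

    import numpy as np

    rng = np.random.default_rng(0)

    # Successes and failures observed so far for two variants (A/B-test style).
    successes = np.array([12, 20])
    failures = np.array([88, 80])

    # Draw once from each arm's Beta(1 + s, 1 + f) posterior and play the best arm.
    samples = rng.beta(1 + successes, 1 + failures)
    chosen_arm = int(np.argmax(samples))
    print("posterior draws:", samples.round(3), "-> choose arm", chosen_arm)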
Executives at PyData is a facilitated discussion session for leaders on the challenges around designing and delivering successful projects, organizational communication, product management and design, hiring, and team growth.
Recently, Sam Gross, the author of the nogil fork of Python 3.9, demonstrated that the GIL can be removed. For scientific programs that run heavy CPU-bound workloads, this could be a huge performance improvement. In this talk, we will see whether this is true and compare the nogil version against the original.
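A minimal sketch of the kind of CPU-bound benchmark such a comparison rests on: on stock CPython the threaded version gains little because of the GIL, while on the nogil build it should scale with the number of cores. Standard library only.

    import time
    from concurrent.futures import ThreadPoolExecutor

    def busy(n: int) -> int:
        total = 0
        for i in range(n):
            total += i * i
        return total

    N, WORKERS = 5_000_000, 4

    start = time.perf_counter()
    for _ in range(WORKERS):
        busy(N)
    print("sequential:", round(time.perf_counter() - start, 2), "s")

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        list(pool.map(busy, [N] * WORKERS))
    print("threaded:  ", round(time.perf_counter() - start, 2), "s")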
Dask is a framework for parallel computing in Python.
It's great, until you need to set it up.
Kubernetes? Cloud? HPC? SSH? YARN/Hadoop even?
What's the right deployment technology to choose?
After you set it up a new set of problems arise:
- How do you install software across the cluster?
- How do you secure network access?
- How do you access secure data that needs credentials?
- How do you track who uses it and constrain costs?
- When things break, how do you track them down?
There exist solutions to these problems in open source packages like dask-kubernetes, helm charts, dask-cloudprovider, and dask-gateway, as well as commercially supported products like Coiled, Saturn, QHub, AWS EMR, and GCP Dataproc. How do we choose?
This talk describes the problem faced by people trying to deploy any distributed computing system, and tries to construct a framework to help them make decisions on how to deploy.
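Whatever the deployment target, the entry point looks roughly the same; a minimal local sketch (cluster sizes are illustrative, and LocalCluster would be swapped for dask-kubernetes, dask-gateway, etc. in production):

    import dask.array as da
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    client = Client(cluster)

    x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
    print(x.mean().compute())  # work is scheduled across the local workers

    client.close()
    cluster.close()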
The Jupyter ecosystem has been undergoing many changes in the past few years. While JupyterLab has been embraced by many, there are still many active users of Jupyter Notebook. With that in mind, Jupyter developers have been gearing up for the release of the updated Notebook 7 based on JupyterLab components, as outlined in Jupyter Enhancement Proposal #79. With this, significant changes are coming to Notebook 6: the upcoming Notebook 6.5 is intended to be its end-of-life release, and users installing Notebook will soon receive a version of the project that may disrupt their workflows. In an effort to give users time to transition to the updated codebase, the NbClassic project has been introduced. NbClassic is the Jupyter Server extension implementation of the classic notebook. NbClassic has also become the owner of the static assets for the classic notebook, and Notebook 6.5 depends on NbClassic to provide them.
The aim of this talk is to:
1. Reflect on the changes to the Jupyter ecosystem with the introduction of NbClassic and Notebook 7.
2. Address some questions that may come up about NbClassic and Notebook 6.5, as well as some of those that may come up once Notebook 7 is released.
3. Showcase, with a demo, how easily users can work with the different front-ends: NbClassic, Notebook 7, and JupyterLab.
Domain experts often need to create text classification models; however, they may lack the ML or coding expertise to do so. In this talk, we show how domain experts can create text classifiers without writing a single line of code through the open-source, no-code Label Sleuth system (www.label-sleuth.org), a system that combines an intuitive labeling UI with active learning techniques and integrated model training functionality. Finally, we describe how the system can also benefit more technical users, such as data scientists and developers, who can customize it for more advanced usage.
We’re on a global mission to make open source software thrive and be more sustainable—from supporting open source contributors in their career paths with our Open Source Professional Network (OSPN) to helping organizations transform their business with support from our vetted network of enterprise solution architects (ESA Network) to helping our clients select the right open source software stack for their business challenge by leveraging our AI-driven scoring system. Please join us during Sponsor Open Hours to learn more and ask us anything about open source.
Pandas’ current behavior on whether indexing returns a view or copy is confusing, even for experienced users. But it doesn’t have to be this way. We can make this aspect of pandas easier to grasp by simplifying the copy/view rules, and at the same time make pandas more memory-efficient. And get rid of the SettingWithCopyWarning.
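A hedged sketch of the direction described here: with copy-on-write enabled (an opt-in, experimental pandas option at the time of writing), any derived object behaves as a copy, so the view-versus-copy ambiguity behind SettingWithCopyWarning disappears.

    import pandas as pd

    pd.set_option("mode.copy_on_write", True)  # assumed opt-in flag; still experimental

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    subset = df[df["a"] > 1]
    subset["b"] = 0  # modifies only the copy; df is left untouched
    print(df)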
PyScript has brought change to the Python and PyData ecosystem, making it much easier to execute Python in the browser and opening the road to multiple possibilities that were previously not possible. The talk will explore what has happened since we first presented it and discuss how PyScript can change the way we do data science and many other things.
AI is the future of software development
Many Python data professionals work daily in JupyterLab or Notebook instances. What can a hacker do with access to that system? In this presentation, I will introduce the threat model and show why Jupyter instances are valuable targets. Next, I will demonstrate several post-exploitation activities that someone may try to perform on systems hosting Jupyter instances. We will conclude with some defensive strategies to minimize the likelihood and impact of these activities. This talk will help data scientists and information technology professionals better understand the perspective of potential attackers operating in Jupyter environments to improve defensive awareness and behavior.
Stuck with long-running code that takes too long to complete, if ever? Learn to think strategically about parallelizing your workflows, including the characteristics that make a workflow a good candidate for parallelization as well as the options in python for executing parallelization. The talk eschews PySpark or other big data platforms.
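A minimal sketch of the stdlib-only pattern the talk has in mind: an embarrassingly parallel, CPU-bound loop fanned out with ProcessPoolExecutor (the work function is a placeholder).

    from concurrent.futures import ProcessPoolExecutor

    def simulate(seed: int) -> float:
        # Placeholder for an independent, CPU-heavy unit of work.
        total = 0.0
        for i in range(1, 2_000_000):
            total += (seed % 7 + 1) / i
        return total

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(simulate, range(16)))
        print(len(results), "runs completed")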
The Ray project has shown that having a shared memory facility greatly helps in certain compute problems, particularly where the job can be performed on a single large machine as opposed to a cluster. We present preliminary work showing that Dask can also achieve the same benefits.
Extracting the highly valuable data from unstructured text often results in hard-to-read, brittle, difficult-to-maintain code. The problem is that using regular expressions directly embedded in the program control flow does not provide the best level of abstraction. We propose a query language (based on the tuple relational calculus) that facilitates data extraction. Developers can explicitly express their intent declaratively, making their code much easier to write, read, and maintain.
RStudio recently changed its name to Posit to reflect the fact that we're already a company that does more than just R. Come along to this talk to hear a few of the reasons that we love R, and to learn about some of the open source tools we're working on for Python.
Machine Learning models designed to work with streaming systems make decisions on new data points as they arrive. But there is a downside: model decisions can't be easily changed later when the model is updated with fresher data, user feedback, or freshly tuned hyperparameters. This is often a blocker for anomaly detection, recommender systems, process mining, and human-in-the-loop planning.
To deal with this, we'll demonstrate design patterns to easily express reactive data processing logic. We will use Pathway, a scalable data processing framework built around a Python programming interface. Pathway is battle-tested with operational data in enterprise, including graphs and event streams in real-world supply chains, and is now launching as open-core.
You will leave the talk with a thorough understanding of the practical engineering challenges behind reactive data processing with a Machine Learning angle to it, and the steps needed to overcome these challenges.
Apache Airflow is a foundational component of data platform orchestration at Shopify. In this talk, we'll dive into the many performance and reliability challenges we’ve encountered running Airflow at Shopify’s scale, our custom tooling, and the new multi-instance architecture we rolled out.
Apache Airflow is a foundational component of data platform orchestration at Shopify. Following the main talk, this session is scheduled for you to ask questions and discuss running Airflow at scale with Jean-Martin Archer, Staff Data Engineer at Shopify, and Michael Petro, Data Engineer at Shopify.
Lightning Talks are short 5-10 minute sessions presented by community members on a variety of interesting topics.
All languages are rich in prose and poetry. A lot of the literature is inaccessible because of a lack of understanding of that language. It is often difficult to appreciate a simple translation of a poem due to gaps in cultural knowledge. A poem translated in the style of an author familiar to the reader might help to both add cultural context for the reader and capture the essence of the poem itself.
For video advertisers, precisely hitting their ad performance goals is critical. Undershooting on campaign viewability objectives means spending money on ads that nobody watches, while overshooting them can mean vastly reducing the available ad slots. At JW Player, we combine predictive models with PID controllers to tune decision thresholds and deliver the maximum possible reach to our advertisers while hitting their goals.
Data science practitioners have a saying that 80% of their time gets spent on data prep. Often this involves tools such as Pandas and Jupyter. Graph data science is similar, except the data prep techniques are highly specialized and computationally expensive. Moreover, data prep for graphs is required before commercial tools such as graph databases or visualization can be used effectively. This talk shows examples of data prep for graphs. A progressive example illustrates the challenges plus techniques that leverage open source integrations with the PyData stack: Arrow/Parquet, PSL, Ray, Keyvi, Datasketch, etc.
Introducing a new project, Compute over Data (Bacalhau), to run any computation on decentralized data. No need to move large datasets & all languages/data are supported. If you can run Docker/WASM, you're in the game!
Bacalhau is a decentralized public computation network that takes a job and moves it near where the data is stored, including across a decentralized server network that stores data and runs jobs inside it. Bacalhau runs the job near where the data lives and eliminates data management for the user.
Data science as a professional discipline is still in its infancy, and our field lacks widespread technical norms around project organization, collaboration, and reproducibility. This is painful both for practitioners and their end users because disorganized analysis is bad analysis, and bad analysis costs money and wastes time. This talk presents ten principles for correct and reproducible data science inheriting from software engineering’s seven decades of hard-earned lessons as well as numerous experiences with data science teams at organizations of all sizes. We motivate these principles by looking at some hard truths about data science “in the wild.”
KerasCV offers a complete set of APIs to train your own state-of-the-art,
production-grade object detection model. These APIs include object detection specific
data augmentation techniques, models, and COCO metrics.
This talk covers how to train a RetinaNet on your own dataset using KerasCV.
Where are CA's frequent, high-quality transit corridors? The CA Public Resources Code defines them, but finding them requires continued access to General Transit Feed Specification (GTFS) data and fairly complex geospatial processing. The Integrated Travel Project within Caltrans tackles this by leveraging the combined powers of Dask and Python to make this dataset publicly available and updated monthly on the CA open data portal.
We present BastionAI, a new framework for privacy-preserving deep learning leveraging secure enclaves and Differential Privacy.
We provide promising first results on fine-tuning a BERT model on the SMS Spam Collection Data Set within a secure enclave with Differential Privacy.
The library is available at https://github.com/mithril-security/bastionai.
Visual Studio Code is one of the most popular editors in the Python and data science communities, and the extension ecosystem makes it easy for users to easily customize their workspace for the tools and frameworks they need.
Jupyter notebooks are one such popular tool, and there are some really great features for working in notebooks that can reduce context switching, enable multi-tool workflows, and utilize powerful Python IDE features in notebooks.
This tutorial is geared toward all Jupyter Notebook users who either have an interest in or are regularly using VS Code.
Participants will learn how to use some of the best VS Code features for Jupyter Notebooks, as well as a bunch of other tips and tricks to run, visualize and share your notebooks in VS Code.
Some familiarity with Jupyter Notebooks is required, but experience with VS Code is not necessary.
Materials and sample notebooks for the tutorial will be hosted on GitHub, which participants will be able to launch in their browser in the VS Code editor with GitHub Codespaces with no local setup.
Participants who have VS Code installed locally will also be encouraged to open one of their own notebooks and try out the features as we go along.
We like talking about production – one famous, but probably wrong statement about it is “87% of data science projects never make it to production”.
While giving a talk to a group of up-and-coming data scientists, a question that surprised me came up:
When you say “production”, what exactly do you mean?
Buzzwords are great, but all the cool kids know what production is, right? Wrong.
In this talk, we’ll define what production actually means. I’ll present a first-principles, step-by-step approach to thinking about deploying a model to production. We’ll talk about challenges you might face in each step, and provide further reading if you want to dive deeper into each one.
In hypothesis testing, the stopping criterion for data collection is a non-trivial question that puzzles many analysts. This is especially true with sequential testing, where demands for quick results may lead to biased ones.
I show how the belief that Bayesian approaches magically resolve this issue is misleading and how to obtain reliable outcomes by focusing on sample precision as a goal.
Mostly, people relate artificial intelligence to progress, intelligence, and productivity. But with it come unfair decisions, biases, a human workforce being replaced, and a lack of privacy and security. To make matters worse, a lot of these problems are specific to AI, which indicates that the rules and regulations in place are inadequate to deal with them. Responsible AI comes into play in this situation: it seeks to resolve these problems and establish responsibility for AI systems. In this talk I will cover what Responsible AI is, why it is needed, how it can be implemented, the various frameworks for Responsible AI, and what the future holds.
Data practitioners are typically forced to choose between tools that are either easy to use (pandas) or highly scalable (Spark, SQL, etc.). Modin, an open source project originally developed by researchers at UC Berkeley, is a highly scalable, drop-in replacement for pandas.
This talk will give an overview of Modin and practical examples on how to use it to effortlessly scale up your pandas workflows.
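The drop-in replacement in practice: only the import changes (the file name is illustrative).

    # import pandas as pd
    import modin.pandas as pd

    df = pd.read_csv("large_file.csv")  # read and partitioned in parallel
    summary = df.groupby("category")["amount"].sum()
    print(summary.head())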
Recent advances in natural language processing demonstrate the capability of large-scale language models (such as GPT-3) to solve a variety of NLP problems with zero shots, shifting the paradigm from supervised fine-tuning to prompt engineering/tuning.
Want to create beautiful and complex visualisations of your data with concise code? Look no further than Seaborn, Python’s fantastic plotting library which builds on the hugely popular Matplotlib package. This hands-on tutorial will provide you with all the necessary tools to communicate your data insights with Seaborn.
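A small taste of the concise API the tutorial builds on, using one of Seaborn's bundled example datasets:

    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.set_theme(style="whitegrid")
    tips = sns.load_dataset("tips")

    # One call: bill vs. tip, colored by time of day, one panel per smoker status.
    sns.relplot(data=tips, x="total_bill", y="tip", hue="time", col="smoker")
    plt.show()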
Machine learning models degrade with time. You need to update and retrain them regularly. However, the decision on the maintenance approach is often arbitrary, and the models are simply retrained on a schedule or after every new batch. This can lead to suboptimal performance or wasted resources. In this talk, I will discuss how we can do better: from estimating the speed of the model decay in advance to constructing a proper evaluation set.
There is a rich ecosystem of libraries for Bayesian analysis in Python, and it is necessary to use multiple libraries at the same time to follow a Bayesian workflow, from model creation to presenting results, going through sampling and model checking.
This working session aims to bring together practitioners to discuss and address interoperability issues within the ecosystem. Attendees should expect a hands-on get together where they will meet other Bayesian practitioners with whom to discuss the issues faced and contribute to open source libraries with issues, pull requests and discussions.
Python has many different packages that are useful for working with different kinds of geographical data. This presentation will introduce several of these packages and show you how you can get started working with geolocated information and presenting insights on maps.
OpenSearch is an open source document database with search and aggregation superpowers, based on Elasticsearch. This session covers how to use OpenSearch to perform both simple and advanced searches on semi-structured data such as a product database.
pandas has rapidly become one of the most popular tools for data analysis, but is limited by its inability to scale to large datasets. We developed Modin, a scalable, drop-in alternative to pandas that preserves the dynamic and flexible behavior of pandas dataframes while enhancing scalability.
This talk will walk you through our team’s research at UC Berkeley, which enabled the development of Modin. We’ll also discuss our latest publication at VLDB, which covers a novel approach to parallelization and metadata management techniques for dataframes.
Transformer models became state-of-the-art in natural language processing. Word representations learned by these models offer great flexibility for many types of downstream tasks from classification to summarization. Nonetheless, these representations suffer from certain conditions that impair their effectiveness. Researchers have demonstrated that BERT and GPT embeddings tend to cluster in a narrow cone of the embedding space which leads to unwanted consequences (e.g. spurious similarities between unrelated words). During the talk we’ll introduce SimCSE – a contrastive learning method that helps to regularize the embeddings and reduce the problem of anisotropy. We will demonstrate how SimCSE can be implemented in Python.
Understanding dependencies between features is crucial in the process of developing and interpreting black-box ML models. Mistreating or neglecting this aspect can lead to incorrect conclusions and, consequentially, sub-optimal or wrong decisions leading to financial losses or other undesired outcomes. Many common approaches to explain ML models – as simple as feature importance or more advanced methods such as SHAP – can yield misleading results if mutual feature dependencies are not taken into account.
In this talk we present FACET 2.0 - a new approach for global feature explanations using a new technique called SHAP vector projection, open-sourced at: https://github.com/BCG-Gamma/facet/.
Automatic testing for ML pipelines is hard. Part of the executed code is a model that was dynamically trained on a fresh batch of data, and silent failures are common. Therefore, it’s problematic to use known methodologies such as automating tests for predefined edge cases and tracking code coverage.
In this talk we’ll discuss common pitfalls with ML models, and cover best practices for automatically validating them: What should be tested in these pipelines? How can we verify that they'll behave as we expect once in production?
We’ll demonstrate how to automate tests for these scenarios and introduce a few open-source testing tools that can aid the process.
To develop mature data science, machine learning, and deep learning applications, one must develop a large number of pipeline components, such as data loading, feature extraction, and frequently a multitude of machine learning models.
Sharing and explaining the results of your analysis can be a lot easier and much more fun when you can create an animated story of the charts containing your insights. ipyvizzu-story - a new open-source presentation tool for Jupyter & Databricks notebooks and similar platforms - enables just that using a simple Python interface.
In this workshop, one of the creators of ipyvizzu-story introduces this tool and helps the audience take the first steps in utilizing the power of animation in data storytelling. After the workshop, participants will be able to build and present animated data stories independently.
Identifying the right tools for high-performance machine learning may be overwhelming as the ecosystem continues to grow at break-neck speed. This becomes particularly pronounced when dealing with the ever more popular large language and image generation models such as GPT-2, OPT, and DALL-E, among others. In this session we will dive into a practical showcase where we productionise the large image generation model DALL-E, and we will demonstrate some optimizations that can be introduced, as well as considerations as the use-cases scale. By the end of this session practitioners will be able to run their own DALL-E powered applications, as well as integrate these with functionality from other large language models like GPT-2. We will be leveraging key tools in the Python ecosystem to achieve this, including PyTorch, HuggingFace, FastAPI, and MLServer.
Most organisations have implemented some kind of dashboard to monitor their data, processes, or business. However, many dashboard solutions come with a caveat: licensing costs, a lack of transparency in the workflows, limited creativity, or the inability to connect to existing infrastructure.
This talk is aimed at Data Scientists, Data Engineers, Data Practitioners and Managers struggling with choosing between a myriad of commercial dashboard solutions and DIY. We present how to create your own dashboard using open-source Python technologies like FastAPI, SQLAlchemy, and Celery and the challenges involved. We look back at the pitfalls and solutions we have worked on over the past 3 years. The goal is not to present our unique solution, but to show how we can combine different Python libraries to implement custom solutions to solve different use cases. Attendees should be familiar with the basic concepts of web infrastructure. Previous knowledge of any libraries is not required. We hope to provide a starting point to build your custom dashboard solution using open-source tooling.
Getting predictions from transformer models such as BERT requires two steps: first query the tokenizer, then feed its outputs to the deep learning model itself. These two parts of the model are kept under different class implementations in popular open source libraries like Huggingface Transformers and Sentence-Transformers. This works well within Python, but when one wants to put such a model in production or convert it to more efficient formats like ONNX that may be served from other languages (such as JVM-based ones), it is preferable, simpler, and less risky to have a single artifact that is queried directly. This talk builds on the popular sentence-transformers library and shows how one can transform a sentence-transformer model into a single TensorFlow artifact that can be queried with strings and is ready for serving. At the end of the talk the audience will have a better understanding of the architecture of sentence-transformers and the required steps for converting a sentence-transformer model into a single TensorFlow graph. The code is released as a set of notebooks so that the audience can replicate the results.
Organisations have been increasingly adopting and integrating a non-trivial number of different frameworks at each stage of their machine learning lifecycle. Although this has helped reduce time-to-value for real-world AI use-cases, it has come at the cost of complexity and interoperability bottlenecks.
Numerous scientific disciplines have noticed a reproducibility crisis of published results. While this important topic was being addressed, the danger of non-reproducible and unsustainable research artefacts using machine learning in science arose. The brunt of this has been avoided by better education of reviewers who nowadays have the skills to spot insufficient validation practices. However, there is more potential to further ease the review process, improve collaboration and make results and models available to fellow scientists. This workshop will teach practical lessons that can be directly applied to elevate the quality of ML applications in science by scientists.
Have you ever wondered what it takes to build a production-grade machine learning platform? With so many OSS tools and frameworks, it can get overwhelming at times to make everything work together. In this workshop we will build a production-grade model training, model serving, and model monitoring platform on AWS EKS. Nothing will be local. These ideas can help ML engineers, applied data scientists, and researchers extend them further and develop a holistic picture of building an ML platform on OSS.
Starting a new data science project is an exciting time, full of exotic modeling possibilities and incredible faraway features. However, this ocean of possibilities is treacherous, and the risks of veering off course are numerous.
This talk aims to provide a checklist to help you set a course for your data science project, and keep to it. An industrial project on image pseudo-classification will be used as a working example.
A beginner-friendly guide to running neural networks on microcontrollers: understanding the training pipeline, deployment, and how to update the deployed model.
The Awkward Array project provides a library for operating on nested,
variable length data structures with NumPy-like idioms. We present two
projects that provide native support for Awkward Arrays in the broader
PyData ecosystem. In dask-awkward we have implemented a new Dask
collection to scale up and distribute workflows with partitioned
Awkward Arrays. In awkward-pandas we have implemented a new Pandas
extension array type, making it easy to use Awkward Arrays in Pandas
workflows and enabling massive acceleration in the processing of
nested data. We will show how these projects plug into PyData and
present some compelling use cases.
The virtual chemical universe is expanding rapidly as open-access titan databases such as the Enamine Database (20 billion compounds), Zinc Database (2 billion), and PubMed Database (68 million), along with cheminformatics tools to process, manipulate, and derive new compound structures, are being established. We present our open-source knowledge graph, Global-Chem, written in Python to distribute dictionaries of common chemical lists relevant to different sub-communities out to the general public, i.e., what is inside food? Cannabis? Sex products? Chemical weapons? Narcotics? Medical therapeutics?
To navigate new chemical space we use our data as a reference index to help us keep track of common patterns of interest and to explore new chemicals that could be theoretically real. In our talk, we will present the chemical data, the rules governing the data and its integrity, and how to use our tools to understand the chemical universe with Python.
Machine learning operations (MLOps) are often synonymous with large and complex applications, but many MLOps practices help practitioners build better models, regardless of the size. This talk shares best practices for operationalizing a model and practical examples using the open-source MLOps framework vetiver
to version, share, deploy, and monitor models.
DJ Patil is the former U.S. Chief Data Scientist
In this workshop, attendees will learn how to create data annotation guidelines from a user experience (UX) perspective.
Creating annotation guidelines from a UX perspective means imbuing them with usability, resulting in a better experience for annotators, and more effective and productive annotation campaigns. With Python being at the forefront of Machine Learning and data science, we believe that the Python community will benefit from learning more about the design of data annotation guidelines and why they are essential for creating great machine learning applications.
We want to present Cleora, an open-source tool for creating compact representations of your clients' behavior. Cleora uses graph theory to transform streams of event data into embeddings. These are suitable as input for training models such as churn, propensity, and recommender systems. This talk is useful for anyone who wishes to learn how to work with client event data and how to model client behavior.
Data practitioners use distributed computing frameworks such as Spark, Dask, and Ray to work with big data. One of the major pain points of these frameworks is testability. For testing simple code changes, users have to spin up local clusters, which have a high overhead. In some cases, code dependencies force testing against a cluster. Because testing on big data is hard, it becomes easy for practitioners to avoid testing entirely. In this talk, we’ll show best practices for testing big data applications. By using Fugue to decouple logic and execution, we can bring more tests locally and make it easier for data practitioners to test with low overhead.
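A hedged sketch of that decoupling with Fugue's transform: the same pandas-level function is unit-tested locally and only bound to a distributed engine at run time (column names are illustrative).

    import pandas as pd
    from fugue import transform

    def add_fees(df: pd.DataFrame) -> pd.DataFrame:
        return df.assign(total=df["amount"] * 1.05)

    data = pd.DataFrame({"amount": [10.0, 20.0]})

    # Local test: plain pandas in, plain pandas out, no cluster required.
    local = transform(data, add_fees, schema="*, total:double")
    print(local)

    # The same logic on a cluster, assuming a SparkSession named spark exists:
    # distributed = transform(data, add_fees, schema="*, total:double", engine=spark)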
Nowadays we know that social media and tech giants are collecting tons of data from their users, and most of us agree that the ability of these companies to deliver their suggestions and customization for you is driven by big data.
However, this raises a question: is more data always better? Does more data equal a more accurate model? When do you need big data, and when does it start becoming a bad idea? Let's find out in this panel session.
Production workflows in machine learning have their own requirements compared to DevOps. In this talk, I will present a new library we are developing called "skops" that is built to improve production workflows for scikit-learn models.
This talk will go into how deep learning is changing the world of cheminformatics. We will dive deep into how traditional NLP Transformer models can enable us to perform a seemingly unrelated task such as drug discovery. This talk will give a brief introduction to the field of cheminformatics and then go into detail about how and what kind of Transformers can be utilized for the task at hand.
What if you're a two man machine learning team deploying models to users? What if you don't have a full blown team of Data Engineers working with you? What if nobody around you cares about making that nasty production data available in a pristine feature store? What if you don't even have time to build out your entire Machine Learning platform?
There must be a way to still deliver your ML model to users, right? There must be a way to deliver value.
In this session, I'll talk about how small teams address the problem of delivering ML-value to users. At a reasonable scale. I'll go over some misconceptions and lessons-learned from 4 years working with early-stage startups.
Moving data in and out of a warehouse is both tedious and time-consuming. In this talk, we will demonstrate a new approach using the Snowpark Python library. Snowpark for Python is a new interface for Snowflake warehouses with Pythonic access that enables querying DataFrames without having to use SQL strings, using open-source packages, and running your model without moving your data out of the warehouse. We will discuss the framework and showcase how data scientists can design and train a model end-to-end, upload it to a warehouse and append new predictions using notebooks.
Everyone who codes can save time by reusing configuration — whether for logging in to cloud providers or databases, spinning up Docker containers, or sending notifications. The Prefect open source library provides you with blocks - sharable, reusable, and secure configuration with code. Blocks can be created and edited through the Prefect UI or Python code, allowing for easier collaboration with team members of all skill levels.
Model training is a time-consuming, data-intensive, and resource-hungry phase in machine learning, with much use of storage, CPUs, and GPUs. The data access pattern in training requires frequent I/O of a massive number of small files, such as images and audio files. With the advancement of distributed training in the cloud, it is challenging to maintain the I/O throughput to keep expensive GPUs highly utilized without waiting for access to data. The unique data access patterns and I/O challenges associated with model training compared to traditional data analytics necessitate a change in the architecture of your data platform.
Gabriela de Queiroz is a Principal Cloud Advocate Manager at Microsoft.
NetworkX is the most popular graph/network library in Python. It is easy to use, well documented, easy to contribute to, extremely flexible, and extremely slow for large graphs.
An upcoming release begins to fix that last issue by calling fast GraphBLAS implementations instead of the native Python implementation.
If you use NetworkX or have ever written a graph algorithm, this talk will be of interest to you as it shows how NetworkX is planning on a path of pluggable algorithm libraries so users can opt-in to faster implementations with minimal code changes.
Data pipelines consist of graphs of computations that produce and consume data assets like tables and ML models.
Data practitioners often use workflow engines like Airflow to define and manage their data pipelines. But these tools are an odd fit - they schedule tasks, but miss that tasks are built to produce and maintain data assets. They struggle to represent dependencies that are more complex than “run X after Y finishes” and lose the trail on data lineage.
Dagster is an open-source framework and orchestrator built to help data practitioners develop, test, and run data pipelines. It takes a declarative approach to data orchestration that starts with defining data assets that are supposed to exist and the upstream data assets that they’re derived from.
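A minimal sketch of that declarative style (the asset names are illustrative): each function defines a data asset, and upstream dependencies are wired in simply by naming them as parameters.

    import pandas as pd
    from dagster import asset

    @asset
    def raw_orders() -> pd.DataFrame:
        # In practice this would be loaded from an external system.
        return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.0, 40.0]})

    @asset
    def daily_revenue(raw_orders: pd.DataFrame) -> float:
        # Depends on raw_orders because it names it as an argument.
        return float(raw_orders["amount"].sum())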
Attendees of this session will learn how to develop and maintain data pipelines in a way that makes their datasets and ML models dramatically easier to trust and evolve.
Machine learning algorithms, especially artificial neural networks, are not tolerant of missing data. Many practitioners simply remove records with missing fields without any consideration for the potential statistical bias that might be introduced. The field of imputation has become mature with imputations not only predicting missing values, but reflecting the uncertainty in the prediction. Traditional statistical estimators make use of the full benefits offered by advanced imputation techniques. This tutorial illustrates techniques and architectures that can incorporate advanced imputation techniques into machine learning pipelines including artificial neural networks.
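A minimal sketch of folding model-based imputation into a scikit-learn pipeline instead of dropping incomplete records (the toy data is illustrative):

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [np.nan, 8.0]])
    y = np.array([0, 0, 1, 1])

    model = make_pipeline(
        IterativeImputer(sample_posterior=True, random_state=0),  # imputation reflects uncertainty
        LogisticRegression(),
    )
    model.fit(X, y)
    print(model.predict(X))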
Add to your machine learning arsenal with an introduction to simulation in Python using SimPy! Simulations are increasingly important in machine learning, with applications that include simulating the spread of COVID-19 to make decisions about public policy, vaccination and shutdowns.
You can use simulation to answer questions like, Can you increase profits by adding more tables or staff to your restaurant? You can also use simulation to create data for modeling when it's hard or impossible to get (e.g. simulate purchases in response to promotions on certain products to see if they increase sales).
To benefit from this talk, you'll need to know a small amount of Python, specifically how to write functions and simple classes. No previous knowledge of simulation needed! If you know about simulation in another language and want to see a SimPy example, you can also benefit from this talk. You'll get a Jupyter notebook with a simple but fully worked out example to follow along with and to study on your own time after the conference.
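To make the restaurant question above concrete, here is a tiny SimPy sketch in which tables are a limited resource and arriving parties queue for them (all numbers are illustrative):

    import random
    import simpy

    def diner(env, name, tables):
        arrival = env.now
        with tables.request() as seat:
            yield seat  # wait for a free table
            print(f"{name} waited {env.now - arrival:.1f} min for a table")
            yield env.timeout(random.uniform(20, 40))  # time spent eating

    def arrivals(env, tables):
        for i in range(10):
            env.process(diner(env, f"party {i}", tables))
            yield env.timeout(random.expovariate(1 / 10))  # a new party every ~10 min

    random.seed(1)
    env = simpy.Environment()
    tables = simpy.Resource(env, capacity=3)  # rerun with capacity=4 and compare the waits
    env.process(arrivals(env, tables))
    env.run()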
Daft is an open-source distributed dataframe library built for "Complex Data" (data that doesn't usually fit in a SQL table, such as images, videos, documents, etc.).
Experiment Locally, Scale Up in the Cloud
Daft grows with you and is built to run just as efficiently/seamlessly in a notebook on your laptop or on a Ray cluster consisting of thousands of machines with GPUs.
Pythonic
Daft lets you have tables of any Python object such as images/audio/documents/genomic files. This makes it really easy to process your Complex Data alongside all your regular tabular data. Daft is dynamically typed and built for fast iteration, experimentation and productionization.
Blazing Fast
Daft is built for distributed computing and fully utilizes all of your machine's or cluster's resources. It uses modern technologies such as Apache Arrow, Parquet, and Iceberg for optimizing data serialization and transport.
Vaex is an incredibly powerful DataFrame library that allows one to work with datasets much larger than RAM on a single node. It combines memory mapping, lazy evaluations, efficient C++ algorithms, and a variety of other tricks to empower your off-the-shelf laptop and make it crunch through a billion samples in real time.
A common use-case for Vaex is as a backend for data apps, especially if one needs to process, transform, and visualize a larger amount of data in real time. Vaex implements a number of features that have been specifically designed to improve performance of data hungry dashboards or apps, namely:
- caching
- async evaluations
- early stopping of operations
- progress bars
In this talk we will showcase how you can use these features to build efficient dashboards and data apps, regardless of the data app library you prefer using.
With Python emerging as the primary language for data science, pandas has grown rapidly to become one of the standard data science libraries. One of the known limitations in pandas is that it does not scale with your data volume linearly due to single-machine processing.
Pandas API on Spark overcomes the limitation, enabling users to work with large datasets by leveraging Apache Spark. In this talk, we will introduce Pandas API on Spark and help you scale your existing data science workloads using that. Furthermore, we will share the cutting-edge features in Pandas API on Spark.
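A minimal sketch of the idea: the familiar pandas calls, executed by Spark under the hood (the file name is illustrative).

    import pyspark.pandas as ps

    psdf = ps.read_parquet("events.parquet")      # distributed read
    daily = psdf.groupby("date")["value"].mean()  # pandas syntax, Spark execution
    print(daily.head())

    # Collect into a plain pandas object only when the result fits in memory.
    pdf = daily.to_pandas()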
"It works on my machine"... those dreaded words.
"I'm not a developer, I don't know how to test"... arghhh.
"Let QA test it"....
No more excuses. Learn how to debug and test Pandas code.
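A minimal sketch of what testing pandas code can look like: a pure function over a DataFrame checked with pandas' own testing helpers (the example data is made up); run it with pytest.

    import pandas as pd
    import pandas.testing as pdt

    def add_total(df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        out["total"] = out["price"] * out["quantity"]
        return out

    def test_add_total():
        df = pd.DataFrame({"price": [2.0, 3.0], "quantity": [1, 4]})
        expected = df.assign(total=[2.0, 12.0])
        pdt.assert_frame_equal(add_total(df), expected)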
Quincy Larson is the Founder of freecodecamp.org.
Transformer models are all around in the deep learning community and this talk will help to better understand why transformers achieve such impressive results. Using various explainability techniques and plain numpy examples, participants will gain an understanding of the attention mechanism, its implementation, and how it all comes together.
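In the same spirit, a plain-NumPy sketch of scaled dot-product attention, the core mechanism the talk unpacks (shapes are illustrative: 4 tokens, model dimension 8):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
        weights = softmax(scores, axis=-1)  # each row sums to 1: how much a token attends to the others
        return weights @ V, weights

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
    output, weights = attention(Q, K, V)
    print(weights.round(2))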
Lightning Talks are short 5-10 minute sessions presented by community members on a variety of interesting topics.
This talk will show you a simple yet effective technique to visualize larger-than-memory datasets on your laptop by leveraging SQLite or DuckDB. No need to spin up a Spark cluster!
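A hedged sketch of the technique with DuckDB: push the aggregation down to the database so only the small summary ever reaches pandas and your plotting library (the Parquet file and columns are illustrative).

    import duckdb
    import matplotlib.pyplot as plt

    summary = duckdb.query("""
        SELECT category, count(*) AS n, avg(amount) AS mean_amount
        FROM 'huge_dataset.parquet'   -- scanned out-of-core, never fully loaded
        GROUP BY category
        ORDER BY n DESC
    """).df()

    summary.plot.bar(x="category", y="n")
    plt.show()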
Join us for the traditional PyData Pub Quiz, hosted by quizmasters James Powell and Cameron Riddell. The event is open to everyone and will be located in Gather.
Why is the process of transforming research into a “real world” product so full of question marks? We often know where the research journey starts but have uncertainty about how and WHEN it ends.
In this talk, I will share my own experience leading algorithmic teams through the cycle of research into the production of live-streaming AI products. I will also share how to balance agile incremental delivery with the giant leaps forward that require longer research, and how understanding the minimum viable product (MVP) way of thinking can help not only managers but every developer. Learn to outline an MVP for new AI capabilities and move forward with production in mind, while always raising the quality standards. At the end of this session, you will get the boost you need to take the data-driven experimental mindset to the next level, spiced with methodologies you can adapt to development as well as research.
The value of an ML model is not realized until it is deployed and served in production. Building an ML application is more challenging than building a traditional application due to the added complexities from models and data on top of the application code. Using web serving frameworks (e.g. FastAPI) can work for simple cases but falls short on performance and efficiency. Alternatively, using pre-packaged model servers (e.g. Triton Inference Server) can be ideal for low-latency serving and resource utilization but lacks flexibility in defining custom logic and dependencies. BentoML abstracts these complexities by creating separate runtimes for IO-intensive preprocessing logic and compute-intensive model inference logic. At the same time, BentoML offers an intuitive and flexible Python-first SDK for defining custom preprocessing logic, orchestrating multi-model inference, and integrating with other frameworks in the MLOps ecosystem.
Model traceability and reproducibility are crucial steps when deploying machine learning models. Model traceability allows us to know which version of the model generated which prediction. Model reproducibility ensures that we can roll back to the previous versions of the model anytime we want.
We, as ML engineers, designed reusable workflows which enable data scientists to follow these two principles by design.
Automatic speech recognition (ASR) is used in many devices to identify bilingual speech data. A bilingual, or in more scientific terms code-switched, language is two or more languages being mixed within a speech utterance. In this presentation, learn about different deep learning techniques that can be used for the classification of such speech utterances. If you are a beginner in this field and don't know where to start, join me to explore this use case and learn something new!
Named entity recognition models might not be able to handle a wide variety of spans, but Spancat certainly can! Within our open-source library for NLP, spaCy, we've created a NER model to handle overlapping and arbitrary text spans. Dive into named entity recognition, its limitations, and how we've solved them with a solution-focused talk and practical applications.
Programmers, regardless of their level of experience, enjoy solving increasingly complex challenges within their domains of expertise, and one of the main reasons they can spend more time on those challenges is the workflows they put in place around their projects. Data Engineers build pipelines to make sure the company's data is in optimal condition for Analysts to answer business-critical questions, for Data Scientists to automate the selection, engineering, and analysis of distinct features before training models, and for machine learning engineers to know where to get data from, or send it to, for the APIs they build. Developers, in turn, automate the infrastructure of software products to reduce the time to market of new features. These groups of data professionals and engineers are not too foreign to each other, as they all speak the same language: Python. The goal of this workshop is to dive deep into different workflow patterns for building pipelines for data and machine learning projects. In other words, this workshop bridges the gap between building one-off projects and building automated, reusable pipelines, all while creating an environment that welcomes both newcomers and experts, whether they come from the data and machine learning fields or from engineering.
What would the sunset painted by van Gogh look like? And the front of your house? This is entirely possible with Deep Learning. The Neural Style Transfer technique composes one image in the style of another, transforming its appearance while preserving its content.
In this lecture, we will introduce the concepts of Deep Learning and neural networks, and walk step by step through carrying out style transfer.
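As a rough sketch of the core idea (random tensors stand in for the feature maps a pretrained network such as VGG would produce), style is compared via Gram matrices while content is compared directly:

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    # features: (channels, height, width) feature map from a CNN layer
    c, h, w = features.shape
    f = features.view(c, h * w)
    return (f @ f.t()) / (c * h * w)   # channel-to-channel correlations

# Random stand-ins for the content, style, and generated-image feature maps.
content_feat, style_feat = torch.rand(64, 32, 32), torch.rand(64, 32, 32)
generated_feat = torch.rand(64, 32, 32, requires_grad=True)

content_loss = torch.mean((generated_feat - content_feat) ** 2)
style_loss = torch.mean((gram_matrix(generated_feat) - gram_matrix(style_feat)) ** 2)
total_loss = content_loss + 1e3 * style_loss   # the weights here are arbitrary
total_loss.backward()                          # gradients flow back into the image
```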
There are many stories about Data Science hires that end up working in silos, buried in ad hoc business requests. According to Gartner, only 20% of analytic insights will deliver business outcomes in 2022, and a large number of Machine Learning models never make it to production. On top of that, work satisfaction among data professionals is staggeringly low; for instance, 97% of data engineers reported feeling burnt out in a 2021 Wakefield Research survey. Furthermore, despite living in the era of information, many business executives make decisions based on guesswork because they lack timely access to relevant data. This talk covers why many data initiatives fail and, more importantly, how to prevent it. I lay out a number of practical approaches based on work experience that will help you unlock the potential of data and analytics: from building the case and gaining buy-in to promoting a fact-based decision-making culture. This talk is for you if you are a business leader sponsoring data initiatives, if you work in data applications, or if you would benefit from enhanced analytics.
Lightning Talks are short 5-10 minute sessions presented by community members on a variety of interesting topics.
We learn about the world from data, drawing on a broad array of statistical and inferential tools. The problem is that causal reasoning is needed to answer many of our questions, but few data scientists have this in their skill set. This talk will give a high-level introduction to aspects of causal reasoning and how it is complemented by Bayesian inference. A worked example will be given of how to answer what-if questions.
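As a tiny hedged illustration of the kind of what-if question involved (the structural model below is invented for the example), simulating an intervention in a known causal model answers it directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A toy structural causal model: marketing spend drives traffic,
# and traffic drives sales (relationships invented for illustration).
spend = rng.normal(10, 2, n)
traffic = 3 * spend + rng.normal(0, 1, n)
sales = 2 * traffic + rng.normal(0, 1, n)

# "What if we forced spend to 15?" -- simulate the intervention do(spend=15).
spend_do = np.full(n, 15.0)
traffic_do = 3 * spend_do + rng.normal(0, 1, n)
sales_do = 2 * traffic_do + rng.normal(0, 1, n)

print(f"Expected sales under do(spend=15): {sales_do.mean():.1f} "
      f"vs observed average: {sales.mean():.1f}")
```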
The real world is a constant source of ever-changing and non-stationary data. That ultimately means that even the best ML models will eventually go stale. Data distribution shifts, in all of their forms, are one of the major post-production concerns for any ML/data practitioner. As organizations are increasingly relying on ML to improve performance as intended outside of the lab, the need for efficient debugging and troubleshooting tools in the ML operations world also increases. That becomes especially challenging when taking into consideration common requirements in the production environment, such as scalability, privacy, security, and real-time concerns.
In this talk, Data Scientist Felipe Adachi will talk about different types of data distribution shifts and how these issues can affect your ML application. Furthermore, the speaker will discuss the challenges of enabling distribution shift detection in data in a lightweight and scalable manner by calculating approximate statistics for drift measurements. Finally, the speaker will walk through steps that data scientists and ML engineers can take in order to surface data distribution shift issues in a practical manner, such as visually inspecting histograms, applying statistical tests and ensuring quality with data validation checks.
Requirements: Access to Google Colab Environment
Additional Material: https://colab.research.google.com/drive/1xOcAq8NwPazmQFhXVEvzRxXw5LiFqvfj?usp=sharing
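One lightweight way to surface a shift, shown here as a hedged sketch with synthetic data (the talk itself focuses on approximate statistics rather than raw-data tests), is a two-sample test between a reference window and a production window:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)    # reference window
production_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)  # drifted window

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
# production distribution no longer matches the training distribution.
statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:
    print(f"Possible drift detected (KS={statistic:.3f}, p={p_value:.1e})")
```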
It’s common to hear about demand forecasting in the e-commerce ecosystem. Indeed, it plays a pivotal role in logistics and inventory applications. However, due to the uncertainty impacting demand and the stochastic nature of most downstream applications, the need for probabilistic demand forecasting emerges. Moreover, in most realistic use cases, you’ll have to forecast thousands if not hundreds of thousands of time series. The problem we will explore together is: how can we get probabilistic forecasts that embrace uncertainty and scale?
The talk is light-hearted, contains few math formulas, and is aimed at forecasting practitioners! If you are new to the topic of forecasting, you'll be able to follow along! We take the time to pose the problem and build up from there.
In this talk we present Hamilton, a novel open-source framework for developing and maintaining scalable feature engineering dataflows. Hamilton was initially built to solve the problem of managing a codebase of transforms on pandas dataframes, enabling a data science team to scale the capabilities they offer with the complexity of their business. Since then, it has grown into a general-purpose tool for writing and maintaining dataflows in Python. We introduce the framework, discuss its motivations and initial successes at Stitch Fix, and share recent extensions that seamlessly integrate it with distributed compute offerings, such as Dask, Ray, and Spark.
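A minimal hedged sketch of the Hamilton style (column names are hypothetical): each function declares the column it produces, its parameters name the columns it depends on, and the driver assembles and executes the resulting dataflow.

```python
# features.py -- every function defines a column; its parameters are its inputs.
import pandas as pd

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    return spend / signups

def spend_zero_mean(spend: pd.Series) -> pd.Series:
    return spend - spend.mean()
```

```python
# run.py -- the driver wires the functions into a DAG and executes it.
import pandas as pd
from hamilton import driver
import features

inputs = {
    "spend": pd.Series([10.0, 20.0, 30.0]),
    "signups": pd.Series([1.0, 2.0, 3.0]),
}
dr = driver.Driver({}, features)
df = dr.execute(["spend_per_signup", "spend_zero_mean"], inputs=inputs)
print(df)
```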
Media Mix Modeling, also called Marketing Mix Modeling (MMM), is a technique that helps advertisers to quantify the impact of several marketing investments on sales.
If a company advertises in multiple media (TV, digital ads, magazines, etc.), how can we measure the effectiveness and make future budget allocation decisions? Traditionally, regression modeling has been used, but obtaining actionable insights with that approach has been challenging.
Recently, many researchers and data scientists have tackled this problem using Bayesian statistical approaches. For example, Google has published multiple papers about this topic.
In this talk, I will show the key concepts of a Bayesian approach to MMM, its implementation using Python, and practical tips.
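To make the idea concrete, here is a deliberately simplified hedged sketch (synthetic data, a single channel, and a fixed geometric adstock) of a Bayesian media mix regression in PyMC; real models add saturation curves, multiple channels, and priors informed by experiments:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
weeks = 104
tv_spend = rng.gamma(2.0, 10.0, size=weeks)

def adstock(x, decay=0.5):
    # Simple geometric carry-over effect of advertising.
    out = np.zeros_like(x)
    for t in range(len(x)):
        out[t] = x[t] + (decay * out[t - 1] if t > 0 else 0.0)
    return out

tv_adstocked = adstock(tv_spend)
sales = 50 + 1.5 * tv_adstocked + rng.normal(0, 5, size=weeks)  # synthetic "truth"

with pm.Model():
    intercept = pm.Normal("intercept", mu=0, sigma=50)
    beta_tv = pm.HalfNormal("beta_tv", sigma=5)   # channel effect must be non-negative
    sigma = pm.HalfNormal("sigma", sigma=10)
    pm.Normal("sales", mu=intercept + beta_tv * tv_adstocked, sigma=sigma,
              observed=sales)
    idata = pm.sample(1000, tune=1000, chains=2, progressbar=False)

print(idata.posterior["beta_tv"].mean().item())  # posterior mean of the TV effect
```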
The energy sector has gained great attention in 2022 due to the current global energy crisis. Understanding which technologies and techniques are suitable for this sector is crucial to guarantee an effective transition to a future with cleaner and more efficient energy sources. This talk aims to educate tech professionals interested in the applications of machine learning in the energy sector, especially when it comes to time series analysis and forecasting. The audience is expected to have a basic understanding of data science and machine learning, and will be introduced to the concepts of time series, as well as the most common techniques utilized in the sector.
Throughout the COVID pandemic, we’ve experienced extremes brought on by economic downturns and uncertainty across industries—to this day, we are feeling these effects around the globe. In fact, statistics show that many professionals have changed careers following the waves of layoffs that have recently occurred—but how? How can we best prepare for this type of situation, and how easy or difficult is it to change careers? If these questions have been on your mind, join this session to learn about several global industry trends, ways to adapt to career changes, and how to grow your tech skills and leverage certain platforms to support your learning process.
MLOps encapsulates the discipline of – and infrastructure that supports – building and maintaining machine learning models in production. This tutorial highlights five challenges in carrying this out effectively: scalability, data quality, reproducibility, recoverability, and auditability. As a data science and machine learning practitioner, you’ll learn how Flyte, an open-source data- and machine-learning-aware orchestration tool, is designed to overcome these challenges, and you'll get your hands dirty using Flyte to build ML pipelines of increasing complexity and scale!
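As a hedged taste of what the hands-on part looks like (the task and workflow names are invented), Flyte pipelines are plain Python functions decorated as strongly typed tasks and workflows:

```python
from typing import List
from flytekit import task, workflow

@task
def get_data() -> List[float]:
    # Stub data; the tutorial pulls a real dataset here.
    return [1.0, 2.0, 3.0, 4.0]

@task
def compute_mean(values: List[float]) -> float:
    return sum(values) / len(values)

@workflow
def training_pipeline() -> float:
    # Flyte turns these calls into a typed, versioned, recoverable DAG.
    return compute_mean(values=get_data())

if __name__ == "__main__":
    print(training_pipeline())  # workflows also run locally for quick iteration
```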
Inspirational sports speeches have motivated and reinvigorated folks for years. Whether you’re a developer or an athlete, they’ve stood the test of time because even the smartest, the bravest, and the most resilient need some encouragement on occasion.
During our time together, we’ll use Python and a speech-to-text provider to transcribe sports podcasts that contain inspirational speeches. We’ll discover insights from the transcripts to determine which ones might give you a boost of energy or rally your team.
We’ll discover common topics of each sports podcast episode and measure how they leave us feeling: victorious or perhaps overcoming the agony of defeat. We’ll also investigate if there are any similarities and differences in the sports speeches and what makes a great motivational speech that moves people to action.
By the end, you’ll have a better understanding of using speech recognition in real-world scenarios and using features of Machine Learning with Python to derive insights.
This talk is for developers of all levels, including beginners.
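As one hedged illustration of "measuring how a speech leaves us feeling" (the quotes are invented, and the speaker's actual speech-to-text provider and models may differ), an off-the-shelf sentiment pipeline can score transcript segments:

```python
from transformers import pipeline

# Hypothetical transcript snippets; in the talk these come from a
# speech-to-text provider applied to sports podcasts.
segments = [
    "Great moments are born from great opportunity.",
    "We lost tonight, and it hurts, but we will come back stronger.",
]

sentiment = pipeline("sentiment-analysis")  # downloads a default model
for text, result in zip(segments, sentiment(segments)):
    print(f"{result['label']:>8} ({result['score']:.2f}) - {text}")
```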
You need to quickly process a large amount of data—but running Python code is slow.
Libraries like NumPy and Pandas bridge this performance gap using a technique called vectorization.
In order to take full advantage of these libraries to speed up your code, it's helpful to understand what vectorization means and when and how it works.
In this talk you'll learn what vectorization means (there are three different definitions!), how it speeds up your code, and how to apply it in your own code.
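A small hedged example of the payoff: the same calculation written as a Python-level loop and as a single NumPy expression that operates on the whole array at once.

```python
import numpy as np

prices = np.random.default_rng(0).uniform(1, 100, size=1_000_000)

# Slow: a Python-level loop, one element at a time.
def discounted_loop(prices, rate=0.1):
    out = np.empty_like(prices)
    for i, p in enumerate(prices):
        out[i] = p * (1 - rate)
    return out

# Fast: the same computation vectorized -- the loop runs in compiled code.
def discounted_vectorized(prices, rate=0.1):
    return prices * (1 - rate)

assert np.allclose(discounted_loop(prices), discounted_vectorized(prices))
```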
Bad data is likely the largest factor limiting your model's performance. We'll talk about common data errors and how you can fix them today using Galileo. Although the majority of examples used will be in CV and NLP, the same insights apply to other modalities!
The electrochemical battery is one of the most important technologies for a renewable future. In this beginner-friendly talk, we will walk through how fundamental quantum mechanics and data science inform how we fine-tune battery materials for higher performance. We will also show how we used these techniques to computationally model a lithium-oxygen battery in Python.
The International Monetary Fund (IMF) provides a huge variety of economic datasets from different countries. We have explored the Python API for data extraction from the IMF, which allows users (primarily economists or financial analysts) to access the data. The structure of the underlying JSON datasets is quite complex for an unprepared user. In the talk, we will demonstrate the API workflow and go over the issues that led us to design a new, easier-to-use API, which is currently under development. This is joint work with Dr. Sou-Cheng Choi (Illinois Institute of Technology and SAS Institute Inc.).
The talk is primarily directed at data analysts and economists interested in utilizing IMF's macroeconomic data.
There’s a growing interest from small and large companies alike in moving their data and their analytical pipelines into the Cloud, as it brings significant cost and operational benefits to businesses. Despite this, it can be unclear and sometimes confusing to know how cloud services can be used to replicate your existing analytical solutions in the Cloud, or even how those services fit together to build new solutions.
The goal of this talk is to help answer these two questions. First by explaining what modern analytics look like in cloud environments and then by presenting a live use case for building an end-to-end analytical solution in the context of fraud detection for E-commerce businesses.
This talk assumes some background knowledge, such as familiarity with the Hadoop ecosystem and the main tools used alongside it (Airflow, Kafka, Spark, etc.); an overall idea will be more than sufficient. Some experience with building and deploying machine learning models (some MLOps experience) is also assumed. The target audience is therefore data scientists/engineers with 4-5 years of experience working in analytics, and/or architects looking to move their analytics solutions to the Cloud but still unsure how it all fits together.
At the end of the talk, the audience will have a clear understanding of how modern analytics can be performed in the cloud and what a typical modern data architecture looks like. In the context of AWS, the audience will also have an understanding of the AWS analytics service offerings and what services can be used for/tailored to their needs. Finally, the audience will gain a clearer idea of how they can leverage ML capabilities to build a full pipeline in the cloud while cutting their development time by half.
The proposed outline for the talk will follow the description below:
The evolution of analytics from the 90s to current day (2-3 mins)
Modern analytics in the Cloud - what’s available (4-5 mins)
How analytics is done in the Cloud - tools to help manage the cloud solutions (5 mins)
Case study - Fraud Detection for Ecommerce (2-3 mins)
Refresher concepts (3 mins)
Breaking down the architecture (6-7 mins)
Scaling and improving the solution (5-6 mins)
Quarto is an open-source scientific and technical publishing system that builds on standard markdown with features essential for scientific communication. The system has support for reproducible embedded computations, equations, citations, crossrefs, figure panels, callouts, advanced layout, and more. In this talk we'll explore the use of Quarto with Python, describing both integration with IPython/Jupyter and the Quarto VS Code extension.
"Is a lion closer to being a giraffe or an elephant?"
It is not a question anyone asks.
So why treat the classification of age groups or medical condition severity the same way you would that kind of problem?
The talk will walk you through a review of regression-based approaches for what may seem like classification problems. Unlock the true potential of your labels!
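A tiny hedged sketch of the idea (the dataset is synthetic): encode ordered labels as numbers, fit a regressor, then snap predictions back onto the label scale, so mistakes between neighbouring grades cost less than distant ones.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Ordered labels 0 < 1 < 2 < 3 (e.g. severity grades), synthetic for illustration.
y = np.clip((X @ np.array([1.0, 0.5, -0.5]) + rng.normal(0, 0.3, 500)).round(), 0, 3)

model = Ridge().fit(X, y)                       # treat the labels as a number line
pred = np.clip(model.predict(X).round(), 0, 3)  # snap back to valid grades
print("within-one-grade accuracy:", np.mean(np.abs(pred - y) <= 1))
```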
Hugging Face Transformers is a popular open-source project offering cutting-edge Machine Learning (ML) models, but meeting the computational requirements of the advanced models it provides often requires scaling beyond a single machine. In this session, we explore the integration between Hugging Face and Ray AI Runtime (AIR), allowing users to scale their model training and data loading seamlessly. We will dive deep into the implementation and API and explore how we can use Ray AIR to create an end-to-end Hugging Face workflow, from data ingest through fine-tuning and HPO to inference and serving.
Pia is the co-founder and CEO of Open Collective.
Ok, I lied, I still write tests. But instead of the example-based tests we normally write, have you heard of property-based testing? With Hypothesis, instead of you thinking about what data to test against, the library generates the test data, including boundary cases, for you.
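For the unfamiliar, a minimal hedged example of the style (the function under test is invented): you state a property, and Hypothesis hunts for inputs, including boundary cases, that break it.

```python
from hypothesis import given, strategies as st

def run_length_encode(s: str) -> list:
    # Toy function under test (invented for this example).
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

@given(st.text())
def test_round_trip(s):
    # Property: decoding the encoding gives back the original string.
    decoded = "".join(ch * count for ch, count in run_length_encode(s))
    assert decoded == s
```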
Lightning Talks are short 5-10 minute sessions presented by community members on a variety of interesting topics.
How can we make smart decisions when optimizing a black-box function?
Expensive black-box optimization refers to situations where we need to maximize/minimize some input–output process, but we cannot look inside and see how the output is determined by the input.
Making the problem more challenging is the cost of evaluating the function in terms of money, time, or other safety-critical conditions, limiting the size of the data set we can collect.
Black-box optimization can be found in many tasks such as hyperparameter tuning in machine learning, product recommendation, process optimization in physics, or scientific and drug discovery.
Bayesian optimization (BayesOpt) sets out to solve this black-box optimization problem by combining probabilistic machine learning (ML) and decision theory.
This technique gives us a way to intelligently design queries to the function to be optimized while balancing between exploration (looking at regions without observed data) and exploitation (zeroing in on good-performance regions).
While BayesOpt has proven effective at many real-world black-box optimization tasks, many ML practitioners still shy away from it, believing that they need a highly technical background to understand and use BayesOpt.
This talk aims to dispel that notion and offers a friendly introduction to BayesOpt, including its fundamentals, how to get it running in Python, and common practices.
Data scientists and ML practitioners who are interested in hyperparameter tuning, A/B testing, or more generally experimentation and decision making will benefit from this talk.
While most background knowledge necessary to follow the talk will be covered, the audience should be familiar with common concepts in ML such as training data, predictive models, multivariate normal distributions, etc.
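As a hedged taste of "getting it running in Python" (the objective is a toy function, and the talk's own examples may use different tooling), scikit-optimize's `gp_minimize` performs the query-design loop described above under a tight evaluation budget:

```python
from skopt import gp_minimize

def expensive_black_box(params):
    # Stand-in for an expensive experiment (e.g. training a model with a
    # given learning rate); we only observe the output, never the formula.
    x = params[0]
    return (x - 0.3) ** 2 + 0.1 * (x ** 3)

result = gp_minimize(
    expensive_black_box,
    dimensions=[(-2.0, 2.0)],  # search space for the single input
    n_calls=20,                # limited evaluation budget
    random_state=0,
)
print("best input:", result.x, "best value:", result.fun)
```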
Let’s scrape the Twitter metadata together and go below the surface with tweepy. Want to find out if the tweets you follow are trying to persuade you to do things? Have the feeling that advocates for some issues use certain emotions to push you in certain directions? Now you can find out!
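A hedged starting point (you need your own bearer token, and the query is just an example), assuming the Twitter API v2 client in tweepy 4.x:

```python
import tweepy

# Requires your own credentials from the Twitter developer portal.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Pull recent tweets plus the metadata we want to dig into.
response = client.search_recent_tweets(
    query="climate -is:retweet lang:en",          # example query
    tweet_fields=["created_at", "public_metrics"],
    max_results=10,
)
for tweet in response.data or []:
    print(tweet.created_at, tweet.public_metrics["retweet_count"], tweet.text[:80])
```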