PyData Global 2022

Too much data? When big data starts to become a bad idea
12-02, 16:00–18:00 (UTC), Workshop/Tutorial II

Nowadays we know that social media and tech giants are harvesting vast amounts of data from their users, and most of us agree that these companies' ability to deliver tailored suggestions and customization is driven by big data.

However, this raises a question: Is more data always better? Does more data mean a more accurate model? When do you need big data, and when does it start to become a bad idea? Let's find out in this panel session.


While big data seems like a golden ticket to solving every problem, it also takes more resources to manage and use. Moreover, more data does not mean better data quality; it may be the opposite, since the more data there is, the harder it is to maintain good quality.

Judging how much data we need, and knowing when to stop, is something many of us overlook. In this panel session, we invite industry leaders to discuss whether we should always aim for more data and, if not, when to stop. We will also talk about how to judge how much data we need and what to do when we have too much or too little.

Panelists:
- Katrina Riehl, PhD - Head of Streamlit Data Team @ Snowflake, President of Board of Directors @ NumFOCUS
- Jesper Sören Dramsch
- Alexander Hendorf
- John Sandall - CEO & Principal Data Scientist @ Coefficient


Prior Knowledge Expected

No previous knowledge expected

Before working in Developer Relations, Cheuk was a Data Scientist at various companies, work that demanded strong numerical and programming skills, especially in Python. To follow her passion for the tech community, Cheuk is now the Developer Advocate for Anaconda.

Besides her work, Cheuk enjoys speaking at various conferences. Cheuk also organises events for developers, including the conferences EuroPython (of which she is a board member), PyData Global and Pyjamas Conf. Believing in Tech Diversity and Inclusion, Cheuk regularly organises workshops and mentored sprints for minority groups. In 2021, Cheuk became a Python Software Foundation Fellow.

Jesper Dramsch works at the intersection of machine learning and physical, real-world data. Currently, they're working as a scientist for machine learning in numerical weather prediction at the coordinated organisation ECMWF.

Previously, Jesper worked on applied exploratory machine learning problems, e.g. satellites and Lidar imaging on trains, and defended a PhD in machine learning for geoscience. During the PhD, Jesper wrote multiple publications and often presented at workshops and conferences, eventually giving keynote presentations on the future of machine learning.

Moreover, they have worked as a consultant and as a machine learning and Python educator for international companies and the UK government. Their courses on Skillshare have been watched for over 30 days in total by over 5000 students. Additionally, they create educational notebooks on Kaggle, reaching rank 81 worldwide. Recently, Jesper was invited into the YouTube Partner Programme.

Alexander Hendorf runs his own company, opotoc, providing expertise in data and artificial intelligence, for example at the digital excellence consultancy KÖNIGSWEG. Through his commitment as a speaker and chair of various international conferences, he is a proven expert in the field of data intelligence. He has many years of experience in the practical application, introduction and communication of data and AI-driven strategies and decision-making processes in industry. He is a Python Software Foundation Fellow and EuroPython Fellow, likes to work in small dedicated teams, and loves to work with and contribute to the Python and PyData community.

Katrina is the Head of the Streamlit Data Team at Snowflake. She is joining Georgetown University as adjunct faculty this Spring. She also volunteers as the President of the Board of Directors at NumFOCUS, a non-profit supporting the PyData open source ecosystem. For almost two decades, Katrina has worked extensively in the fields of scientific computing, machine learning, data mining, and visualization. Most notably, she has helped lead data science efforts at the University of Texas Austin Applied Research Laboratory, Apple, HomeAway (now Vrbo), and Cloudflare. Katrina received MS and PhD degrees in Computer Science from the University of Texas at Dallas.

John Sandall is the CEO and Principal Data Scientist at Coefficient.

His experience in data science and software engineering spans multiple industries and applications, and his passion for the power of data extends far beyond his work for Coefficient’s clients. In April 2017 he created SixFifty in order to predict the UK General Election using open data and advanced modelling techniques. Previous experience includes Lead Data Scientist at YPlan, business analytics at Apple, genomics research at Imperial College London, building an ed-tech startup at Knodium, developing strategy & technological infrastructure for international non-profit startup STIR Education, and losing sleep to many hackathons along the way.

John is also a co-organiser of PyData London, co-founded Humble Data in 2019 to promote diversity in data science through a programme of free bootcamps, and in 2020 was a Committee Chair for the PyData Global Conference. He is currently a Fellow of Newspeak House with interests in open data, AI ethics and promoting diversity in tech.