PyData Global 2022

Crowd-Kit: A Scikit-Learn for Crowdsourced Annotations
12-01, 13:30–14:00 (UTC), Talk Track II

This talk presents Crowd-Kit, an open-source library for computational quality control, followed by a demonstration.
Crowdsourced annotations usually require post-processing because they are heterogeneous: raw data contains errors, is biased, and is non-trivial to combine. Crowd-Kit provides methods for aggregation, uncertainty estimation, and agreement measurement that help turn crowdsourced labels into an interpretable result.


A huge amount of data for machine learning is gathered through crowdsourcing pipelines. Crowdsourcing is a useful tool when one needs to collect large, problem-specific datasets for training, testing, and monitoring ML models in a relatively short period of time. However, aggregating the results is nontrivial for tasks other than classification, and raw results require processing before their real value can be extracted. Crowd-Kit is a library designed to handle crowdsourced video, audio, image, textual, and categorical data with an interface similar to scikit-learn.
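To make the idea of aggregation behind a scikit-learn-like interface concrete, here is a minimal sketch of majority voting over categorical labels. It does not use Crowd-Kit itself: the class name `MajorityVoteSketch`, the `(task, worker, label)` tuple format, and the sample annotations are all assumptions made for illustration, and the library's own aggregators may differ in input format and behavior.

```python
from collections import Counter, defaultdict


class MajorityVoteSketch:
    """Hypothetical aggregator with a scikit-learn-style fit/fit_predict interface.

    Illustrative only; this is not Crowd-Kit's implementation.
    """

    def fit(self, annotations):
        # annotations: iterable of (task, worker, label) triples (assumed format).
        votes = defaultdict(Counter)
        for task, _worker, label in annotations:
            votes[task][label] += 1
        # Keep the most frequent label per task; ties go to the label seen first.
        self.labels_ = {
            task: counts.most_common(1)[0][0] for task, counts in votes.items()
        }
        return self

    def fit_predict(self, annotations):
        return self.fit(annotations).labels_


# Made-up annotations: three workers label img1, two workers label img2.
annotations = [
    ("img1", "w1", "cat"),
    ("img1", "w2", "cat"),
    ("img1", "w3", "dog"),
    ("img2", "w1", "dog"),
    ("img2", "w2", "dog"),
]

result = MajorityVoteSketch().fit_predict(annotations)
print(result)  # {'img1': 'cat', 'img2': 'dog'}
```

Majority voting treats every worker as equally reliable, which is why it is only a baseline: it cannot correct for the errors and biases mentioned above, and it does not extend directly to non-categorical data such as text or audio.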


Prior Knowledge Expected

Previous knowledge expected

Evgeniya is a Data Evangelist at Toloka, a data labelling platform for machine learning pipelines used worldwide by approximately 2,000 large and small businesses.
Her career has included roles as an analyst-developer, a machine learning engineer, a solution architect, and a business analyst, including two years of experience working with crowdsourcing. Evgeniya's background is in Artificial Intelligence & Data Engineering, and she is currently pursuing her master's degree at the Technical University of Munich.