PyData Global 2022

Data annotation for humans: Creating and refining annotation guidelines from a UX perspective
12-02, 16:00–18:00 (UTC), Workshop/Tutorial I

In this workshop, attendees will learn how to create data annotation guidelines from a user experience (UX) perspective.

Creating annotation guidelines from a UX perspective means designing them for usability, which results in a better experience for annotators and in more effective and productive annotation campaigns. With Python at the forefront of machine learning and data science, we believe that the Python community will benefit from learning more about the design of data annotation guidelines and why they are essential for creating great machine learning applications.


What is this workshop about?

Data annotation, or the process of adding structured information to raw data, is the invisible star player of many machine learning tasks such as sentiment analysis, named-entity recognition, and question answering, among others. Low-quality annotations may be costly to redo, as manual work can be expensive and time-consuming. Annotation guidelines, in turn, are technical documents that annotators use to define the task at hand, resolve challenging annotation cases, and, in general, keep annotation efforts consistent and reliable across annotators. These documents are an essential component of many machine-learning projects, and there is evidence that they can indirectly impact annotation outcomes.

Data annotation can be complex and labor-intensive. Annotators rely heavily on annotation guidelines as they spend tens or even hundreds of hours in front of a computer labeling data. Despite their importance, annotation guidelines are not always designed with usability in mind; they often lack adequate structure, are difficult to navigate, and their wording can be ambiguous or confusing. As we will see, these shortcomings can directly impact annotation quality and create unnecessary stress and frustration for both annotators and annotation managers.

In this workshop, attendees will learn the basics of human-centered design and how we can apply UX research techniques to create robust and easier-to-use annotation guidelines. The workshop will include one theoretical and two practical components. The theoretical component will deal with the basic outline of language annotation guidelines, show examples of dry or difficult-to-use annotation guidelines, and introduce UX concepts and techniques that can help us rethink how to design better annotation guidelines.

The practical components will deal with the hands-on creation and testing of annotation guidelines. Attendees will experience both sides of the annotation task: the managerial side and the annotators’ side. For data annotation, we will provide attendees with remote access to Prodigy for the workshop duration. Prodigy is a commercial Python-based tool for data annotation. However, none of the content in this workshop depends on a particular tool, and attendees should be able to apply their newly acquired skills and best practices with whichever tool they choose to work with in the future.

After this workshop, attendees can expect a working knowledge of several UX research techniques and practical experience in creating and improving annotation guidelines.

Curriculum

Introduction:
- Topic #1: Why is data annotation important? (Theoretical)
- Topic #2: What is it like to be an annotator? (Hands-on with data annotation tool)
- Expected duration: 15 minutes

Part one: Annotation guidelines and UX design (Theoretical with group activities)
- Expected duration: 25 minutes

Part two: Defining data annotation tasks. A practical perspective. (Hands-on)
- Expected duration: 15 minutes

1-hour mark
- 5-minute recess + 5-minute Q&A

Part three: Annotation guidelines re-design. Put your new UX knowledge into practice. (Hands-on)
- Expected duration: 20 minutes

Part four: Iterative reliability testing: using annotator agreement to finalize your guidelines
- Expected duration: 15 minutes

Final thoughts and Q&A
- Expected duration: 10 minutes
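
As an illustration of the annotator agreement mentioned in part four, here is a minimal sketch of one common agreement measure, Cohen's kappa, for two annotators. The abstract does not say which metric the workshop uses, so the choice of kappa, the function name `cohen_kappa`, and the example labels are all assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Hypothetical helper for illustration; not taken from the workshop materials.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")

    n = len(labels_a)

    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label at random,
    # given each annotator's own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up sentiment labels from two annotators on the same eight items.
annotator_a = ["POS", "POS", "NEG", "NEG", "POS", "NEG", "POS", "NEG"]
annotator_b = ["POS", "NEG", "NEG", "NEG", "POS", "NEG", "POS", "POS"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

In an iterative setup, a score like this could be recomputed after each guideline revision; a low or falling value would suggest that the guidelines still leave room for conflicting interpretations.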

Requirements

  • No computer programming knowledge or language annotation experience is required.
  • Attendees experienced in data annotation will still benefit from learning how to apply UX principles to the creation of annotation guidelines.
  • Workshop materials will be distributed during the workshop and made publicly available via the "resources" after the workshop.

About us

The facilitators are computational linguists, machine learning engineers, and NLP practitioners with experience creating annotation guidelines for academia and industry.


Prior Knowledge Expected

No previous knowledge expected

Damian Romero is a Ph.D. candidate in Hispanic Linguistics, focusing on corpus linguistics, computational linguistics, and natural language processing, with experience in English, Spanish, and Portuguese projects for academia and industry. He studies politeness and impoliteness in human interactions through computational linguistic methods. His secondary research interests are data annotation guidelines, UX research, and language evolution. He is also a machine learning intern at Explosion and a contributor to the Digital Humanities project Digital Tolkien (https://digitaltolkien.com/).