PyData Global 2022

Text to Data: Make Your Code Malleable, Not Brittle
12-01, 18:30–19:00 (UTC), Talk Track I

Extracting valuable data from unstructured text often results in code that is hard to read, brittle, and difficult to maintain. The problem is that regular expressions embedded directly in a program's control flow do not provide the right level of abstraction. We propose a query language, based on the tuple relational calculus, that facilitates data extraction. Developers can express their intent declaratively, making their code much easier to write, read, and maintain.


We allow programmers to state what they are searching for using higher-level concepts: a query is built from tags, locations, and expressions over relations between locations.

The location of a string of characters within the document is the interval defining its starting and ending position. Locations are grouped into sets and these sets are associated with tags. Tags can be used in conjunctions and disjunctions of interval relations to query for tuples of locations. Consider extracting the date and email from email threads of the form:

"On Wed, Oct 30, 2019 at 10:00 am Jane Doe jane.doe@hotmail.com wrote:"

Tags ON and WROTE can be associated with the locations of the strings "On" and "wrote:" in the body of the email. Tags DATE and EMAIL can be associated with the locations of dates and email addresses in the email body. We can then find 4-tuples (l1, l2, l3, l4) of locations in ON, WROTE, DATE, and EMAIL, respectively, such that each tuple satisfies the following predicate: seq_before(l1, l3, l4, l2) and distance(l1, l2) < 100. The interval relation "seq_before" is satisfied if l1 < l3 < l4 < l2, that is, if the locations appear in that order in the document. The function "distance" computes the number of characters between the end of location l1 and the beginning of location l2. This predicate selects the 4-tuples that follow the pattern of the email thread above. To select only the date and email, we project each 4-tuple to (l3, l4).
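The query above can be sketched in plain Python. This is only an illustration of the idea and not the query language presented in the talk: the names Location, tag, and extract, the regular expressions used to populate each tag, and the reading of "<" as "ends at or before the next location begins" are all assumptions made for this example.

```python
import re
from itertools import product
from typing import NamedTuple


class Location(NamedTuple):
    start: int  # index of the first character
    end: int    # index one past the last character


def tag(pattern, text):
    """Return the locations of all matches of `pattern` (hypothetical helper)."""
    return [Location(m.start(), m.end()) for m in re.finditer(pattern, text)]


def seq_before(*locs):
    """True if each location ends at or before the start of the next (assumed meaning of '<')."""
    return all(a.end <= b.start for a, b in zip(locs, locs[1:]))


def distance(a, b):
    """Number of characters between the end of `a` and the beginning of `b`."""
    return b.start - a.end


def extract(text):
    # Associate each tag with a set of locations (illustrative patterns only).
    on = tag(r"\bOn\b", text)
    wrote = tag(r"\bwrote:", text)
    date = tag(r"[A-Z][a-z]{2}, [A-Z][a-z]{2} \d{1,2}, \d{4}", text)
    email = tag(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)

    # Select the 4-tuples that satisfy the predicate, then project to (l3, l4).
    return [
        (text[l3.start:l3.end], text[l4.start:l4.end])
        for l1, l2, l3, l4 in product(on, wrote, date, email)
        if seq_before(l1, l3, l4, l2) and distance(l1, l2) < 100
    ]


sample = "On Wed, Oct 30, 2019 at 10:00 am Jane Doe jane.doe@hotmail.com wrote:"
print(extract(sample))  # [('Wed, Oct 30, 2019', 'jane.doe@hotmail.com')]
```

Enumerating the full Cartesian product is the simplest way to mirror the tuple-calculus reading of the query; a real implementation would likely prune candidates using the interval ordering instead.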

In this talk, we will describe the query language in detail, along with its implementation in Python. We will present a Jupyter notebook with examples that illustrate how functionality changes and enhancements can be made by changing existing queries or adding new ones, which makes the code much easier to maintain.


Prior Knowledge Expected

No previous knowledge expected

Since earning his Ph.D., David Barrett has spent the last twenty years teaching and applying results from computer-science research to engineer solutions for large-scale software problems. His expertise includes machine learning, software systems, networks, databases, programming languages, and compiler construction. He has presented talks at peer-reviewed academic conferences on data layouts and retrieval scheduling for multimedia, and on dynamic memory allocation for programming languages. He has also presented on document databases and software vulnerabilities at NoSQLNow and the Open Source Summit.

Martha's professional experience in research and development spans decades of work in both academia and industry. After earning her Ph.D., she worked on research in databases, multimedia systems, machine learning, and natural language processing. Her research has been published in peer-reviewed journals and academic conference proceedings. Her development experience spans multiple industries, including high-performance parallel systems, cybersecurity, fintech, and healthcare.