PyData Global 2022

Data Prep for Graphs
12-01, 21:30–22:00 (UTC), Talk Track I

Data science practitioners have a saying that 80% of their time gets spent on data prep. Often this involves tools such as Pandas and Jupyter. Graph Data Science is similar, except the data prep techniques are highly specialized and computationally expensive. Moreover, data prep for graphs is required before commercial tools such as graph databases or visualization can be used effectively. This talk shows examples of data prep for graphs. A progressive example illustrates the challenges plus techniques that leverage open source integrations with the PyData stack: Arrow/Parquet, PSL, Ray, Keyvi, Datasketch, etc.


Graph technologies and use cases are growing in popularity in industry. Open source libraries are available for graph data science, which integrate with the PyData stack and related practices. Tools such as graph databases, visualization, etc., tend to take center stage in discussions about graph technologies.

However, and this is a relatively BIG "however", similar to what was recognized a decade ago when data science became mainstream practice, a great deal of time, effort, and cost must go into data preparation long before these downstream tools can be used effectively.

In the early-ish days of Big Data, many commercial database vendors claimed to provide full suites for data science work. Practitioners found that, in contrast, they spent most of their time on data wrangling, often using tools such as Pandas. This has become the proverbial 80% of data science.

Graph data science is no exception to this rule. Case in point, data visualization tools can render beautiful representations from nearly raw data. Unfortunately, without careful preparation, the beautiful renderings become expensive wallpaper since they don't lead to meaningful outcomes. For example, if a large dataset contains many cycles for a business process where cycles are undefined (e.g., supply networks), or it contains many duplicates (e.g., slight variations of vendor or author names), then we can get pretty pictures, but not meaningful analysis.
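As a rough illustration of the duplicate-name problem, here is a minimal sketch that flags near-duplicate vendor names by comparing character trigrams with Jaccard similarity. It is not taken from the talk: the function names, the sample vendor names, and the 0.5 threshold are all assumptions chosen for illustration. It compares every pair of names, so it only suits small inputs; libraries such as Datasketch approximate the same Jaccard measure with MinHash to avoid all-pairs comparison on large datasets.

```python
from itertools import combinations


def shingles(name, k=3):
    """Return the set of lowercase character k-grams for a name."""
    text = " ".join(name.lower().split())  # normalize case and whitespace
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(names, threshold=0.5):
    """Return pairs of names whose shingle sets meet the threshold.

    The threshold is a hypothetical value; real data needs tuning.
    """
    sets = {name: shingles(name) for name in names}
    return [
        (x, y)
        for x, y in combinations(names, 2)
        if jaccard(sets[x], sets[y]) >= threshold
    ]


# Hypothetical sample data, not from the talk.
vendors = ["Acme Supply Co.", "ACME Supply Co", "Globex Corporation"]
pairs = near_duplicates(vendors)
print(pairs)
```

In this sample the two "Acme" spellings are paired, while "Globex Corporation" shares only a single trigram with them and falls well below the threshold.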

Unfortunately, data preparation techniques for graphs, such as cycle detection, similarity analysis, transitive closure, and unique identifier assignment, often involve graph algorithms or distributed data structures. These are computationally hard problems, expensive to perform, and not well supported at scale by the commercial graph databases.
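Two of the techniques named above can be sketched in plain Python for small, in-memory graphs. This is an illustration under stated assumptions, not the approach shown in the talk: `find_cycle` uses depth-first search to report one directed cycle, and `assign_ids` uses a union-find structure to give every record in a group of "same as" pairs one shared identifier, which amounts to taking the transitive closure of the match pairs. The function names and sample data are hypothetical, and the recursive search would hit Python's recursion limit on deep graphs.

```python
def find_cycle(edges):
    """Return one directed cycle as a list of nodes, or None if acyclic."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}

    def visit(node, path):
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge closes a cycle
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt, path)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node, [])
            if found:
                return found
    return None


def assign_ids(records, same_as):
    """Map each record to a canonical id shared by all linked records."""
    parent = {record: record for record in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller string as root so ids are deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return {record: find(record) for record in records}


# Hypothetical sample data, not from the talk.
supply_edges = [("mill", "bakery"), ("bakery", "cafe"),
                ("cafe", "mill"), ("farm", "mill")]
cycle = find_cycle(supply_edges)

authors = ["A. Smith", "Alice Smith", "Smith, Alice", "Bob Jones"]
matches = [("A. Smith", "Alice Smith"), ("Alice Smith", "Smith, Alice")]
ids = assign_ids(authors, matches)
print(cycle)
print(ids)
```

Note that "A. Smith" and "Smith, Alice" are never matched directly, yet they end up with the same identifier because each is linked to "Alice Smith". At scale, both steps stop fitting on one machine, which is where the distributed tools listed in the abstract come in.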

This talk shows examples of data preparation for graphs, along with an overview of typical graph use cases in industry in which these need to be used. We'll show a progressive example based on recipe data (analogous to customer data in manufacturing) along with use of the PyData stack and other open source integrations such as Ray, Keyvi, Datasketch, Arrow/Parquet, PSL, etc., which help alleviate bottlenecks at scale when working with large graphs.


Prior Knowledge Expected

No previous knowledge expected

Managing Partner at Derwen, Inc. Known as a "player/coach", with core expertise in graph technologies, natural language, data science, cloud computing; ~40 years tech industry experience, ranging from Bell Labs to early-stage start-ups. Board member for Recognai; Advisor for Amplify Partners, Data Spartan, KUNGFU.AI. Lead committer on PyTextRank, kglab. Formerly: Director, Community Evangelism for Apache Spark at Databricks.