PyData Global 2022

The Dask at Hand: Using Dask to Speed up the High Quality Transit Areas dataset for the CA Open Data Portal.
12-01, 22:30–23:00 (UTC), Talk Track I

Where are CA’s frequent, high-quality transit corridors? The CA Public Resources Code defines them, but identifying them requires ongoing access to General Transit Feed Specification (GTFS) data and fairly complex geospatial processing. The Integrated Travel Project within Caltrans tackles this by leveraging the combined powers of Dask and Python to make this dataset publicly available and updated monthly on the CA open data portal.


Where are CA’s frequent, high-quality transit corridors? The CA Public Resources Code defines them, but identifying them requires ongoing access to General Transit Feed Specification (GTFS) data and fairly complex geospatial processing. Luckily, the Integrated Travel Project within Caltrans has a pipeline of GTFS Schedule data and is building out its pipeline of GTFS Realtime data. GTFS data is public, so it only makes sense that we make more of the GTFS data products publicly available and accessible.

GTFS provides all the details about scheduled transit service on a given day. We simply wanted to know where bus service met the threshold of 15-minute frequencies, where rail, ferry, and bus rapid transit stops were, and to make that available to the public. Is that too much to ask? It turns out this required geospatial processing to slice the bus route network into equally sized 1,250-meter segments and to count how many bus arrivals occurred in each segment every hour. The multiple stages of data processing meant we couldn’t simply dump all the GTFS contents from the warehouse into the open data portal.
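As an illustration only (not the talk’s actual pipeline), a minimal sketch of those two steps with geopandas and shapely might look like the following; the CRS, column names, and toy data are all assumptions.

```python
import geopandas as gpd
from shapely.geometry import LineString
from shapely.ops import substring

SEGMENT_LENGTH = 1_250  # meters, per the segment size described above

def cut_into_segments(line, length=SEGMENT_LENGTH):
    """Slice a route geometry into consecutive pieces of roughly equal length."""
    return [
        substring(line, start, min(start + length, line.length))
        for start in range(0, int(line.length), int(length))
    ]

# Toy route network: one 5 km bus route, projected in meters (EPSG:3310).
routes = gpd.GeoDataFrame(
    {"route_id": ["r1"]},
    geometry=[LineString([(0, 0), (5_000, 0)])],
    crs="EPSG:3310",
)

# Explode each route into 1,250 m segments.
segments = gpd.GeoDataFrame(
    [
        {"route_id": row.route_id, "geometry": seg}
        for row in routes.itertuples()
        for seg in cut_into_segments(row.geometry)
    ],
    geometry="geometry",
    crs=routes.crs,
)
segments["segment_id"] = segments.index

# Toy scheduled arrivals: points along the route, each with an arrival hour.
arrivals = gpd.GeoDataFrame(
    {"arrival_hour": [8, 8, 9]},
    geometry=gpd.points_from_xy([100, 600, 1_400], [0, 0, 0]),
    crs="EPSG:3310",
)

# Attach each arrival to its segment, then count arrivals per segment-hour.
joined = gpd.sjoin_nearest(arrivals, segments[["segment_id", "geometry"]])
print(joined.groupby(["segment_id", "arrival_hour"]).size())
```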

At minimum, we wanted to refresh the open data portal dataset monthly, but Python alone left us strapped, spending upwards of 6 hours in computation time. No can do! Even without expanding our compute resources, we leveraged Dask to cut the computation time by more than half.
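That “without expanding our compute resources” part is key: Dask can fan work out across the cores of a single machine. As a purely illustrative sketch (the worker and thread counts below are made up, not the project’s settings):

```python
from dask.distributed import Client, LocalCluster

# Parallelize across the cores of one machine; no new hardware needed.
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)
print(client.dashboard_link)  # live dashboard for watching tasks execute
```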

This talk is for data wonks with some Python or GIS background who aren’t afraid of wading into the details of a necessary “conceptual rewrite” to use Dask. As a recent user of Dask, I am by no means an expert, but I am a learner and explorer of new tools, and will give this talk from a Dask beginner’s perspective. It will be informative, covering concepts of data processing and wrangling with a sprinkling of syntax.

The talk will cover how we interpreted and translated the statute into Python code, identify the most time-consuming portions of the workflow, and highlight the business conditions and resource constraints that pushed us to embrace Dask. Specifically, it will cover how we used dask and dask_geopandas to extend our use of Python’s pandas and geopandas, as sketched below. It will also discuss why some limitations in dask_geopandas led us to a different conceptual rewrite of the code to fully utilize Dask.
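To give a flavor of that extension, here is a hedged sketch, not the project’s code; the file paths, column names, and partition counts are assumptions:

```python
import dask.dataframe as dd
import dask_geopandas
import geopandas as gpd

# pandas-style tabular work becomes lazy and partitioned with dask.dataframe.
stop_times = dd.read_parquet("stop_times.parquet")  # hypothetical path
arrivals = stop_times.groupby(["segment_id", "arrival_hour"]).size()

# geopandas-style geometry work parallelizes per partition with dask_geopandas.
segments = gpd.read_file("segments.geojson").to_crs("EPSG:3310")
segments_ddf = dask_geopandas.from_geopandas(segments, npartitions=8)
segment_lengths = segments_ddf.geometry.length

# Both build task graphs; nothing actually runs until .compute().
print(arrivals.compute().head())
print(segment_lengths.compute().head())
```

The appeal is that the pandas and geopandas APIs carry over almost unchanged; the work is in rethinking how the data is partitioned, which is where the “conceptual rewrite” comes in.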


Prior Knowledge Expected

No previous knowledge expected

Tiffany is a data scientist at Caltrans, working on the Integrated Travel Project team. Prior to Caltrans, she did transportation and data-related work at the City of Los Angeles, Cambridge Systematics, and LA Metro.