Designing a data pipeline
I'm building a geospatial data pipeline on Google Cloud Platform. My exposure to GCP is minimal, as I grew up in the AWS world, and it's been a breath of fresh air to get a new take on cloud computing.

The Problem


I have a nationwide (continental US) dataset of polygons, on the order of tens of millions. My org needs to run complex queries across this dataset, but only after adding key characteristics derived from heterogeneous data sources: Shapefiles, GeoTIFF, CSV... to name a few. At scale, I expect the data to total less than 100 TB.
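
To make "key characteristics" concrete, here is a rough sketch of one such derivation, outside of any pipeline framework: sampling a GeoTIFF raster at each polygon's centroid. The file names and field names are hypothetical, and fiona, shapely, and rasterio are just common choices for this kind of work, not a commitment to a stack.

    import fiona                       # reads Shapefile features as GeoJSON-like records
    import rasterio                    # reads GeoTIFF rasters
    from shapely.geometry import shape

    # Hypothetical inputs: a parcel Shapefile and an elevation GeoTIFF,
    # assumed to share the same coordinate reference system.
    with fiona.open("parcels.shp") as parcels, rasterio.open("elevation.tif") as dem:
        for feature in parcels:
            polygon = shape(feature["geometry"])
            centroid = polygon.centroid

            # Derive a characteristic by sampling the raster at the centroid.
            elevation = next(dem.sample([(centroid.x, centroid.y)]))[0]

            print(feature["properties"].get("parcel_id"), float(elevation))

Multiply that by tens of millions of polygons and a pile of other sources, and the need for distributed processing becomes clear.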

Goals


  1. Cheap - compute is on-demand, and storage costs little to nothing
  2. Fast - data processing is distributed
  3. Idempotent - I can tweak logic and re-run without worry

Solution


After evaluating a few options from AWS & GCP, I settled on GCP as my cloud provider. BigQuery is a polished product that lets non-engineers get at the data they need, with a pricing model that won't break the bank at a startup. How to get that data into BigQuery? Enter the data pipeline.
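
To give a flavor of the end state, here is the kind of spatial query the org can run once the data has landed, sketched with the BigQuery Python client. The project, dataset, table, and column names are hypothetical placeholders.

    from google.cloud import bigquery

    # Hypothetical project and table; adjust to your own environment.
    client = bigquery.Client(project="my-gcp-project")

    sql = """
        SELECT parcel_id, acreage
        FROM `my-gcp-project.geo.parcels`
        WHERE ST_INTERSECTS(geom, ST_GEOGFROMTEXT(@aoi_wkt))
    """

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            # Area of interest as WKT; a small box near Boulder, CO, for illustration.
            bigquery.ScalarQueryParameter(
                "aoi_wkt",
                "STRING",
                "POLYGON((-105.3 39.9, -105.1 39.9, -105.1 40.1, -105.3 40.1, -105.3 39.9))",
            )
        ]
    )

    for row in client.query(sql, job_config=job_config).result():
        print(row.parcel_id, row.acreage)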

Getting data into BigQuery
  1. Upload the data sources to GCS (the GCP equivalent of AWS S3)
  2. Create Apache Beam pipelines for ETL processing
  3. Run the pipelines on Google Cloud Dataflow (sketched below)
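
As a concrete illustration of steps 2 and 3, here is a minimal Beam pipeline sketch, assuming a hypothetical CSV of parcel attributes sitting in GCS and a hypothetical BigQuery table geo.parcels. For Shapefile and raster sources, Geobeam (Google's geospatial extension to Apache Beam) supplies the readers, but the skeleton is the same. The WRITE_TRUNCATE disposition is what makes re-runs idempotent (Goal 3): re-executing the pipeline replaces the table instead of appending to it.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical names; swap in your own project, bucket, and table.
    PROJECT = "my-gcp-project"
    SOURCE = "gs://my-geo-bucket/raw/parcel_attributes.csv"
    TABLE = "my-gcp-project:geo.parcels"


    def parse_line(line):
        """Turn one CSV line into a BigQuery row dict (header already skipped)."""
        # WKT geometry is the last field so its embedded commas survive the split.
        parcel_id, acreage, wkt = line.split(",", 2)
        return {"parcel_id": parcel_id, "acreage": float(acreage), "geom": wkt}


    def run():
        # Step 3: DataflowRunner executes the pipeline as a distributed Dataflow job.
        options = PipelineOptions(
            runner="DataflowRunner",
            project=PROJECT,
            region="us-central1",
            temp_location="gs://my-geo-bucket/tmp",
        )
        with beam.Pipeline(options=options) as p:
            (
                p
                | "ReadFromGCS" >> beam.io.ReadFromText(SOURCE, skip_header_lines=1)
                | "ParseCsv" >> beam.Map(parse_line)
                | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                    TABLE,
                    schema="parcel_id:STRING,acreage:FLOAT,geom:GEOGRAPHY",
                    # Replacing the table on every run is what keeps re-runs idempotent.
                    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                )
            )


    if __name__ == "__main__":
        run()
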
There are a lot of spatial capabilities that Geobeam doesn't yet support, primarily around raster analysis. I'll add some highlights once I have that piece of the puzzle worked out.