Streaming Analytics on Google Cloud for Regulated Industries

While adopting cloud computing, one of the key asks, especially within regulated industries (e.g. Finance, Healthcare, Government, Telco), is the ability to have full control over the data. In my experience with Financial Services teams, full control over data usually translates to

  1. Controlling the location of data processing and data storage systems
  2. Controlling the keys used to encrypt data in the pipeline

Having full control over location and encryption is a relatively well established mode of operation for storage systems, both on and off premises. However, managing location and encryption for streaming analytics requires a little more consideration, because multiple components are involved in the processing and, in a streaming pipeline, you need to protect data at rest as well as during processing.

This blog post is my attempt to demonstrate how a streaming analytics pipeline on Google Cloud using PubSub, Apache Beam (on the Dataflow runner), Cloud Storage and BigQuery can be executed in a single region and protected end to end using a customer-managed encryption key (CMEK). The pipeline is fairly simple, as shown below.

GCP Simple Streaming Pipeline

To execute this pipeline, I need to set up an encryption key ring and keys as well as the source, sinks and other components.

Set Project Level metadata

PROJECT_ID=XXXXXXXXXXXXXX
PROJECT_NUMBER=nnnnnnnnnnn
REGION=europe-west2
TOPIC_NAME=str-pl-pubsub-topic
DATASET_NAME=da_streaming_pipeline
TABLE_NAME_1=cc_account_info
STAGING_LOCATION=gs://da_streaming_pipeline/stage
TEMP_LOCATION=gs://da_streaming_pipeline/temp
export PROJECT_ID PROJECT_NUMBER REGION TOPIC_NAME DATASET_NAME TABLE_NAME_1 STAGING_LOCATION TEMP_LOCATION

Set up Encryption Keys

Here I will create two encryption keys: one regional key for the PubSub topic, the BigQuery dataset and the Cloud Storage bucket, and one global key for Dataflow. The global key is required because, at the time of writing, Dataflow does not have a regional endpoint in europe-west2 (London). To process data using worker nodes in europe-west2 (London), the configuration is to use the Dataflow endpoint in europe-west1 (Belgium) and place the worker nodes in a europe-west2 (London) zone, as shown in the Dataflow command later in this post. A global key is not required when the Dataflow worker nodes are in the same region as the regional endpoint.

Create a CMEK key in London which will be used by PubSub, BigQuery and Cloud Storage.

gcloud kms keyrings create str-pl-london-kr --location ${REGION}
gcloud kms keys create str-pl-london-key01 --location ${REGION} --keyring str-pl-london-kr --purpose encryption
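
To double-check that the key ring and key exist before wiring them into anything else, you can list the keys in the new key ring (an optional sanity check):

gcloud kms keys list --location ${REGION} --keyring str-pl-london-kr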

Create a Global CMEK key which should be used by Dataflow.

gcloud kms keyrings create str-pl-global-kr --location global
gcloud kms keys create str-pl-global-key01 --location global --keyring str-pl-global-kr --purpose encryption
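
Similarly, listing the versions of the global key should show the primary version in the ENABLED state, which also becomes relevant later when the keys are revoked:

gcloud kms keys versions list --location global --keyring str-pl-global-kr --key str-pl-global-key01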

Grant necessary permissions to service accounts

Grant the permission to encrypt and decrypt data using the CMEK keys to the service accounts for PubSub, BigQuery and Dataflow.

Grant permission to PubSub service account

gcloud projects add-iam-policy-binding ${PROJECT_ID} --member="serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-pubsub.iam.gserviceaccount.com" --role='roles/cloudkms.cryptoKeyEncrypterDecrypter'

Grant permission to BigQuery service accounts

gcloud kms keys add-iam-policy-binding str-pl-london-key01 --project=${PROJECT_ID} --member serviceAccount:bq-${PROJECT_NUMBER}@bigquery-encryption.iam.gserviceaccount.com --role roles/cloudkms.cryptoKeyEncrypterDecrypter --location=${REGION} --keyring=str-pl-london-kr

Grant permission to the Dataflow and Compute Engine service accounts, as required by Dataflow.

gcloud projects add-iam-policy-binding ${PROJECT_ID} --member serviceAccount:service-${PROJECT_NUMBER}@dataflow-service-producer-prod.iam.gserviceaccount.com --role roles/cloudkms.cryptoKeyEncrypterDecrypter
gcloud projects add-iam-policy-binding ${PROJECT_ID} --member serviceAccount:service-${PROJECT_NUMBER}@compute-system.iam.gserviceaccount.com --role roles/cloudkms.cryptoKeyEncrypterDecrypter

Grant permission to the Cloud Storage service account to use the Cloud KMS key

gsutil kms authorize -p ${PROJECT_ID} -k projects/${PROJECT_ID}/locations/${REGION}/keyRings/str-pl-london-kr/cryptoKeys/str-pl-london-key01
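
With the grants in place, a quick sanity check is worthwhile. The project-level bindings can be inspected by filtering the project IAM policy, and the key-level binding for BigQuery can be read directly from the key (both checks are optional):

gcloud projects get-iam-policy ${PROJECT_ID} --flatten="bindings[].members" --filter="bindings.role:roles/cloudkms.cryptoKeyEncrypterDecrypter" --format="table(bindings.members)"
gcloud kms keys get-iam-policy str-pl-london-key01 --location ${REGION} --keyring str-pl-london-kr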

Set up PubSub topic for data ingestion

Build the KMS key resource ID

KEY_ID=projects/${PROJECT_ID}/locations/${REGION}/keyRings/str-pl-london-kr/cryptoKeys/str-pl-london-key01
export KEY_ID

Create PubSub Topic

gcloud pubsub topics create ${TOPIC_NAME} --message-storage-policy-allowed-regions=${REGION} --topic-encryption-key=$KEY_ID

Confirm that the topic has been created with the CMEK key using

gcloud pubsub topics describe ${TOPIC_NAME}
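
The output should include the CMEK key and the message storage policy, roughly along these lines (with your own project ID; field order may differ):

kmsKeyName: projects/XXXXXXXXXXXXXX/locations/europe-west2/keyRings/str-pl-london-kr/cryptoKeys/str-pl-london-key01
messageStoragePolicy:
  allowedPersistenceRegions:
  - europe-west2
name: projects/XXXXXXXXXXXXXX/topics/str-pl-pubsub-topic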

Set up BigQuery Dataset and Table

Create a dataset (da_streaming_pipeline) in the London region with default encryption using the CMEK key ${KEY_ID}.

bq --location=${REGION} mk --dataset --description "dataset for streaming pipeline sink" --default_kms_key ${KEY_ID} ${PROJECT_ID}:${DATASET_NAME}
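
To confirm the dataset picked up the default CMEK key, bq show can be used; the defaultEncryptionConfiguration block in the output should reference the London key:

bq show --format=prettyjson ${PROJECT_ID}:${DATASET_NAME}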

Create a sink table for the raw data

bq query --use_legacy_sql=false "
   CREATE TABLE ${DATASET_NAME}.${TABLE_NAME_1}
       (
         name STRING,
         address STRING,
         country_code STRING,
         city STRING,
         bank_iban STRING,
         company STRING,
         credit_card INT64,
         credit_card_provider STRING,
         credit_card_expires STRING,
         date_record_created DATETIME,
         employement STRING,
         emp_id STRING,
         name_prefix STRING,
         phone_number STRING
       )
   OPTIONS(
      description='Sink table to collect data after processing by dataflow for streaming pipeline'
   )
"

Set up CMEK-protected GCS Bucket

Set up a CMEK-protected GCS bucket for the staging and temp locations

gsutil mb -c regional -l europe-west2 gs://da_streaming_pipeline
gsutil kms encryption -k projects/${PROJECT_ID}/locations/${REGION}/keyRings/str-pl-london-kr/cryptoKeys/str-pl-london-key01 gs://da_streaming_pipeline
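
Running gsutil kms encryption without the -k flag displays the bucket's current default key, which is a quick way to confirm the setting took effect:

gsutil kms encryption gs://da_streaming_pipeline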

Running a Streaming Pipeline

For this pipeline I am using the Beam Python SDK version 2.16 with Python 2.7. The pipeline is fairly simple, as described above. A data generator script written in Python generates bank customer details, including address and credit card information, using the Python package faker, and pushes them to the PubSub topic. Code for this data generator is available from my GitHub repository.

Generate Data and ingest to PubSub

./datagen.py -d pubsub -p ${PROJECT_ID} -t ${TOPIC_NAME} -n 100000

Start Data processing pipeline using Dataflow runner

The code for this pipeline is available from my GitHub repository. The core pipeline code is shown below. It is worth noting that neither the Beam pipeline code nor the pipeline options contain any specific instructions for customer-managed encryption or geo-restriction; those are source, sink and Dataflow configuration parameters.

# pipeline_options, pubsub_topic, project_name, schema and the convertToDict DoFn
# are defined earlier in the full script (see the GitHub repository).
import apache_beam as beam

with beam.Pipeline(options=pipeline_options) as pipeline:
    (
        pipeline
        # Read raw messages from the CMEK-protected PubSub topic
        | "Read PubSub messages" >> beam.io.ReadFromPubSub(topic=pubsub_topic)
        | "Decode PubSub messages" >> beam.Map(lambda x: x.decode('utf-8'))
        # Convert each message into a dictionary matching the BigQuery schema
        | "Convert to Dictionary" >> beam.ParDo(convertToDict())
        # Stream rows into the CMEK-protected BigQuery table
        | "Write to BigQuery" >> beam.io.WriteToBigQuery('{0}:da_streaming_pipeline.cc_account_info'.format(project_name), schema=schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )

The following command will start a streaming pipeline using Google Dataflow runner.

python -m streaming_simple_pipeline \
 --project_name XXXXXXXXXXXXXX \
 --pubsub_topic projects/XXXXXXXXXXXX/topics/str-pl-pubsub-topic \
 --streaming \
 --runner Dataflow \
 --staging_location gs://da_streaming_pipeline/stage \
 --temp_location gs://da_streaming_pipeline/temp \
 --region europe-west1 --zone europe-west2-b \
 --dataflow_kms_key projects/XXXXXXXXXXXX/locations/global/keyRings/str-pl-global-kr/cryptoKeys/str-pl-global-key01
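
The console view described next shows the same information, but the job can also be confirmed from the command line. Note that jobs are listed against the europe-west1 regional endpoint even though the workers run in europe-west2:

gcloud dataflow jobs list --region=europe-west1 --status=active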

From the Cloud Dataflow console, you can see that the pipeline is using a customer-managed key and is executing in a particular zone in the europe-west2 (London) GCP region.

Dataflow with CMEK and Geo-Restriction

Revoking Keys

To cut off access to the data, the encryption keys can be revoked (here, by disabling the key versions), which causes the Dataflow pipeline to fail. Additionally, any attempt to access the data in the sinks returns an error.

Revoke keys

gcloud kms keys versions disable 1 --location europe-west2 --keyring str-pl-london-kr --key str-pl-london-key01

gcloud kms keys versions disable 1 --location global --keyring str-pl-global-kr --key str-pl-global-key01

Dataflow Pipeline error:

Error_Message_DF

Error returned from BigQuery:

Error_Message_BQ
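
If this was only a test, access can be restored by re-enabling the key versions; in my understanding the change may take a little while to propagate to the dependent services:

gcloud kms keys versions enable 1 --location europe-west2 --keyring str-pl-london-kr --key str-pl-london-key01

gcloud kms keys versions enable 1 --location global --keyring str-pl-global-kr --key str-pl-global-key01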

Other observations

Performance is a common concern when encrypting and decrypting data. In my small test, I did not observe any performance difference between using CMEK and a Google-managed key for the pipeline. However, performance will depend to some extent on the type and volume of data and on the workload, so it is worth spending some testing cycles on performance testing a pipeline that uses CMEK.

I hope this post is useful for anyone considering developing and deploying an Apache Beam based streaming pipeline on Google Cloud with geo-restriction and a need to protect data using CMEK. Thanks for reading.