While adopting cloud computing, one of the key asks, especially within regulated industries (e.g. Finance, Healthcare, Government, Telco), is the ability to have full control over the data. In my experience with Financial Services teams, full control over data usually translates to:
- Controlling the location of data processing and data storage systems
- Controlling the keys used to encrypt data in the pipeline
Having full control over location and encryption is a relatively well-established mode of operation for storage systems, both on and off premises. However, managing location and encryption for streaming analytics requires a little more consideration, because a streaming analytics pipeline involves multiple components and the data needs to be protected at rest as well as during processing.
This blog post is my attempt to demonstrate how a streaming analytics pipeline on Google Cloud using PubSub, Apache Beam (on the Dataflow runner), Cloud Storage and BigQuery can be executed in a single region and protected end to end using a Customer Managed Encryption Key (CMEK). The pipeline is fairly simple, as shown below.
To execute this pipeline, I need to set up an encryption key ring and keys, as well as the source, sinks and other components.
Set project-level metadata
PROJECT_ID=XXXXXXXXXXXXXX
PROJECT_NUMBER=nnnnnnnnnnn
REGION=europe-west2
TOPIC_NAME=str-pl-pubsub-topic
DATASET_NAME=da_streaming_pipeline
TABLE_NAME_1=cc_account_info
STAGING_LOCATION=gs://da_streaming_pipeline/stage
TEMP_LOCATION=gs://da_streaming_pipeline/temp
export PROJECT_ID PROJECT_NUMBER REGION TOPIC_NAME DATASET_NAME TABLE_NAME_1 STAGING_LOCATION TEMP_LOCATION
Set up Encryption Keys
Here I will create two encryption keys: a regional key for the PubSub topic and the BigQuery dataset, and a global key for Dataflow. The global key is required because Dataflow does not have a regional endpoint in europe-west2 (London). To process data with worker nodes in europe-west2 (London), the configuration uses the Dataflow endpoint in europe-west1 (Belgium) with the worker zone in europe-west2 (London). A global key is not required when the Dataflow worker nodes are in the same region as the regional endpoint.
Create a CMEK key in London which should be used by PubSub and BigQuery.
gcloud kms keyrings create str-pl-london-kr --location ${REGION}
gcloud kms keys create str-pl-london-key01 --location ${REGION} --keyring str-pl-london-kr --purpose encryption
Create a Global CMEK key which should be used by Dataflow.
gcloud kms keyrings create str-pl-global-kr --location global
gcloud kms keys create str-pl-global-key01 --location global --keyring str-pl-global-kr --purpose encryption
Grant necessary permissions to service accounts
Grant the PubSub, BigQuery, Dataflow and Cloud Storage service accounts permission to encrypt and decrypt data using the CMEK keys.
Grant permission to the PubSub service account
gcloud projects add-iam-policy-binding ${PROJECT_ID} --member="serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-pubsub.iam.gserviceaccount.com" --role='roles/cloudkms.cryptoKeyEncrypterDecrypter'
Grant permission to the BigQuery service account
gcloud kms keys add-iam-policy-binding --project=${PROJECT_ID} --member serviceAccount:bq-${PROJECT_NUMBER}@bigquery-encryption.iam.gserviceaccount.com --role roles/cloudkms.cryptoKeyEncrypterDecrypter --location=${REGION} --keyring=str-pl-london-kr str-pl-london-key01
Grant permission to the Dataflow and Compute Engine service accounts, as required by Dataflow.
gcloud projects add-iam-policy-binding ${PROJECT_ID} --member serviceAccount:service-${PROJECT_NUMBER}@dataflow-service-producer-prod.iam.gserviceaccount.com --role roles/cloudkms.cryptoKeyEncrypterDecrypter
gcloud projects add-iam-policy-binding ${PROJECT_ID} --member serviceAccount:service-${PROJECT_NUMBER}@compute-system.iam.gserviceaccount.com --role roles/cloudkms.cryptoKeyEncrypterDecrypter
Grant permission to the Cloud Storage service account to use the Cloud KMS key
gsutil kms authorize -p ${PROJECT_ID} -k projects/${PROJECT_ID}/locations/${REGION}/keyRings/str-pl-london-kr/cryptoKeys/str-pl-london-key01
Set up PubSub topic for data ingestion
Build the key resource ID
KEY_ID=projects/${PROJECT_ID}/locations/${REGION}/keyRings/str-pl-london-kr/cryptoKeys/str-pl-london-key01
export KEY_ID
Create the PubSub topic
gcloud pubsub topics create ${TOPIC_NAME} --message-storage-policy-allowed-regions=${REGION} --topic-encryption-key=$KEY_ID
Confirm that the topic is using the key
gcloud pubsub topics describe ${TOPIC_NAME}
Set up BigQuery Dataset and Table
Create a dataset (da_streaming_pipeline) in London region with default encryption using CMEK key ${KEY_ID}.
bq --location=${REGION} mk --dataset --description "dataset for streaming pipeline sink" --default_kms_key ${KEY_ID} ${PROJECT_ID}:${DATASET_NAME}
Create Sink TABLE for Raw data
bq query --use_legacy_sql=false "
CREATE TABLE ${DATASET_NAME}.${TABLE_NAME_1} (
  name STRING,
  address STRING,
  country_code STRING,
  city STRING,
  bank_iban STRING,
  company STRING,
  credit_card INT64,
  credit_card_provider STRING,
  credit_card_expires STRING,
  date_record_created DATETIME,
  employement STRING,
  emp_id STRING,
  name_prefix STRING,
  phone_number STRING
)
OPTIONS(
  description='Sink table to collect data after processing by dataflow for streaming pipeline'
)"
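To confirm that the dataset picked up the CMEK as its default encryption key, a quick check along these lines (sketched with the google-cloud-bigquery client library) should print the key resource name:

# Sketch: check the dataset's default encryption configuration.
from google.cloud import bigquery

client = bigquery.Client(project='data-analytics-bk')
dataset = client.get_dataset('da_streaming_pipeline')
# Prints the full resource name of the default CMEK key configured on the dataset.
print(dataset.default_encryption_configuration.kms_key_name)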
Set up CMEK-protected GCS Bucket
Set up CMEK-protected GCS buckets for the Staging and Temp locations
gsutil mb -c regional -l europe-west2 gs://da_streaming_pipeline
gsutil kms encryption -k projects/${PROJECT_ID}/locations/${REGION}/keyRings/str-pl-london-kr/cryptoKeys/str-pl-london-key01 gs://da_streaming_pipeline
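Similarly, a rough check of the bucket's default key, sketched with the google-cloud-storage client library:

# Sketch: check the bucket's default Cloud KMS key.
from google.cloud import storage

client = storage.Client(project='data-analytics-bk')
bucket = client.get_bucket('da_streaming_pipeline')
print(bucket.default_kms_key_name)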
Running a Streaming Pipeline
For this pipeline I am using the Beam Python SDK version 2.16 with Python 2.7. The pipeline is fairly simple, as described above. A data generator script written in Python generates bank customer details, including address and credit card information, using the Python package faker, and pushes them to the PubSub topic. The code for this data generator is available from my GitHub repository; a simplified sketch follows the command below.
Generate Data and ingest to PubSub
./datagen.py -d pubsub -p data-analytics-bk -t str-pl-pubsub-topic -n 100000
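The generator in the repository is the reference implementation; the sketch below only illustrates the idea, assuming the faker and google-cloud-pubsub packages and covering a subset of the sink table's fields.

# Minimal sketch (not the repository code): generate fake customer records
# with faker and publish them as JSON messages to the PubSub topic.
import datetime
import json

from faker import Faker
from google.cloud import pubsub_v1


def generate_record(fake):
    # Only a subset of the sink table's columns, for illustration.
    return {
        'name': fake.name(),
        'address': fake.address(),
        'country_code': fake.country_code(),
        'city': fake.city(),
        'bank_iban': fake.iban(),
        'company': fake.company(),
        'credit_card': int(fake.credit_card_number()),
        'credit_card_provider': fake.credit_card_provider(),
        'credit_card_expires': fake.credit_card_expire(),
        'date_record_created': datetime.datetime.now().isoformat(),
        'phone_number': fake.phone_number(),
    }


def publish_records(project_id, topic_name, num_records):
    fake = Faker()
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_name)
    for _ in range(num_records):
        payload = json.dumps(generate_record(fake)).encode('utf-8')
        publisher.publish(topic_path, data=payload)


if __name__ == '__main__':
    publish_records('data-analytics-bk', 'str-pl-pubsub-topic', 100)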
Start Data processing pipeline using Dataflow runner
The code for this pipeline is available from my GitHub repository, and the core pipeline code is shown below. It is worth noting that neither the Beam pipeline code nor the pipeline options contain any specific instructions to deal with customer managed encryption or geo-restriction; those are source, sink and Dataflow configuration parameters.
with beam.Pipeline(options=pipeline_options) as pipeline:
    (
        pipeline
        | "Read PubSub messages" >> beam.io.ReadFromPubSub(topic=pubsub_topic)
        | "Decode PubSub messages" >> beam.Map(lambda x: x.decode('utf-8'))
        | "Convert to Dictionary" >> beam.ParDo(convertToDict())
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            '{0}:da_streaming_pipeline.cc_account_info'.format(project_name),
            schema=schema,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
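The convertToDict transform and the schema value are defined in the repository; roughly, they could look something like this (an illustrative sketch, not the exact code):

import json

import apache_beam as beam


class convertToDict(beam.DoFn):
    # Parse the decoded JSON message into a dictionary keyed by the
    # BigQuery column names so WriteToBigQuery can map fields to columns.
    def process(self, element):
        yield json.loads(element)


# Comma-separated 'field:TYPE' string accepted by WriteToBigQuery.
schema = ('name:STRING,address:STRING,country_code:STRING,city:STRING,'
          'bank_iban:STRING,company:STRING,credit_card:INTEGER,'
          'credit_card_provider:STRING,credit_card_expires:STRING,'
          'date_record_created:DATETIME,employement:STRING,emp_id:STRING,'
          'name_prefix:STRING,phone_number:STRING')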
The following command will start a streaming pipeline using the Google Dataflow runner.
python -m streaming_simple_pipeline \
  --project_name XXXXXXXXXXXXXX \
  --pubsub_topic projects/XXXXXXXXXXXX/topics/str-pl-pubsub-topic \
  --streaming \
  --runner Dataflow \
  --staging_location gs://da_streaming_pipeline/stage \
  --temp_location gs://da_streaming_pipeline/temp \
  --region europe-west1 \
  --zone=europe-west2-b \
  --dataflow_kms_key projects/XXXXXXXXXXXX/locations/global/keyRings/str-pl-global-kr/cryptoKeys/str-pl-global-key01
From the Cloud Dataflow console, you can see that the pipeline is using a Customer Managed Key and that it is executing in a particular zone in the europe-west2 (London) GCP region.
Revoking Keys
To stop access to the data, the encryption keys can be revoked, which will cause the Dataflow pipeline to fail. Additionally, any attempt to access the data in the sink will return an error.
Revoke keys
gcloud kms keys versions disable 1 --location europe-west2 --keyring str-pl-london-kr --key str-pl-london-key01
gcloud kms keys versions disable 1 --location global --keyring str-pl-global-kr --key str-pl-global-key01
Dataflow Pipeline error:
Error returned from BigQuery:
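To see the BigQuery error yourself, one option is to attempt a simple query against the sink table after the key version has been disabled. As a rough sketch (using the google-cloud-bigquery client library), something along these lines should surface the error:

# Sketch: with the CMEK key version disabled, reading the protected table
# is expected to fail with an access error from BigQuery.
from google.cloud import bigquery

client = bigquery.Client(project='data-analytics-bk')
query = 'SELECT COUNT(*) AS n FROM `da_streaming_pipeline.cc_account_info`'
try:
    rows = list(client.query(query).result())
    print(rows[0].n)
except Exception as err:  # e.g. google.api_core.exceptions.BadRequest
    print('Query failed as expected: {}'.format(err))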
Other observations
Performance is usually one of the concerns when encrypting and decrypting data. In my small test, I did not observe any performance difference between using CMEK and a Google-managed key for the pipeline. However, performance will depend to some extent on the type and volume of data and on the workload, so it is worth spending some testing cycles on performance testing a pipeline when using CMEK.
I hope this post is useful for anyone considering developing and deploying an Apache Beam based streaming pipeline on Google Cloud with geo-restriction and a need to protect data using CMEK. Thanks for reading.