Spell's custom instance profile (CIP) is a nifty feature that allows you to configure your Spell cluster so that the runs you execute and workspaces you create on Spell have access to other resources within your AWS or GCP account. For example, if you use AWS Redshift as the source of truth for analytics at your organization, you can configure a CIP on your cluster that gives runs access to your Redshift cluster.
This helps make integrating Spell with your existing infrastructure easy. In this blog post, we'll learn about using custom instance profiles to configure access on GCP, using GCP BigQuery as our example. For Spell users on AWS, refer to our previous blog post on accessing AWS.
A quick primer on GCP's security model
First, let’s take a moment to review the GCP security model, and how Spell interacts with it.
The starting point for every team on the Spell for Teams or Spell for Enterprise plan is creating a cluster. A cluster is a group of resources that Spell creates and uses to orchestrate itself within your GCP account.
By default, Spell clusters deploy into a brand new Spell-only VPC (virtual private cloud) we create for you as part of the cluster setup flow. All Spell runs and workspaces are orchestrated within this VPC. VPCs are how GCP provides network isolation: by default, resources can only see and have access to things located on the same VPC, and cannot touch resources located in other VPCs (unless you use VPC peering — an advanced topic we won’t be covering here).
Each VPC contains one or more subnets, each of which is allocated a CIDR block of internal IP addresses that services running within that subnet may use. Spell will create a single subnet for your chosen cluster region (e.g. us-central1), and every GCE instance (virtual machine) Spell creates (or reassigns) will be located in a random availability zone (e.g. us-central1-a, us-central1-b, us-central1-c, us-central1-f) within that region. Zone selection is subject to restrictions on instance type availability and machine availability: instances that fail to create in one availability zone are retried in another AZ, and if no AZ can satisfy the request, the request is held in a queue until one can. The machines Spell creates are given an ephemeral private IP address, so they may take on any IP address from within the subnet address range.
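To make the subnet addressing concrete, here is a minimal sketch using Python's standard ipaddress module. The CIDR block below is an assumption (it is the range an auto mode VPC typically assigns to us-central1); substitute the range of your own cluster's subnet.

```python
import ipaddress

# Assumption: 10.128.0.0/20 is the range of the cluster's subnet.
# Look up the real value for your cluster before relying on this.
SUBNET = ipaddress.ip_network("10.128.0.0/20")


def in_subnet(ip, subnet=SUBNET):
    """Return True if the private IP falls inside the subnet's CIDR block."""
    return ipaddress.ip_address(ip) in subnet


print(SUBNET.num_addresses)      # total number of addresses in the block
print(in_subnet("10.128.0.42"))  # an address inside the range
print(in_subnet("10.132.0.42"))  # an address from a different range
```

Because the private IP is ephemeral, any rule you write for Spell machines should target the whole range rather than an individual address.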
Network egress from and ingress to the machine executing the Spell run is dictated by firewall rules attached to the VPC. The default firewall rules for an "automatic mode" VPC allow most forms of ingress and egress within the VPC, as well as SSH ingress traffic from outside of it. Thus, to authorize network traffic from Spell runs to non-Spell resources within the same VPC, you shouldn’t need to do anything special. Note however that if you have modified the firewall rules in the Spell VPC to restrict within-cluster traffic, this may no longer be true.
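The following is a rough sketch of how such ingress rules are evaluated. It is a simplified model and not the GCP API: the two rules are assumptions modeled on GCP's pre-populated default-allow-internal and default-allow-ssh rules, and the names IngressRule and ingress_allowed are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    name: str
    source_range: str  # CIDR block the traffic must originate from
    tcp_ports: range   # TCP ports the rule opens


# Assumption: simplified stand-ins for the default internal and SSH rules.
RULES = [
    IngressRule("allow-internal", "10.128.0.0/9", range(0, 65536)),
    IngressRule("allow-ssh", "0.0.0.0/0", range(22, 23)),
]


def ingress_allowed(source_ip, port, rules=RULES):
    """Return True if any rule admits TCP traffic from source_ip to port."""
    src = ipaddress.ip_address(source_ip)
    return any(
        src in ipaddress.ip_network(rule.source_range) and port in rule.tcp_ports
        for rule in rules
    )


print(ingress_allowed("10.128.0.42", 5432))  # traffic from inside the VPC
print(ingress_allowed("203.0.113.7", 5432))  # traffic from outside the VPC
print(ingress_allowed("203.0.113.7", 22))    # SSH from outside the VPC
```

If you have tightened the internal rule, the first check is the one that would start failing for Spell runs.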
Network security rules alone are not sufficient to authorize access to resources in GCP, as the GCE instance backing your Spell run will also need to have a service account set that gives that machine permission to access that resource. Spell calls this a custom instance profile (named after the equivalent AWS feature, the instance profile), and it allows you to modify and manage this linked profile using the Spell CLI.
In summary, to access a GCP resource (like a BigQuery table) from a Spell run, you need to:
1. Create a service account that grants access to the API you want to use, and attach it to your Spell cluster using a custom instance profile.
2. Launch that resource in the same VPC that Spell is deployed to.
3. Configure that resource to allow network ingress from the subnet IP address ranges that Spell uses.
Luckily, requirement 3 is almost certainly satisfied for you automatically, assuming you launch Spell into its own VPC (or into an existing VPC with the default firewall rules set). In practice, to give Spell runs access to other services within your GCP account, you’ll usually only need to worry about requirements 1 and 2.
Attaching a custom instance profile to your cluster
To attach a custom instance profile to our cluster, we begin by creating the service account that we will attach. This can be done using the gcloud CLI like so:
$ PROJECT_ID=$(gcloud config get-value project)
$ gcloud iam service-accounts create \
    "aleksey-demo-service-account" \
    --description="Service account for the GCP CIP demo"
This creates a new service account in your currently authenticated GCP project. However, this service account is empty: it does not yet have any permissions assigned to it. On GCP, permissions are usually added using IAM roles, which are collections of related permissions that may be freely attached to and detached from a service account as needed. Since we will be using BigQuery as our test case, let’s now make this service account a BigQuery admin:
$ gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:aleksey-demo-service-account@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/bigquery.admin"
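If you want to double-check that the binding took effect, one option is to inspect the project's IAM policy, for example the JSON printed by `gcloud projects get-iam-policy $PROJECT_ID --format=json`. The sketch below checks a policy document of that shape for a given member and role. The sample policy and the project ID my-project are made up for illustration, and has_role is a hypothetical helper.

```python
import json

# Hypothetical sample, shaped like the JSON output of
#   gcloud projects get-iam-policy $PROJECT_ID --format=json
SAMPLE_POLICY = json.loads("""
{
  "bindings": [
    {
      "role": "roles/bigquery.admin",
      "members": [
        "serviceAccount:aleksey-demo-service-account@my-project.iam.gserviceaccount.com"
      ]
    }
  ],
  "version": 1
}
""")


def has_role(policy, member, role):
    """Return True if the policy binds the given role to the given member."""
    return any(
        binding.get("role") == role and member in binding.get("members", [])
        for binding in policy.get("bindings", [])
    )


MEMBER = (
    "serviceAccount:"
    "aleksey-demo-service-account@my-project.iam.gserviceaccount.com"
)
print(has_role(SAMPLE_POLICY, MEMBER, "roles/bigquery.admin"))
print(has_role(SAMPLE_POLICY, MEMBER, "roles/storage.admin"))
```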
Finally, we attach this service account to our Spell cluster using the spell cluster set-instance-permissions command:
$ spell cluster set-instance-permissions \
    --iam-service-account "aleksey-demo-service-account@$PROJECT_ID.iam.gserviceaccount.com"
If you navigate to the web console and check the Custom Instance Profile line in the Cluster Details card, you should see that this custom instance profile is now set!
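You can also check this from inside a run or workspace by asking the GCE metadata server which service account the machine is using. This is a minimal sketch, assuming the metadata server is reachable from the environment your code executes in; build_request and attached_service_account are hypothetical helper names.

```python
import urllib.request

# Standard GCE metadata endpoint for the instance's default service account.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)


def build_request(url=METADATA_URL):
    """Build the metadata request; the server requires this header."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def attached_service_account(timeout=5):
    """Return the email of the service account attached to this machine.

    Only works on a GCE instance; elsewhere the hostname will not resolve.
    """
    with urllib.request.urlopen(build_request(), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()
```

Calling attached_service_account() in a notebook cell should return the email of the service account you attached above if the custom instance profile is in effect.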
Custom instance profile limitations
It’s important to note that CIP permissions are cluster-wide. Every user in the organization will be granted these permissions. This is a flat, easy-to-manage permissions scheme which works well for small teams, but it may not be a good fit for larger enterprise teams that need fine-grained control over their user permissions.
For that, we recommend using a different design pattern: implementing a custom authentication package. This is discussed in more detail in a previous blog post: "Fine-grained access control in Spell using private pip packages".
Custom instance profiles in action — a BigQuery example
In the previous section we created a custom instance profile with BigQuery permissions and attached it to our cluster. In this step, let’s take the follow-up step of actually using this access to run a demo BigQuery job.
BigQuery is Google’s massively parallel columnar database-as-a-service offering. It competes with services like Apache Spark and Amazon Redshift to offer analytics queries on large volumes of data with low query times. BigQuery is based on Dremel, an internal Google technology and the subject of one of the most widely read papers in data engineering. Apache Drill is an open-source implementation of the same technology.
BigQuery is an API-driven service, so there is no need to worry about cluster setup or cluster management when using it. Instead, you use the bq command-line client (or a client library, such as the google-cloud-bigquery Python package used in the demo below) to talk to BigQuery directly. Queries are submitted directly from the client. Data can be loaded into BigQuery from files located on local disk or from files in Google Cloud Storage the same way.
Try it yourself now. Use spell jupyter to spin up a new Spell workspace in your cluster:
$ spell jupyter --lab --pip google-cloud-bigquery bigquery-demo
Then run the following code in a code cell in the notebook:
from google.cloud import bigquery

client = bigquery.Client()

QUERY = (
    'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
    'WHERE state = "TX" '
    'LIMIT 10')
query_job = client.query(QUERY)  # API request
rows = query_job.result()  # Waits for query to finish

for row in rows:
    print(row.name)
This should print out a list of names from one of GCP's test datasets:
Frances
Alice
Beatrice
Ella
Gertrude
Josephine
Lula
Blanche
Marjorie
Christine
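The query above hard-codes the state. As a variation, here is a sketch that passes the state as a named query parameter using the same client library. The function name names_for_state is hypothetical, and the import is done lazily inside the function only so that the snippet loads even where google-cloud-bigquery is not installed.

```python
# SQL with a named parameter (@state) instead of a hard-coded literal.
QUERY = (
    "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` "
    "WHERE state = @state "
    "LIMIT 10"
)


def names_for_state(state):
    """Return up to 10 names for a two-letter state code such as "TX"."""
    # Lazy import: in the workspace above the package is installed via --pip.
    from google.cloud import bigquery

    # With no explicit credentials, the client falls back to the
    # credentials available in the environment.
    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("state", "STRING", state),
        ]
    )
    rows = client.query(QUERY, job_config=job_config).result()
    return [row.name for row in rows]
```

In a notebook cell, names_for_state("TX") runs the same query as before, with the state supplied at call time.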
That concludes our demo!
To clean up the custom instance profile resources, use the following commands:
$ spell cluster unset-instance-permissions
# PROJECT_ID is your GCP project ID, as set earlier. You can look up the
# full service account email using:
# $ gcloud iam service-accounts list
$ gcloud iam service-accounts delete \
    aleksey-demo-service-account@$PROJECT_ID.iam.gserviceaccount.com
In this article, we saw an example of using a custom instance profile on Spell, as applied to GCP BigQuery. Although the specific APIs you need for a different GCP service may differ, the same general ideas that apply to BigQuery should apply to whatever other GCP service you are interested in accessing.
In the meantime, go forth and prosper. 🖖