Accessing AWS services from within Spell using custom instance profiles

Spell's custom instance profile (CIP) is a nifty feature that allows you to configure your Spell cluster such that the runs you execute and workspaces you create on Spell have access to other resources within your AWS or GCP account. For example, if you use AWS Redshift as the source of truth for analytics at your organization, you can configure a CIP on that cluster giving runs access to your Redshift cluster.

This helps make integrating Spell with your existing infrastructure easy. In this blog post, we'll learn about using custom instance profiles to configure access on AWS, using AWS Redshift as our example. In a future blog post we'll cover how this gets done on GCP instead.

A quick primer on AWS's security model

First, let’s take a moment to review the AWS security model and how Spell interacts with it.

The starting point for every team on the Spell for Teams or Spell for Enterprise plan is creating a cluster. A cluster is a group of resources that Spell creates and uses to orchestrate itself within your AWS (or GCP) account.

By default, Spell clusters deploy into a brand new, Spell-only VPC (virtual private cloud) that we create for you as part of the cluster setup flow. All Spell runs and workspaces are orchestrated within this VPC. VPCs are AWS's way of providing network isolation: by default, resources can only see and have access to things located on the same VPC, and cannot touch resources located in other VPCs (unless you use VPC peering — an advanced topic we won't be covering here). Each VPC is allocated a CIDR block of IP addresses; resources which have IP addresses, like EC2 machines, are assigned an IP address from within that range.
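
If you're curious exactly which addresses your Spell VPC covers, you can look up its CIDR block with the AWS CLI; a quick sketch (replace the VPC ID with the one shown on your organization's cluster page):

# replace the vpc id with your own
$ aws ec2 describe-vpcs \
   --vpc-ids vpc-0ebd894999f0de3f4 \
   --output json \
   --query "Vpcs[0].CidrBlock"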

You can also choose to deploy Spell into one of your existing VPCs instead. This will make it easier for you to access other resources in your AWS account from within Spell, but it can complicate the setup process.

Once you've deployed a cluster, navigating to the Cluster page in the web console will show you a high-level summary of your cluster configuration.

AWS splits its compute resources into geographical regions (e.g. us-west-2, as here), which are then further subdivided into availability zones (e.g. us-west-2a, us-west-2b, us-west-2c, us-west-2d). As you see here, Spell creates one subnet for each availability zone in your chosen region. At run creation time, Spell creates (or reassigns) an EC2 instance within a randomly chosen subnet to service that run (subject to restrictions on instance type and machine availability; instances that fail to create in one availability zone are retried in another; if no AZ can satisfy the request, the request is held in queue until one can).

Network egress from and ingress to the machine executing the Spell run is dictated by a security group attached to the instance. Spell uses a very lenient security group, allowing all inbound and all outbound traffic originating within the group. Thus, to authorize network traffic from Spell runs to non-Spell resources within the same VPC, you need only configure the security group on the non-Spell resource appropriately. We'll see a concrete example of this in a minute.

However, network security rules alone are not sufficient to authorize access to resources in AWS, as the EC2 instance backing your Spell run will also need to have an IAM role set that gives that machine permission to access that resource. This is where instance profiles come in: they enable you to attach IAM roles you've created to Spell runs executed within your cluster, giving those runs access to the AWS APIs and resources this role enables in the process. Again, we'll see a concrete example of this in a minute.

In summary, to access an AWS resource (like a Redshift or EMR cluster) from a Spell run, you need to:

  1. Create a role that grants access to the API you want to use, and attach it to your Spell cluster using a custom instance profile.
  2. Launch that resource in the same VPC that Spell is deployed to.
  3. Configure that resource to allow network ingress from the subnet IP address ranges that Spell uses.

In the next section, we'll cover Step 1. Later on, we'll see how to execute steps 2 and 3, using Amazon Redshift, AWS's column-oriented SQL-database-as-a-service product, as our test case.

Attaching a custom instance profile to your cluster

To attach a custom instance profile to our cluster, we begin by creating the role that we will attach.

Write the following policy file to aws_demo_custom_role_assume_role_policy.json on your local disk:

{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Sid": "",
     "Effect": "Allow",
     "Principal": {
       "Service": "ec2.amazonaws.com"
     },
     "Action": "sts:AssumeRole"
   }
 ]
}

A policy document is a JSON fragment containing one or more authorization statements. AWS has a very specific (and often confusing) syntax for these. In this case, we are creating a document with a single statement: one allowing the ec2.amazonaws.com principal to perform the sts:AssumeRole action.

In this specific case, ec2.amazonaws.com is a special "magic" principal granting access to this role to all EC2 instances in this AWS account.

This document is part of the input to the aws iam create-role CLI command:

$ aws iam create-role \
   --role-name demo-custom-creds-role \
   --assume-role-policy-document file://aws_demo_custom_role_assume_role_policy.json

After running this command, the demo-custom-creds-role is available, and any EC2 instance has permission to assume it. However, the role still isn't allowed to actually do anything, because it does not yet have a permissions set. We can fix that now using aws iam attach-role-policy:

$ aws iam attach-role-policy \
   --policy-arn arn:aws:iam::aws:policy/AmazonRedshiftFullAccess \
   --role-name demo-custom-creds-role

This command gives the role (and, transitively, any EC2 machine that assumes it) administrative access to Amazon Redshift.
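
AmazonRedshiftFullAccess is convenient for a demo, but if you'd rather grant something narrower, you could attach an inline policy scoped to just the actions you need instead. A sketch using a hypothetical redshift_describe_policy.json that allows only the read-only Describe* actions:

$ cat redshift_describe_policy.json
{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Effect": "Allow",
     "Action": "redshift:Describe*",
     "Resource": "*"
   }
 ]
}
$ aws iam put-role-policy \
   --role-name demo-custom-creds-role \
   --policy-name demo-redshift-describe-only \
   --policy-document file://redshift_describe_policy.json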

However, we're not done yet! Unfortunately it is not possible to assign an IAM role to an EC2 machine directly. AWS has a weird legacy shim, the instance profile, for doing that.

An instance profile is a unique AWS resource which encapsulates an IAM role (it may also be empty). The EC2 API requires using an instance profile, so we'll have to now create one and assign it the IAM role we just created:

$ aws iam create-instance-profile \
   --instance-profile-name demo-custom-creds-instance-profile
$ aws iam add-role-to-instance-profile \
   --instance-profile-name demo-custom-creds-instance-profile \
   --role-name demo-custom-creds-role

That takes us to the last step of the process: assigning this profile to our Spell cluster. We may do so using the spell cluster set-instance-permissions command, using the ARNs (Amazon Resource Names) assigned to our role and instance profile as input:

$ spell cluster set-instance-permissions \
   --iam-role-arn arn:aws:iam::366388869580:role/demo-custom-creds-role \
   --iam-instance-profile-arn arn:aws:iam::366388869580:instance-profile/demo-custom-creds-instance-profile

We can confirm that everything worked as expected by visiting the web console and confirming that the Custom Instance Profile section is now set to point to the profile we provided to Spell:
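
If you prefer the command line, you can also sanity-check the instance profile and its attached role directly with the AWS CLI:

$ aws iam get-instance-profile \
   --instance-profile-name demo-custom-creds-instance-profile \
   --output json \
   --query "InstanceProfile.Roles[].RoleName"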

From this point forward, every cloud machine we launch in this cluster will have access to the Redshift admin API! To test if this is true, try running the following command (replacing the region with the one your cluster is deployed in) — you should get an empty list, not a "Not Authorized" error, in response:

$ spell run \
 --pip awscli \
 -- aws --region us-west-2 redshift describe-clusters

This works because under the hood, Spell is now passing these credentials to the EC2 machine initialization API, which publishes this role to processes running within that machine using the EC2 metadata service. You can learn about this process in all its gory detail by reading the following page in the AWS docs: "Using an IAM role to grant permissions to applications running on Amazon EC2 instances".
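
If you want to see this mechanism in action, you can query the metadata service yourself from inside a run; a small sketch using the role name from this demo (assuming curl is available in the run image, the endpoint returns the temporary credentials AWS vends for the role):

$ spell run \
   -- curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/demo-custom-creds-role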

Custom instance profile limitations

Setting an appropriately configured custom instance profile on the cluster is sufficient for granting access to API-based AWS services that don't require a persistent connection to another machine.

For example, we may use a CIP to grant access to the Amazon Redshift cluster management API (as here), to download from and upload to Amazon S3, or to submit execution steps to Amazon EMR. Here's one example of a command you could spell run to check that yours is configured correctly:

$ spell run --pip awscli -- aws sts get-caller-identity

Other things — like SSHing into an EMR master node, or connecting a psql client to a Redshift cluster — still won't work. These types of actions are subject to network traffic rules. In the next section, we'll see how these can be configured to grant Spell runs access.

It's important to note that CIP permissions are cluster-wide. Every user in the organization will be granted these permissions. This is a flat, easy-to-manage permissions scheme which works well for small teams, but it may not be a good fit for larger enterprise teams that need fine-grained control over their user permissions.

For that, we recommend using a different design pattern: implementing a custom authentication package. This is discussed in more detail in the following blog post: "Fine-grained access control in Spell using private pip packages".

Configuring network access — a Redshift example

Finally, let's now see how we can go about configuring network access for services that need it. We'll use Amazon Redshift for our example. Redshift is Amazon's columnar-database-as-a-service product, targeted at data warehouse and data analytics needs. It's a hard fork of Postgres, the ubiquitous open source SQL database, and hence uses some of the same tooling — including psql, the Postgres REPL client.

psql is one of the best and easiest ways to interact with a Redshift cluster. However, it doesn't work out of the box, because psql needs to establish a persistent network connection (the Postgres wire protocol over TCP) to the cluster's leader node, which our AWS network configuration does not yet allow.

In this section, we'll see how this works in practice by walking through an example Redshift cluster deploy.

We'll begin by creating a brand new security group, and authorizing all ingress into that security group on port 5439 (this is Redshift's default port number, and the one we'll use in this demo). We're creating a brand new security group here, instead of messing with an existing one, for cleanliness.

# replace the vpc id here with your own
# to do so, visit your organization's cluster page
# in the web console
$ aws ec2 create-security-group \
   --group-name demo-redshift-security-group \
   --description "VCP security group for the demo DynamoDB cluster." \
   --vpc-id vpc-0ebd894999f0de3f4
-------------------------------------
|        CreateSecurityGroup        |
+----------+------------------------+
|  GroupId |  sg-0aee81cd602b23eb3  |
+----------+------------------------+
# copy the security group id into the next command
$ aws ec2 authorize-security-group-ingress \
   --group-id sg-0aee81cd602b23eb3 \
   --protocol tcp \
   --port 5439 \
   --cidr 0.0.0.0/0

This single ingress rule is all you need to give Spell runs network access to this cluster!
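
If opening the port to 0.0.0.0/0 makes you uneasy, you could scope the rule down instead. Since the cluster is not publicly accessible, one option is to restrict ingress to your VPC's CIDR block; a sketch (10.0.0.0/16 is a placeholder, substitute your VPC's actual CIDR):

$ aws ec2 authorize-security-group-ingress \
   --group-id sg-0aee81cd602b23eb3 \
   --protocol tcp \
   --port 5439 \
   --cidr 10.0.0.0/16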

To ensure that the Redshift cluster is launched in the same VPC as Spell, we need to create a Redshift-specific shim called a "cluster subnet group" and pass it into the Redshift create-cluster API. Redshift will pick one of these subnets to deploy to. Again, the cluster details page in the Spell web console has the list of subnets which are appropriate to pass into the command (you can also use any other subnet you've created in this VPC — we're using Spell's subnet list just as a matter of convenience):

$ aws redshift create-cluster-subnet-group \
   --cluster-subnet-group-name demo-subnet-group \
   --description "VPC subnet for the demo DynamoDB cluster." \
   --subnet-ids '["subnet-043b6e6e8a5ad339c","subnet-008036d37b1a16e96","subnet-0e0865d7e738289d5","subnet-0d672396f96645054"]'
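
If you'd rather pull this subnet list from the CLI than from the web console, you can query the VPC directly; a sketch using the demo VPC ID from earlier (swap in your own):

$ aws ec2 describe-subnets \
   --filters Name=vpc-id,Values=vpc-0ebd894999f0de3f4 \
   --output json \
   --query "Subnets[].SubnetId"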

We're now ready to deploy Redshift, using the create-cluster command to do so:

$ aws redshift create-cluster \
   --db-name demo-db \
   --cluster-identifier demo-cluster \
   --cluster-type single-node \
   --node-type ds2.xlarge \
   --master-username demo-user \
   --master-user-password Agent007 \
   --cluster-subnet-group-name demo-subnet-group \
   --vpc-security-group-ids '["sg-0aee81cd602b23eb3"]' \
   --availability-zone us-west-2c \
   --no-publicly-accessible
-----------------------------------------------------------------
|                         CreateCluster                         |
+---------------------------------------------------------------+
||                           Cluster                           ||
|+-----------------------------------+-------------------------+|
||  AllowVersionUpgrade              |  True                   ||
||  AutomatedSnapshotRetentionPeriod |  1                      ||
||  AvailabilityZone                 |  us-west-2c             ||
||  ClusterAvailabilityStatus        |  Modifying              ||
||  ClusterIdentifier                 |  demo-cluster           ||
||  ClusterStatus                    |  creating               ||
||  ClusterSubnetGroupName           |  demo-subnet-group      ||
||  ClusterVersion                   |  1.0                    ||
||  DBName                           |  demo-db                ||
||  Encrypted                        |  False                  ||
||  EnhancedVpcRouting               |  False                  ||
||  MaintenanceTrackName             |  current                ||
||  ManualSnapshotRetentionPeriod    |  -1                     ||
||  MasterUsername                   |  demo-user              ||
||  NextMaintenanceWindowStartTime   |  2020-11-05T07:00:00Z   ||
||  NodeType                         |  ds2.xlarge             ||
||  NumberOfNodes                    |  1                      ||
||  PreferredMaintenanceWindow       |  thu:07:00-thu:07:30    ||
||  PubliclyAccessible               |  False                  ||
||  VpcId                            |  vpc-0ebd894999f0de3f4  ||
|+-----------------------------------+-------------------------+|
|||                  ClusterParameterGroups                   |||
||+----------------------------+------------------------------+||
|||  ParameterApplyStatus      |  in-sync                     |||
|||  ParameterGroupName        |  default.redshift-1.0        |||
||+----------------------------+------------------------------+||
|||                   PendingModifiedValues                   |||
||+-------------------------------------------+---------------+||
|||  MasterUserPassword                       |  ****         |||
||+-------------------------------------------+---------------+||
|||                     VpcSecurityGroups                     |||
||+---------------------------+-------------------------------+||
|||  Status                   |  active                       |||
|||  VpcSecurityGroupId       |  sg-0aee81cd602b23eb3         |||
||+---------------------------+-------------------------------+||

As you can see, this command has created a demo-cluster containing a demo-db database on a single ds2.xlarge instance. The database has a demo-user master user with an Agent007 master password. To ensure that the Redshift instance launches in the right VPC, we pass in the demo-subnet-group we created earlier. To ensure it has the right network ingress rules set, we pass in the security group we created earlier.
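
Cluster creation takes a few minutes. Rather than polling the console, you can block until it's ready with the CLI's built-in waiter:

$ aws redshift wait cluster-available \
   --cluster-identifier demo-cluster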

Nearly there! Once the Redshift cluster has launched, we can use the describe-clusters API to get the network address of the Redshift endpoint:

# you will get a different value, obviously
$ aws redshift describe-clusters \
   --output json \
   --query "Clusters[0].Endpoint.Address"
"demo-cluster.cpe0xazraugv.us-west-2.redshift.amazonaws.com"

This is the address we need to pass to psql (alongside the database, user, port number, and, via interactive prompt, password) to connect to the Redshift database directly. If everything worked as expected, you should be able to launch a Jupyter workspace using the following command:

$ spell jupyter --lab \
   --pip awscli \
   --apt postgresql-client \
   redshift-demo

Then run the following psql command from a console within that workspace to connect to the Redshift cluster (and see a list of databases in the Redshift database, via the \l magic in psql):

$ psql \
 -h demo-cluster.cpe0xazraugv.us-west-2.redshift.amazonaws.com \
 -U demo-user -d demo-db -p 5439 \
 -c "\l"

Why does this work?

psql attempts to connect to this amazonaws.com address, which is associated with a Redshift cluster in the same VPC. For the connection to succeed, it must pass a couple of checks.

First, the security group rules governing egress from the EC2 instance the request originates from are checked. This is a super-permissive security group Spell manages for you that allows all egress, so the network request passes that check. Next, the security group rules governing ingress to the Redshift node the request is routed to are checked. This is the security group we just created — sg-0aee81cd602b23eb3 in my example — which allows all ingress using TCP on port 5439. Again, the network request checks out, and thus, subject to a password prompt, we are able to connect. ✨
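
If you ever need to debug a connection that fails here, inspecting both sets of rules with the CLI is a good first step; for example, to dump the ingress rules on the group we created:

$ aws ec2 describe-security-groups \
   --group-ids sg-0aee81cd602b23eb3 \
   --output json \
   --query "SecurityGroups[0].IpPermissions"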

Conclusion

That concludes our demo!

To clean up the custom instance profile resources, use the following commands:

$ aws iam remove-role-from-instance-profile \
   --instance-profile-name demo-custom-creds-instance-profile \
   --role-name demo-custom-creds-role
$ aws iam delete-instance-profile \
   --instance-profile-name demo-custom-creds-instance-profile
$ aws iam detach-role-policy \
   --role-name demo-custom-creds-role \
   --policy-arn arn:aws:iam::aws:policy/AmazonRedshiftFullAccess
$ aws iam delete-role --role-name demo-custom-creds-role

To delete the Redshift cluster and clean up the associated resources (note: the cluster will need to finish deleting before the security group and subnet group can be deleted; see the waiter sketch after these commands):

$ aws redshift delete-cluster \
   --cluster-identifier demo-cluster \
   --skip-final-cluster-snapshot
$ aws ec2 delete-security-group \
   --group-id sg-0aee81cd602b23eb3
$ aws redshift delete-cluster-subnet-group \
   --cluster-subnet-group-name demo-subnet-group
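
Since the cluster has to be fully gone before the security group and subnet group can be removed, you can again lean on a CLI waiter between the delete-cluster call and the other two:

$ aws redshift wait cluster-deleted \
   --cluster-identifier demo-cluster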

In this article, we saw an example of using a custom instance profile on Spell as applied to an Amazon Redshift instance on AWS. Note that though the specific APIs you will need to use for a different AWS service (EMR, for example) may be different, the same general ideas that apply to getting Redshift working should apply to whatever other AWS service you are interested in accessing.

Spell supports run permissioning on GCP as well: this uses GCP's version of a role-cum-instance-profile, the service account, but the process is broadly similar. We'll cover using Spell custom instance profiles to connect to the BigQuery service from within a GCP Spell cluster in a future article.

In the meantime, go forth and prosper. 🖖
