Attaching private cross-account S3 buckets to Spell

A core component of Spell is our virtualized filesystem, SpellFS.

SpellFS is essentially an abstraction over blob storage (AWS S3, GCP GCS, or Azure Blob Storage, depending on the cluster). Enabling access to your data within Spell runs, workspaces, or model servers generally requires first granting Spell read access to the bucket that holds it (attaching the bucket to your cluster) using the spell cluster add-bucket command.

Attaching buckets in the same cloud account as your cluster is generally a straightforward process. However, the process is a bit more involved if your data and your Spell cluster are located in different cloud accounts.

This article is a walkthrough of attaching these so-called cross-account buckets to a Spell cluster on AWS. Note that at the time of writing, this feature is only supported for Spell clusters on AWS.

Understanding AWS policies

In AWS lingo, permissions are granted using policy documents, which act as wrappers around individual policies. Security principals on AWS accept zero or more policy documents, each one containing zero or more policies. Policies themselves grant zero or more actions on zero or more resources. As an example, consider the following policy document from the AWS docs:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:ListBucket*",
        "s3:PutBucket*",
        "s3:GetBucket*"
      ],
      "Resource": [
        "arn:aws:s3:::examplebucket"
      ]
    }
  ]
}

This policy document contains a list of policies with only one entry. That lone policy grants permission to perform a set of four S3-related actions on a single resource, the S3 bucket examplebucket (not a real bucket).

Policy documents don't do anything until they're attached to a security principal. For the purposes of this walkthrough, the two types of security principals that you need to understand are roles and buckets.

A role is an assumable entity that can perform actions on or within your account. Attaching a policy document to a role grows the list of actions it can perform. A role may have zero or more policy documents attached (and a role with zero attachments has no permissions whatsoever). All of the actions Spell executes within your account use a role we create for you at cluster creation time.

A bucket is, well, an S3 bucket. Each S3 bucket may have at most one policy document attached; the attached document is known as the bucket policy for that bucket. Bucket policies are an optional feature (by default a bucket has no policy, which grants nothing) that allows the bucket owner to give additional entities access to the bucket. This includes, critically, entities outside of the cloud account the bucket is located in. In fact, granting access to entities outside the owner's cloud account is the primary purpose of the bucket policy feature.

So in summary, attaching a cross-account bucket to Spell requires two things:

  1. The role that your Spell cluster uses must include the cross-account bucket in its Resource list.
  2. The bucket must include that role as a Principal in its bucket policy.
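
If you want to see where each of these two requirements lives, here is a minimal boto3 sketch that inspects both sides once they are in place. The profile names are placeholders for this example, and it assumes the role's S3 read permissions are stored in inline policies (use list_attached_role_policies instead if they are managed policies); you need credentials in the bucket's account to read the bucket policy and credentials in the cluster's account to read the role.

# Sketch: inspect both sides of a cross-account attachment with boto3.
# "bucket-account" and "cluster-account" are placeholder AWS CLI profiles;
# the role and bucket names are the ones used later in this walkthrough.
import json

import boto3

# Requirement 2: the bucket policy should name the Spell role as a Principal.
s3 = boto3.Session(profile_name="bucket-account").client("s3")
bucket_policy = json.loads(s3.get_bucket_policy(Bucket="mini-cluster-tests")["Policy"])
for statement in bucket_policy["Statement"]:
    print(statement.get("Sid"), statement.get("Principal"))

# Requirement 1: the role's S3 read policy should list the bucket in its Resource section.
iam = boto3.Session(profile_name="cluster-account").client("iam")
role = "SpellAccess-8490497"
for name in iam.list_role_policies(RoleName=role)["PolicyNames"]:
    print(name, iam.get_role_policy(RoleName=role, PolicyName=name)["PolicyDocument"])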

Adding cross-account buckets to Spell: an example

Now that we understand the concepts involved, let's actually walk through adding a cross-account bucket to a Spell cluster.

For the purposes of this demonstration I will use the S3 bucket mini-cluster-tests.

This is a tiny S3 bucket containing just a single file—simple-csv.csv—that I created in my personal AWS account. We will be attaching this bucket to a cluster deployed in spell2, the Spell team's test AWS account.

We will be using the AWS web console, but note that everything we do here can also be done from the command line using the aws CLI tool (or in Python using boto3). To begin, navigate to your bucket's page in the AWS console, then click on "Permissions" in the navbar. This should take you to a permissions settings page with the two sections we need, "Block public access" and "Bucket policy", near the top:

In the "Block public access" section, click on Edit, then uncheck the check mark to disable this block. Note that disabling this is not alone enough to enable access to your bucket's objects, it's just an additional safety check that's mostly there to keep things from going wrong (for readers familiar with AWS EC2, this is akin to "Termination Protection").

Next, we need to fill out the bucket policy. This is a (JSON) policy document, editable in your web browser, with the following general shape:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SpellReadS3CrossAccount",
      "Effect": "Allow",
      "Principal": {
        "AWS": "$PRINCIPAL"
      },
      "Action": [
        ..."$ACTIONS"
      ],
      "Resource": [
        "arn:aws:s3:::$BUCKET_NAME",
        "arn:aws:s3:::$BUCKET_NAME/*"
      ]
    }
  ]
}

Replace $PRINCIPAL with the ARN (unique ID) of your Spell cluster's role. This is the top item on the cluster details page:

So in our example case it is arn:aws:iam::366388869580:role/SpellAccess-8490497.

Replace $ACTIONS with the list of actions Spell needs to be able to perform on this bucket. A copy of this list is available in our AWS cluster configuration documentation. As of the time of writing it is:

[
  "s3:GetLifecycleConfiguration",
  "s3:GetBucketTagging",
  "s3:GetInventoryConfiguration",
  "s3:GetObjectVersionTagging",
  "s3:ListBucketVersions",
  "s3:GetBucketLogging",
  "s3:ListBucket",
  "s3:GetAccelerateConfiguration",
  "s3:GetBucketPolicy",
  "s3:GetObjectVersionTorrent",
  "s3:GetObjectAcl",
  "s3:GetEncryptionConfiguration",
  "s3:GetBucketRequestPayment",
  "s3:GetObjectVersionAcl",
  "s3:GetObjectTagging",
  "s3:GetMetricsConfiguration",
  "s3:GetBucketPublicAccessBlock",
  "s3:GetBucketPolicyStatus",
  "s3:ListBucketMultipartUploads",
  "s3:GetBucketWebsite",
  "s3:GetBucketVersioning",
  "s3:GetBucketAcl",
  "s3:GetBucketNotification",
  "s3:GetReplicationConfiguration",
  "s3:ListMultipartUploadParts",
  "s3:GetObject",
  "s3:GetObjectTorrent",
  "s3:GetBucketCORS",
  "s3:GetAnalyticsConfiguration",
  "s3:GetObjectVersionForReplication",
  "s3:GetBucketLocation",
  "s3:GetObjectVersion"
]

You'll notice that this list consists solely of list and read type actions; this is because Spell mounts are read-only. Spell will never write back to a bucket it doesn't own. Furthermore, you'll notice that this list contains many actions (like s3:GetBucketWebsite) that Spell very obviously doesn't currently make use of. This is just a bit of future-proofing; it allows us to avoid any cluster migrations should we decide to start using any of these features in the future.

Finally, replace $BUCKET_NAME in the Resource section with the name of your bucket (mini-cluster-tests, in this example).

Note that the Resource section has two entries. One is the ARN of the bucket (arn:aws:s3:::mini-cluster-tests); the other is a wildcard entry matching any object within the bucket (arn:aws:s3:::mini-cluster-tests/*). Some S3 actions expect a bucket as their resource, and others expect an object within a bucket. We could instead write two policies in the document, one containing the actions that address the bucket and another containing the actions that address objects within the bucket. However, the AWS policy document specification has the neat property that if you combine the two in a single Resource list, as long as it's unambiguous to AWS which actions apply to which resources, it "just works".

After making these changes, here are the contents of our bucket policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SpellReadS3CrossAccount",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::366388869580:role/SpellAccess-8490497"
      },
      "Action": [
        "s3:GetLifecycleConfiguration",
        "s3:GetBucketTagging",
        "s3:GetInventoryConfiguration",
        "s3:GetObjectVersionTagging",
        "s3:ListBucketVersions",
        "s3:GetBucketLogging",
        "s3:ListBucket",
        "s3:GetAccelerateConfiguration",
        "s3:GetBucketPolicy",
        "s3:GetObjectVersionTorrent",
        "s3:GetObjectAcl",
        "s3:GetEncryptionConfiguration",
        "s3:GetBucketRequestPayment",
        "s3:GetObjectVersionAcl",
        "s3:GetObjectTagging",
        "s3:GetMetricsConfiguration",
        "s3:GetBucketPublicAccessBlock",
        "s3:GetBucketPolicyStatus",
        "s3:ListBucketMultipartUploads",
        "s3:GetBucketWebsite",
        "s3:GetBucketVersioning",
        "s3:GetBucketAcl",
        "s3:GetBucketNotification",
        "s3:GetReplicationConfiguration",
        "s3:ListMultipartUploadParts",
        "s3:GetObject",
        "s3:GetObjectTorrent",
        "s3:GetBucketCORS",
        "s3:GetAnalyticsConfiguration",
        "s3:GetObjectVersionForReplication",
        "s3:GetBucketLocation",
        "s3:GetObjectVersion"
      ],
      "Resource": [
        "arn:aws:s3:::mini-cluster-tests",
        "arn:aws:s3:::mini-cluster-tests/*"
      ]
    }
  ]
}

After clicking Save, this policy is now attached to our bucket.
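
If you prefer to skip the console, the same policy can be attached with boto3 instead. A minimal sketch, assuming the completed policy document above has been saved locally as bucket-policy.json (a filename chosen for this example):

# Sketch: attach the completed bucket policy programmatically rather than
# through the console. Assumes the JSON above is saved as bucket-policy.json.
import boto3

s3 = boto3.client("s3")
with open("bucket-policy.json") as f:
    policy_json = f.read()

s3.put_bucket_policy(Bucket="mini-cluster-tests", Policy=policy_json)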

The last thing we need to do is add this bucket to the Spell role (arn:aws:iam::366388869580:role/SpellAccess-8490497). Luckily, this step is now automated! We can proceed directly to attaching the bucket to Spell by running the following command:

# mini-cluster-tests is the name of the example test bucket
$ spell cluster add-bucket --cross-account --bucket mini-cluster-tests

The --cross-account flag tells Spell to treat this attachment as a cross-account bucket. When you execute this command, the Spell API server adds the bucket to the list of buckets in the Resource section of the S3Read policy document attached to the Spell IAM role. It then smoke tests that you actually do have access to the bucket (via the bucket policy you just created) by assuming the Spell IAM role and running an s3:ListBucket action on the bucket. If this API request succeeds, we go ahead and add the bucket to the list of buckets attached to the cluster. If it fails, we return an error.
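
To make that smoke test concrete, here is a rough boto3 sketch of an equivalent check you could run yourself: assume the Spell role (which requires credentials that are allowed to assume it) and attempt to list the bucket. This is only an illustration of the kind of check performed, not Spell's actual implementation:

# Sketch: verify cross-account access by assuming the Spell role and listing
# the bucket, roughly what the add-bucket smoke test does. Requires credentials
# that are permitted to assume the role.
import boto3

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::366388869580:role/SpellAccess-8490497",
    RoleSessionName="cross-account-bucket-check",
)["Credentials"]

s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# If both the bucket policy and the role's Resource list are in place, this succeeds.
response = s3.list_objects_v2(Bucket="mini-cluster-tests", MaxKeys=5)
print([obj["Key"] for obj in response.get("Contents", [])])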

Once the bucket is attached to the cluster, it will be available to mount in any and all of your runs, workspaces, and model servers!

For example, after running this command I could verify that it succeeded and that the objects were available by creating a new test workspace mounting this bucket:

And verifying that we can access the data from within the resulting JupyterLab instance:
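
For instance, a quick check from a notebook cell might look like the following. The mount path below is an assumption for illustration; substitute whatever path you chose when mounting the bucket:

# Sketch: read the test file from inside the workspace. The mount path is
# hypothetical; use the path you configured for the mount.
import pandas as pd

df = pd.read_csv("/spell/mini-cluster-tests/simple-csv.csv")
print(df.head())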

That concludes this walkthrough!
