Runs

Runs are computational model training jobs executed on the cloud using Spell.

Screenshot of a run

Runs quickstart

Runs are created using the spell run command. The simplest possible run is spell run "echo Hello World", which simply prints "Hello World" in the terminal:

$ spell run "echo Hello World"
Enumerating objects: 45, done.
Counting objects: 100% (45/45), done.
Delta compression using up to 12 threads
Compressing objects: 100% (38/38), done.
Writing objects: 100% (44/44), 2.52 MiB | 5.60 MiB/s, done.
Total 44 (delta 11), reused 0 (delta 0)
To git.spell.ml:spell-org/15af21ee5b9f3e47e817f886fb8d262a0cf6a57b.git
 * [new branch]      HEAD -> br_1b451cbb90e1eb1833a8938f37ec16ac4fef2d60
💫 Casting spell #1577…
✨ Stop viewing logs with ^C
✨ Machine_Requested… done
✨ Building… done
✨ Run is running
Hello World
✨ Saving… done
✨ Pushing… done
🎉 Total run time: 6.870837s
🎉 Run 1577 complete

This run went through several well-defined steps:

  • This command was run from within a git repository. The client uploaded the contents of this repository to Spell.
  • The run orchestrator assigned this run an ID.
  • The run orchestrator provisioned a cloud machine and assigned it to this run.
  • The machine built the run environment (in a Docker container) and copied over the code repository.
  • The machine executed the run entrypoint, which printed Hello World and then exited.
  • The machine saved any files written to disk during the run to object storage.
  • The machine pushed the run environment to the Spell Docker cache, then went into an idle state.

Spell handles all of the work of provisioning, orchestrating, and releasing cloud machines so that you don't have to. By caching both the code and the environment image, we minimize build times, and by briefly holding machines in an idle state after a run completes, we minimize machine startup time upon repeated use.

Package and environment dependencies are passed using the --apt, --pip, --env, and --conda-file flags. Datasets are passed using the --mount flag, which takes an object storage path as input. The machine type used for the run is set using the --machine-type flag. Here's a more realistic example showing all of these things in action:

$ spell run \
    --machine-type T4 \
    --mount s3://data-bucket/dataset/:/mnt/train/ \
    --apt ffmpeg \
    --pip moviepy \
    "python style.py --train-path /mnt/train/"

In this example we:

  • Specified that we want a T4 GPU machine with the --machine-type flag.
  • Mounted the contents of s3://data-bucket/dataset/ to /mnt/train/ on disk using the --mount flag.
  • Added the ffmpeg apt dependency.
  • Added the moviepy pip dependency.
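
When runs are launched from scripts, it can help to assemble the command programmatically. The following is a minimal sketch that builds the argument list for the example above using only flags named on this page. The helper build_spell_run is hypothetical and is not part of the Spell client; it only constructs the command and does not execute it.

```python
import shlex


def build_spell_run(command, machine_type=None, mounts=(), apt=(), pip=(), env=None):
    """Assemble a `spell run` argument list from flags described on this page.

    Hypothetical helper for illustration; it is not part of the Spell client.
    """
    args = ["spell", "run"]
    if machine_type:
        args += ["--machine-type", machine_type]
    for mount in mounts:
        args += ["--mount", mount]
    for package in apt:
        args += ["--apt", package]
    for package in pip:
        args += ["--pip", package]
    for key, value in (env or {}).items():
        args += ["--env", f"{key}={value}"]
    args.append(command)  # the run entrypoint is passed as a single argument
    return args


args = build_spell_run(
    "python style.py --train-path /mnt/train/",
    machine_type="T4",
    mounts=["s3://data-bucket/dataset/:/mnt/train/"],
    apt=["ffmpeg"],
    pip=["moviepy"],
)
# shlex.join quotes the entrypoint so the line can be pasted into a shell.
print(shlex.join(args))
```

Keeping the command as a list (rather than a single string) avoids quoting mistakes if it is later handed to subprocess.run.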

These are just some of the features that Spell runs support. The rest of this page discusses runs in more detail.

Selecting a machine type

Runs on Spell are executed on virtual machines running in the cloud (specifically, an EC2 instance on AWS, a GCE VM on GCP, or an Azure Compute VM on Azure). The --machine-type flag on the spell run command allows you to select the machine type the run will execute on. For example:

$ spell run --machine-type K80 'python example.py'

If no --machine-type is provided, the command will default to the basic cpu machine type.

The list of machine types you have access to depends on your plan. Users on the Spell for Teams or Spell for Enterprise plans can define and use their own machine types. Spell supports machines ranging from simple CPU instances all the way up to latest-gen NVIDIA A100s. See the page Cluster Machine Type Management for more details and a list of supported hardware.

Users of Spell for Developers are limited to the following instance types. CPUs:

Machine Type     vCPUs   RAM (GB)   Price ($/hour)
cpu (default)    2       4          Free
cpu-big          16      32         $0.68
cpu-huge         72      144        $3.06
ram-big          16      128        $1.01
ram-huge         96      768        $6.05


GPUs:

Machine Type   NVIDIA GPU        VRAM (GB)   TFLOPS   vCPUs   RAM (GB)   Price ($/hour)
K80            1 x Tesla K80     12.0        4.3      4       61         $0.90
T4             1 x Tesla T4      16.0        8.1      4       16         $0.526
P100           1 x Tesla P100    16.0        9.3      4       15         $1.66
V100           1 x Tesla V100    16.0        15.7     8       61         $3.06
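
As a rough budgeting aid, the hourly prices above can be turned into a cost estimate. This sketch assumes cost is simply the hourly price multiplied by the run time in hours; the actual billing granularity is not stated on this page, and estimate_cost is a hypothetical helper.

```python
from decimal import Decimal

# Hourly prices (USD) copied from the Spell for Developers tables above.
HOURLY_PRICE = {
    "cpu": Decimal("0"),
    "cpu-big": Decimal("0.68"),
    "cpu-huge": Decimal("3.06"),
    "ram-big": Decimal("1.01"),
    "ram-huge": Decimal("6.05"),
    "K80": Decimal("0.90"),
    "T4": Decimal("0.526"),
    "P100": Decimal("1.66"),
    "V100": Decimal("3.06"),
}


def estimate_cost(machine_type, hours):
    """Estimate the cost in USD, assuming price-times-hours billing (an assumption)."""
    price = HOURLY_PRICE[machine_type]
    return (price * Decimal(str(hours))).quantize(Decimal("0.01"))


print(estimate_cost("V100", 2.5))  # 3.06 * 2.5 -> 7.65
print(estimate_cost("T4", 10))     # 0.526 * 10 -> 5.26
```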

Mounting code from a local repository

The spell run command is usually executed from within a git repository on your local machine. Spell will automatically git push that code to a GitLab cache. Once a cloud machine is warmed up and ready to execute your code, Spell will git pull the code out of that cache and onto the machine.

You can easily turn any folder on your local machine into a git repository by running the git init command. For example:

$ mkdir project
$ cd project
$ git init
$ echo 'Running on Spell!' > printme.txt
$ git add printme.txt && git commit -m "Initial commit"
$ spell run cat printme.txt

Spell will push all committed and staged changes to the run.

Spell's behavior for uncommitted changes is configurable. By default, Spell will push uncommitted changes to files that have been checked into the repository. If your repository state includes changes to files that have not yet been checked in, you will be prompted interactively to either check those files in or proceed without them.

Runs making use of uncommitted changes will include an additional Uncommitted Changes field in the run summary card in the web console which will allow you to download the changes as a diff file (generated using git diff).

To disable pushing uncommitted changes, edit your local machine's Spell config file (located in ~/.spell/ on macOS and Linux, or AppData/Roaming/spell on Windows) and set include_uncommitted to false. Doing so will cause any spell run commands executed in repositories with uncommitted changes (staged or unstaged) to raise an error.

Note that, due to a dependency on rsync, pushing uncommitted changes does not work on Windows or Windows Subsystem for Linux (WSL). As a result, users on Windows machines will always have include_uncommitted set to false (the client ignores the config flag on Windows).

Additionally, note that Spell uses SSH to sync your code. If you require special SSH configuration (e.g. proxy commands for network firewalls), update the ssh_config file in the Spell config directory.

Finally, note that Spell does not support git-lfs or git submodules.

Mounting code from a GitHub repository

You can also instruct Spell to pull your code from a GitHub repository using the --github-url flag:

$ spell run \
    --github-url https://github.com/spellml/cnn-cifar10 \
    "python models/train_basic.py"

This will pull the code from the HEAD of the default branch of the repository on GitHub (typically master or main). To pull a different commit reference, use the --github-ref flag (example: --github-ref ab2718).

--github-url can be used with any public repository out of the box. It can also be used with any private repository you have configured access to using the Spell GitHub integration.

Installing package dependencies

Runs on Spell are executed in a Linux environment with Ubuntu 20.04, Python 3.8, CUDA 10.1, and cuDNN 7.6. The following Python packages are installed by default:

tensorflow==2.3.2
torch==1.8.1
torchvision==0.9.1
pytorch-lightning==1.2.6
requests
numpy
pandas
matplotlib
scikit-learn
xgboost

Spell supports the apt, pip, and conda package managers for further environment customization.

Any package reachable from the default set of package indexes shipped with Ubuntu 20.04 can be installed on Spell using --apt:

$ spell run --apt protobuf-compiler "python train.py"

Additional Python pip dependencies can be installed using the --pip option:

$ spell run --pip dask "python train.py"

Use --pip-req to install packages from a pip requirements file:

$ spell run --pip-req requirements.txt "python train.py"

Finally, you can install conda packages by supplying a conda requirements file to --conda-file:

$ spell run --conda-file environment.yml "python train.py"

The conda environment name in the run on Spell will be spell.

The --conda-file and --pip-req flags are mutually exclusive. All other flags can be combined as needed. For example:

$ spell run \
    --apt libprotobuf-dev --apt protobuf-compiler \
    --pip-req requirements.txt --pip "Pillow>=8.3.2" \
    "python train.py"

Packages listed in the file passed to --pip-req are installed first, followed by packages passed to --pip. As a result, if different versions of the same package are specified in both, the version passed to --pip takes precedence.
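
The precedence rule can be modeled as "later entries replace earlier ones for the same package". The sketch below is a simplified illustration of that ordering only; it is not Spell's or pip's actual resolver, and the function names and sample requirements are hypothetical.

```python
import re


def package_name(requirement):
    """Extract a normalized package name from a requirement such as 'Pillow>=8.3.2'."""
    name = re.split(r"[<>=!~\[; ]", requirement.strip(), maxsplit=1)[0]
    return name.lower().replace("_", "-")


def effective_requirements(pip_req_lines, pip_flags):
    """Model the install order: requirements-file entries first, then --pip entries.

    A later entry replaces an earlier one for the same package, mirroring the
    precedence described above. Simplified model, not a real dependency resolver.
    """
    resolved = {}
    for requirement in list(pip_req_lines) + list(pip_flags):
        requirement = requirement.strip()
        if not requirement or requirement.startswith(("#", "-")):
            continue  # skip blanks, comments, and pip options such as --index-url
        resolved[package_name(requirement)] = requirement
    return resolved


requirements_txt = ["Pillow==8.0.0", "numpy>=1.19", "# a comment"]  # sample file contents
pip_flags = ["Pillow>=8.3.2"]                                       # value passed to --pip
print(effective_requirements(requirements_txt, pip_flags))
```

Here the Pillow specifier from --pip replaces the one from the requirements file, while numpy is left as specified in the file.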

The --pip, --pip-req, and --conda-file flags may be used to install private Python packages from GitHub repositories which you have configured access to. To learn more, refer to the page Integrating GitHub.

Requirements files can also be used to specify additional pip installation options such as --find-links or --index-url which Spell's --pip flag does not support. See the requirements file format specification for more information.

Setting environment variables

Environment variables can be passed to a run using the --env flag:

$ spell run --env CUDA_VISIBLE_DEVICES=1 "python example.py"
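
Inside the run, a variable passed this way can be read like any other environment variable. The sketch below shows one way example.py might consume it; the helper visible_devices is hypothetical, and the assignment to os.environ only simulates what the --env flag would provide.

```python
import os


def visible_devices(default=""):
    """Return the GPU indices listed in CUDA_VISIBLE_DEVICES as a list of strings."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", default)
    return [item.strip() for item in raw.split(",") if item.strip()]


# Simulate what `--env CUDA_VISIBLE_DEVICES=1` would provide inside the run.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
print(visible_devices())  # ['1']
```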

Mounting resources

Resources are groups of files (like datasets) stored in object storage (AWS S3, GCP GCS, or Azure Blob Storage) which may be mounted into a run using the --mount option. To use the mount option, specify the resource path to your dataset and the path to mount the dataset to on Spell. For example, the following code snippet mounts a demo audio dataset to the /mnt/audio-data path and prints out its contents from inside of the run.

$ spell run --mount public/audio/css10:/mnt/audio-data "ls /mnt/audio-data"
💫 Casting spell #9…
✨ Stop viewing logs with ^C
✨ Machine_Requested… done
✨ Building… done
✨ Mounting… done
✨ Run is running
chinese-single-speaker-speech-dataset
dutch-single-speaker-speech-dataset
finnish-single-speaker-speech-dataset
french-single-speaker-speech-dataset
german-single-speaker-speech-dataset
greek-single-speaker-speech-dataset
hungarian-single-speaker-speech-dataset
japanese-single-speaker-speech-dataset
korean-single-speaker-speech-dataset
russian-single-speaker-speech-dataset
spanish-single-speaker-speech-dataset

As a best practice, we recommend using the /mnt/ directory as the root directory for your data.

Alternatively, you can choose to omit all or part of the mount path, in which case the default path and/or default name will be used:

$ spell run --mount public/audio/css10:audio-data "python main.py"
$ spell run --mount public/audio/css10 "python main.py"

Resources may be uploaded to Spell, outputted by previous Spell runs, mounted from public buckets, or mounted from private buckets you have configured access to. To learn more about this feature refer to the Resources page.
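
From the point of view of your code, a mounted resource is just a directory on disk. The following sketch shows how a training script might accept the mount target as an argument and enumerate the files beneath it. The --train-path option mirrors the quickstart example; the function names are hypothetical.

```python
import argparse
from pathlib import Path


def list_dataset_files(root):
    """Return the relative paths of all files below the mount point, sorted."""
    root = Path(root)
    return sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # /mnt/train/ matches the mount target used in the quickstart example.
    parser.add_argument("--train-path", default="/mnt/train/")
    return parser.parse_args(argv)


args = parse_args(["--train-path", "/mnt/train/"])
root = Path(args.train_path)
if root.is_dir():
    for name in list_dataset_files(root):
        print(name)
else:
    # Outside of a run with the mount configured, the directory will not exist.
    print(f"{root} is not available on this machine")
```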

Saving resources

When you execute a run from within a local git repository, Spell will create a /spell/$REPO_NAME/ folder and unpack the contents of your repository into that folder. If you use the --github-url flag instead, those files will land in /spell/. In either case, this folder will be set as your current working directory.

Any files that a Spell run saves within your current working directory will automatically be saved to the runs/$RUNID directory in our virtual filesystem, SpellFS, at the end of the run. These files will then be available to browse, download, or mount into a different run. For example:

$ spell run "echo hello world! > foo.txt"
$ spell run --mount runs/435/foo.txt:foo.txt "cat foo.txt"
hello world!

Additionally, files saved to a mounted directory inside of a Spell run also get backed up.
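
In practice this means a script only needs to write its outputs relative to the working directory. Below is a minimal sketch of a helper that does so; save_metrics and the file name metrics.json are hypothetical, and the temporary directory stands in for the run's working directory so the example can be tried locally.

```python
import json
import tempfile
from pathlib import Path


def save_metrics(metrics, directory=None, filename="metrics.json"):
    """Write metrics as JSON below the working directory and return the file path."""
    directory = Path(directory) if directory is not None else Path.cwd()
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / filename
    path.write_text(json.dumps(metrics, indent=2))
    return path


# The temporary directory stands in for the run's working directory.
with tempfile.TemporaryDirectory() as workdir:
    path = save_metrics({"epoch": 3, "val_acc": 0.91}, directory=workdir)
    print(path.name, json.loads(path.read_text()))
```

Inside a run, calling save_metrics without a directory argument would write into the current working directory.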

To learn more about resources see the page Resources.

Interrupting a run

A run can be stopped or killed at any time.

$ spell stop $RUN_ID
$ spell kill $RUN_ID

Stopping a run will send a graceful shutdown signal: the run will save all of its contents to disk before exiting, and all of the files that the run has generated so far (model checkpoints, for example) will be available in the SpellFS.

Killing a run will send a hard shutdown signal. The machine will be interrupted and the run will be killed without being given the opportunity to save to disk.

We recommend only killing runs whose output you are sure you will not need, for example, runs whose code has a bug. Machines whose runs get killed are recycled faster because they do not have to upload any data to object storage, which is useful when you are debugging. For long-lived runs, stopping the run is usually much more appropriate.
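
Because a stopped run keeps the files it has written so far, it pays to checkpoint regularly during long-lived runs. The sketch below shows one possible pattern: write each checkpoint to a temporary file and rename it into place, so that an interruption never leaves a half-written checkpoint behind. The training loop, file names, and checkpoint interval are placeholders.

```python
import json
import os
import tempfile
from pathlib import Path


def write_checkpoint(state, path):
    """Write the checkpoint atomically so an interruption never leaves a partial file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_name, path)  # atomic rename on the same filesystem


def train(epochs, checkpoint_path, every=2):
    """Toy training loop that checkpoints every `every` epochs."""
    for epoch in range(1, epochs + 1):
        loss = 1.0 / epoch  # placeholder for a real training step
        if epoch % every == 0:
            write_checkpoint({"epoch": epoch, "loss": loss}, checkpoint_path)


# The temporary directory stands in for the run's working directory.
with tempfile.TemporaryDirectory() as workdir:
    checkpoint = Path(workdir) / "checkpoints" / "latest.json"
    train(epochs=5, checkpoint_path=checkpoint)
    print(json.loads(checkpoint.read_text())["epoch"])  # last checkpointed epoch: 4
```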

Viewing run logs

Spell automatically captures anything your run outputs to the STDOUT stream (for example, print statements in Python) or the STDERR stream (for example, warning messages and unhandled exceptions in Python). These logs are persisted to object storage for long-term storage.

To view the logs for a run using the Spell CLI, use the spell logs $RUN_ID command. To view logs using the web console, visit the run details page and scroll to the bottom.

Note

Logs are rate-limited: you may log no more than 1000 loglines within a 10 second rolling interval per run. Additional loglines above this limit will be dropped. If you need to log a lot of lines at once, we recommend outputting that data to a file instead.
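
One way to stay under this limit is to throttle on the client side and divert the excess to a file. The following is a minimal sketch of such a throttle using a rolling window; the ThrottledLogger class and the overflow file name are hypothetical and are not part of Spell. The defaults mirror the documented limit of 1000 lines per 10 seconds.

```python
import os
import tempfile
import time
from collections import deque


class ThrottledLogger:
    """Print at most `limit` lines per rolling `window` seconds; spill the rest to a file."""

    def __init__(self, overflow_path, limit=1000, window=10.0, clock=time.monotonic):
        self.overflow_path = overflow_path
        self.limit = limit
        self.window = window
        self.clock = clock
        self.sent = deque()  # timestamps of lines printed within the current window

    def log(self, line):
        """Print the line if the budget allows; otherwise append it to the overflow file."""
        now = self.clock()
        # Discard timestamps that have left the rolling window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            self.sent.append(now)
            print(line)
            return True
        with open(self.overflow_path, "a") as handle:
            handle.write(line + "\n")
        return False


# Small limit so the effect is visible; a real script would keep the defaults.
with tempfile.TemporaryDirectory() as workdir:
    logger = ThrottledLogger(os.path.join(workdir, "overflow.log"), limit=3)
    results = [logger.log(f"step {i}") for i in range(5)]
    print(results)  # [True, True, True, False, False]
```

Writing the overflow file inside the working directory means it is saved with the rest of the run's outputs.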

(Advanced) Using custom public Docker images

If you have a use case that requires environment configuration deeper than what the --apt, --pip, and --conda-file flags can provide, you can choose to provide Spell with a custom Docker image instead using the --docker-image flag. This parameter takes a container image URL as input, i.e. a URL in the form <domain>/<repository>/<image_name>:<tag>. The domain and repository parts of the path are both optional; if you omit them, public DockerHub (https://hub.docker.com/_/) will be used. For example:

$ spell run -f --docker-image 'python:latest' \
    'python -c "print(\"Hello World\")"'

The --docker-image flag can be combined with other environment configuration flags as needed.

(Advanced) Using custom private Docker images

Note

For a demonstration of how you can use this feature to customize your workspace, refer to our blog post: "Using custom Docker images in Spell runs and workspaces".

Users on the community edition must pull from a public registry. Users on Spell For Teams may pull from private Amazon Elastic Container Registry (ECR), Google Container Registry (GCR), or Azure Container Registry (ACR) instead.

The --docker-image flag on the spell run command can be used to initialize a run environment using an image pulled from a Docker registry. Teams users who have configured their own cluster can grant Spell access to the private registry of the cluster's cloud provider using the spell cluster add-docker-registry command.

If your cluster is on AWS, for example, you can use the command to add one of the ECR repos currently available in that AWS account. This will update the IAM permissions that Spell uses to control your cluster, allowing it access to the repo. For Azure clusters, specify the name of the blob container in your storage account using the --repo option.

$ spell cluster add-docker-registry
This command will
    - Allow Spell to get authorization tokens to access your docker registry
    - If no repository is specified, list your repositories in the registry
    - Add read permissions for that repository to the IAM role associated with the cluster
All of this will be done with your AWS profile 'default' which has Access Key ID 'ABCDEFGHIJKL' and region 'us-west-2' - continue? [y/N]: y
Spell does not yet have access to the following repos found in your AWS account:
- image-processing
- text-generation
- image-gan
Please choose a repository: text-generation
...
Successfully added read permission to text-generation

Once set up, you can use the --docker-image argument to spell run to specify a Docker image pushed to the private ECR repository.

The registry permissions can be removed at any time with spell cluster delete-docker-registry.

(Advanced) Installing packages from the local environment

You may have Spell inherit its packages from the local virtual environment you execute the spell run command within. This is done using the --deps-from-env flag:

$ spell run --deps-from-env "python train.py"

If you run this command from within a Virtualenv, Pipenv, or Poetry environment, the pip-chill module will be used to marshall an environment definition. If you run this command from within a Conda environment, the output of the conda env export --from-history command will be used instead.

Packages installed this way will be overwritten by packages installed using the --pip, --pip-req, and --conda-file flags.

(Advanced) Specifying an early stopping condition

Early stopping is the technique of ending a training run early, before the model training process has reached its maximum number of epochs, based on lack of improvement in the model accuracy.

We support a form of early stopping directly in the CLI using the --stop-condition flag. For example, --stop-condition "keras/val_acc < 0.5 : 10" will stop model training if the run hasn't reached a validation accuracy of 50 percent by the tenth time this metric is logged (e.g. by the tenth epoch, if you are logging once per epoch), or if the validation accuracy dips below 50 percent at any subsequent check.

We support the <, >, <=, and >= operators. The second parameter is optional. If it is left unspecified, the early stopping check will be performed every time the training metric is logged.
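
To make the semantics concrete, here is a small local model of how such a condition could be evaluated against a series of logged values. It reflects one reading of the description above (the check applies from the Nth logged value onward, or to every value when N is omitted) and is not Spell's implementation; the function names and sample values are hypothetical.

```python
import operator
import re

OPERATORS = {"<": operator.lt, ">": operator.gt, "<=": operator.le, ">=": operator.ge}
# Format assumed from the example above: "<metric> <op> <threshold> [: <count>]".
PATTERN = re.compile(r"^\s*(\S+)\s*(<=|>=|<|>)\s*([-+0-9.eE]+)\s*(?::\s*(\d+)\s*)?$")


def parse_stop_condition(text):
    """Split a condition string into (metric, comparison, threshold, min_count)."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized stop condition: {text!r}")
    metric, op, threshold, min_count = match.groups()
    return metric, OPERATORS[op], float(threshold), int(min_count) if min_count else None


def first_stop_index(text, values):
    """Return the 1-based index of the logged value that triggers the stop, or None."""
    _, compare, threshold, min_count = parse_stop_condition(text)
    for index, value in enumerate(values, start=1):
        if (min_count is None or index >= min_count) and compare(value, threshold):
            return index
    return None


val_acc = [0.30, 0.35, 0.40, 0.42, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49]
print(first_stop_index("keras/val_acc < 0.5 : 10", val_acc))  # 10
print(first_stop_index("keras/val_acc < 0.5", val_acc))       # 1
```

With the count set to 10, the first nine sub-threshold values are tolerated; without it, the very first logged value triggers the stop.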

(Advanced) Run states

Runs transition through the following states:

  • Requested: the run has been created, and is queued for execution. Typically, a run will transition through this state very quickly. However, a run could stay Requested for an extended period of time for two reasons:
      • You already have the maximum number of concurrent runs executing that is allowed per your plan. If this is the case, your run will transition to Building when the number of concurrent runs you have falls below the limit.
      • Spell received a number of runs at the same time and is starting up more machines to execute your run. If this is the case, your run will transition to Building as soon as a machine is ready.
  • Building: the environment for your run is created. This includes installing any dependencies that are specified (e.g. --pip or --apt parameters provided on the command line to create the run) and copying the code from your Git repository into the run. First we check to see if the run uses an environment from a previous run, and if so we use the cached environment instead of building the environment again, which minimizes build time.
  • Mounting: the resources (if any) that were specified are mounted into the run. (See the Resources section for more information.)
  • Running: your command is executing!
  • Saving: any new or modified files from your command are saved as a resource into runs/
  • Pushing: your environment is pushed to our cache of environments to expedite a future run with the same environment. Usually, your run will transition through this step very quickly since this push happens asynchronously in the background throughout every other state. The push is usually completed by the time this state is reached unless your run is very quick.

All runs eventually transition to one of the following final states:

  • Complete: this is the normal state for a run to transition to upon completion.
  • Killed: the run was killed using the spell kill command. When a run is killed, it does not transition through any subsequent steps and is immediately terminated.
  • Stopped: the run was stopped using the spell stop command. A stopped run still transitions through the Saving and Pushing states; however, its Running state is exited as soon as the spell stop command is issued.
  • Failed: the run did not complete due to an error in some other state.
  • Interrupted: the run did not complete because it was executing on a spot instance that got reclaimed (this state is only possible when using spot instances on Spell for Teams).
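
The states above can be summarized as a small state model. The sketch below encodes the sequences described on this page: a completed or stopped run passes through every stage, while a killed run terminates immediately. It additionally assumes that Failed and Interrupted runs skip the remaining stages, which this page does not state explicitly. The RunState enum and lifecycle function are illustrative only and are not part of the Spell client.

```python
from enum import Enum


class RunState(Enum):
    REQUESTED = "Requested"
    BUILDING = "Building"
    MOUNTING = "Mounting"
    RUNNING = "Running"
    SAVING = "Saving"
    PUSHING = "Pushing"
    COMPLETE = "Complete"
    KILLED = "Killed"
    STOPPED = "Stopped"
    FAILED = "Failed"
    INTERRUPTED = "Interrupted"


PIPELINE = [RunState.REQUESTED, RunState.BUILDING, RunState.MOUNTING,
            RunState.RUNNING, RunState.SAVING, RunState.PUSHING]
FINAL_STATES = {RunState.COMPLETE, RunState.KILLED, RunState.STOPPED,
                RunState.FAILED, RunState.INTERRUPTED}


def lifecycle(outcome=RunState.COMPLETE, at=RunState.RUNNING):
    """Return the sequence of states for a run that ends in `outcome`.

    `at` is the stage during which a kill, failure, or interruption occurs.
    """
    if outcome not in FINAL_STATES:
        raise ValueError(f"{outcome} is not a final state")
    if at not in PIPELINE:
        raise ValueError(f"{at} is not a pipeline stage")
    if outcome in (RunState.COMPLETE, RunState.STOPPED):
        # A stopped run leaves Running early but still saves and pushes.
        return PIPELINE + [outcome]
    # Killed runs terminate immediately; for Failed and Interrupted runs,
    # skipping the remaining stages is an assumption of this sketch.
    return PIPELINE[: PIPELINE.index(at) + 1] + [outcome]


print(" -> ".join(s.value for s in lifecycle()))
print(" -> ".join(s.value for s in lifecycle(RunState.KILLED, at=RunState.BUILDING)))
```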