Runs
Runs are computational model training jobs executed on the cloud using Spell.
Runs quickstart
Runs are created using the spell run command. The simplest possible run is spell run "echo Hello World", which simply prints "Hello World" in the terminal:
$ spell run "echo Hello World"
Enumerating objects: 45, done.
Counting objects: 100% (45/45), done.
Delta compression using up to 12 threads
Compressing objects: 100% (38/38), done.
Writing objects: 100% (44/44), 2.52 MiB | 5.60 MiB/s, done.
Total 44 (delta 11), reused 0 (delta 0)
To git.spell.ml:spell-org/15af21ee5b9f3e47e817f886fb8d262a0cf6a57b.git
* [new branch] HEAD -> br_1b451cbb90e1eb1833a8938f37ec16ac4fef2d60
💫 Casting spell #1577…
✨ Stop viewing logs with ^C
✨ Machine_Requested… done
✨ Building… done
✨ Run is running
Hello World
✨ Saving… done
✨ Pushing… done
🎉 Total run time: 6.870837s
🎉 Run 1577 complete
This run went through several well-defined steps:
- This command was run from within a git repository. The client uploaded the contents of this repository to Spell.
- The run orchestrator assigned this run an ID.
- The run orchestrator provisioned a cloud machine and assigned it to this run.
- The machine built the run environment (in a Docker container) and copied over the code repository.
- The machine executed the run entrypoint, which printed Hello World and then exited.
- The machine saved any files written to disk during the run to object storage.
- The machine pushed the run environment to the Spell Docker cache, then went into an idle state.
Spell handles all of the work of provisioning, orchestrating, and releasing cloud machines so that you don't have to. By caching both code and environment image we minimize build times, and by briefly holding machines in an idle state after the run is complete we minimize machine startup time upon repeated use.
Package and environment dependencies are passed using the --apt, --pip, --env, and --conda-file flags. Datasets are passed using the --mount flag, which takes an object storage path as input. The machine type used for the run is set using the --machine-type flag. Here's a more realistic example showing several of these flags in action:
$ spell run \
--machine-type T4 \
--mount s3://data-bucket/dataset/:/mnt/train/ \
--apt ffmpeg \
--pip moviepy \
"python style.py --train-path /mnt/train/"
In this example we:
- Specified that we want a T4 GPU machine with the --machine-type flag.
- Mounted the contents of s3://data-bucket/dataset/ to /mnt/train/ on disk using the --mount flag.
- Added the ffmpeg apt dependency.
- Added the moviepy pip dependency.
These are just some of the features that Spell runs support. The rest of this page discusses runs in more detail.
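When launching several similar runs from a script, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of the Spell client: build_spell_run is a hypothetical helper name, and it only emits flags already shown on this page (--machine-type, --mount, --apt, --pip).

```python
import shlex
from typing import Dict, List, Optional


def build_spell_run(
    command: str,
    machine_type: Optional[str] = None,
    mounts: Optional[Dict[str, str]] = None,
    apt: Optional[List[str]] = None,
    pip: Optional[List[str]] = None,
) -> List[str]:
    """Return an argv list for `spell run` (hypothetical helper, not a Spell API)."""
    argv = ["spell", "run"]
    if machine_type:
        argv += ["--machine-type", machine_type]
    # Each mount is written as <resource path>:<mount path>, as in the example above.
    for resource, target in (mounts or {}).items():
        argv += ["--mount", f"{resource}:{target}"]
    for package in apt or []:
        argv += ["--apt", package]
    for package in pip or []:
        argv += ["--pip", package]
    # The run entrypoint goes last, as a single argument.
    argv.append(command)
    return argv


argv = build_spell_run(
    "python style.py --train-path /mnt/train/",
    machine_type="T4",
    mounts={"s3://data-bucket/dataset/": "/mnt/train/"},
    apt=["ffmpeg"],
    pip=["moviepy"],
)
# Print a copy-pasteable command line; pass argv to subprocess.run to launch it.
print(shlex.join(argv))
```

Building an argument list rather than a single string avoids shell quoting mistakes in the entrypoint command.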
Selecting a machine type
Runs on Spell are executed on virtual machines running on the cloud (specifically, an EC2 instance on AWS, a GCE VM on GCP, or an Azure Compute VM on Azure). The --machine-type flag on the spell run command allows you to select the machine type the run will execute on. For example:
$ spell run --machine-type T4 'python example.py'
If no --machine-type is provided, the command will default to the basic cpu machine type.
The list of machine types you have access to depends on your plan. Users on the Spell for Teams or Spell for Enterprise plans can define and use their own machine types. Spell supports machines ranging from simple CPU instances all the way up to latest-gen NVIDIA A100s. See the page Cluster Machine Type Management for more details and a list of supported hardware.
Users of Spell for Developers are limited to the following instance types. CPUs:
Machine Type | vCPUs | RAM (GB) | Price $/hour |
---|---|---|---|
cpu (default) | 2 | 4 | Free |
cpu-big | 16 | 32 | $0.68 |
cpu-huge | 72 | 144 | $3.06 |
ram-big | 16 | 128 | $1.01 |
ram-huge | 96 | 768 | $6.05 |
GPUs:
Machine Type | NVIDIA GPU | VRAM (GB) | TFlops | vCPUs | RAM (GB) | Price $/hour |
---|---|---|---|---|---|---|
K80 | 1 x Tesla K80 | 12.0 | 4.3 | 4 | 61 | $0.90 |
T4 | 1 x Tesla T4 | 16.0 | 8.1 | 4 | 16 | $0.526 |
P100 | 1 x Tesla P100 | 16.0 | 9.3 | 4 | 15 | $1.66 |
V100 | 1 x Tesla V100 | 16.0 | 15.7 | 8 | 61 | $3.06 |
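The tables above can be turned into a rough cost estimate for a planned run. This is a sketch under a loud assumption: it treats cost as hourly price multiplied by hours used, which this page does not confirm (billing granularity and minimums are not stated here). The function name estimate_cost is hypothetical; the prices are copied from the tables.

```python
# Hourly prices in USD, copied from the Spell for Developers tables above.
PRICE_PER_HOUR = {
    "cpu": 0.0,  # listed as Free
    "cpu-big": 0.68,
    "cpu-huge": 3.06,
    "ram-big": 1.01,
    "ram-huge": 6.05,
    "K80": 0.90,
    "T4": 0.526,
    "P100": 1.66,
    "V100": 3.06,
}


def estimate_cost(machine_type: str, hours: float) -> float:
    """Estimate run cost, ASSUMING cost = hourly price x hours (not confirmed here)."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return round(PRICE_PER_HOUR[machine_type] * hours, 4)


# Compare a two-hour job on each GPU type.
for gpu in ("K80", "T4", "P100", "V100"):
    print(gpu, estimate_cost(gpu, 2.0))
```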
Mounting code from a local repository
The spell run command is usually executed from within a git repository on your local machine. Spell will automatically git push that code to a GitLab cache. Once a cloud machine is warmed up and ready to execute your code, Spell will git pull the code out of that cache and onto the machine.
You can easily turn any folder on your local machine into a git repository by running the git init command. For example:
$ mkdir project
$ cd project
$ git init
$ echo 'Running on Spell!' > printme.txt
$ git add printme.txt && git commit -m "Initial commit"
$ spell run cat printme.txt
Spell will push all committed and staged changes to the run.
Spell's behavior for uncommitted changes is configurable. By default, Spell will push uncommitted changes to files that have been checked into the repository. If your repository state includes changes to files that have not yet been checked in, you will be prompted interactively to either check those changes in or proceed without them.
Runs making use of uncommitted changes will include an additional Uncommitted Changes field in the run summary card in the web console, which will allow you to download the changes as a diff file (generated using git diff).
To disable pushing uncommitted changes, edit your local machine's Spell config file (located in ~/.spell/ on macOS and Linux, or AppData/Roaming/spell on Windows) and set include_uncommitted to false. Doing so will cause any spell run commands executed in repositories with uncommitted changes (staged or unstaged) to raise an error.
Note that, due to a dependency on rsync, pushing uncommitted changes does not work on Windows or Windows Subsystem for Linux (WSL). As a result, users on Windows machines will always have include_uncommitted set to false (the client ignores the config flag on Windows).
Additionally, note that Spell uses SSH to sync your code. If you require special SSH configuration (e.g. proxy commands for network firewalls), update the ssh_config file in the Spell config directory.
Finally, note that Spell does not support git-lfs or git submodules.
Mounting code from a GitHub repository
Alternatively, you can instruct Spell to pull your code from a GitHub repository using the --github-url flag:
$ spell run \
--github-url https://github.com/spellml/cnn-cifar10 \
"python models/train_basic.py"
This will pull the code from the HEAD of the default branch of the repository on GitHub (typically master or main). To pull a different commit reference, use the --github-ref flag (example: --github-ref ab2718).
--github-url can be used with any public repository out of the box. It can also be used with any private repository you have configured access to using the Spell GitHub integration.
Installing package dependencies
Runs on Spell are executed in a Linux environment with Ubuntu 20.04, Python 3.8, CUDA 11.3, and cuDNN 7.6. The following Python packages are installed by default:
tensorflow==2.6.2
torch==1.10.0+cu113
torchvision==0.11.1+cu113
pytorch-lightning
requests
numpy
pandas
matplotlib
scikit-learn
xgboost
Spell supports the apt, pip, and conda package managers for further environment customization.
Any package reachable from the default set of package indexes shipped with Ubuntu 20.04 can be installed on Spell using --apt:
$ spell run --apt protobuf-compiler "python train.py"
Additional Python pip dependencies can be installed using the --pip option:
$ spell run --pip dask "python train.py"
Use --pip-req to install packages from a pip requirements file:
$ spell run --pip-req requirements.txt "python train.py"
Finally, you can install conda packages by supplying a conda environment file to --conda-file:
$ spell run --conda-file environment.yml "python train.py"
The conda environment name in the run on Spell will be spell.
The --conda-file and --pip-req flags are mutually exclusive. All other flags can be combined as needed. For example:
$ spell run \
--apt libprotobuf-dev --apt protobuf-compiler \
--pip-req requirements.txt --pip "Pillow>=8.3.2" \
"python train.py"
Packages passed to the --pip-req flag are installed first, followed by packages installed using --pip. As a result, if different versions of the same package are specified in both, the version passed to --pip will take precedence.
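The precedence rule above can be modeled as "the last specifier for a package wins". The sketch below is only an illustration of that ordering, not how Spell or pip resolve versions internally; effective_specs is a hypothetical helper and its name parsing is deliberately simplified.

```python
import re
from typing import Dict, Iterable


def _package_name(spec: str) -> str:
    # Simplification: the name is everything before the first version
    # operator, extras bracket, environment marker, or space.
    return re.split(r"[<>=!~\[; ]", spec.strip(), maxsplit=1)[0].lower()


def effective_specs(pip_req_lines: Iterable[str], pip_flags: Iterable[str]) -> Dict[str, str]:
    """Model the install order: requirements file first, then --pip values."""
    final: Dict[str, str] = {}
    for spec in list(pip_req_lines) + list(pip_flags):
        spec = spec.strip()
        # Skip blanks, comments, and option lines such as --index-url.
        if not spec or spec.startswith("#") or spec.startswith("-"):
            continue
        final[_package_name(spec)] = spec  # later entries overwrite earlier ones
    return final


specs = effective_specs(["Pillow==8.0.0", "numpy"], ["Pillow>=8.3.2"])
print(specs["pillow"])
```

Here the Pillow specifier passed via --pip replaces the one from the requirements file, matching the behavior described above.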
The --pip and --pip-req flags may be used to install private Python packages from GitHub which you have configured access to. To learn more, refer to the page Integrating GitHub.
The --conda-file flag does not currently support private Python packages. Pass any private packages you wish to install in a run using a conda environment to the --pip flag instead.
The requirements file passed to --pip-req can be used to specify additional pip installation options such as --find-links or --index-url which Spell's --pip flag does not support. See the requirements file format specification for more information.
Setting environment variables
Environment variables can be passed to a run using the --env flag:
$ spell run --env CUDA_VISIBLE_DEVICES=1 "python example.py"
Environment variables that begin with the prefix SECRET (for example, SECRET_KEY) are considered secrets. Secrets have their contents replaced with <REDACTED> wherever they would be displayed in the web console and the CLI.
To prevent secrets from leaking, re-runs of runs containing secret environment variables will not re-populate the secret environment variable automatically; it will be set to <REDACTED> instead. You will need to set that value yourself again.
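The display rule for secrets can be sketched in a few lines. This illustrates the prefix rule described above and is not Spell's implementation; redact_for_display is a hypothetical name.

```python
from typing import Dict

REDACTED = "<REDACTED>"


def redact_for_display(env: Dict[str, str]) -> Dict[str, str]:
    """Replace the value of every variable whose name starts with SECRET."""
    return {
        name: (REDACTED if name.startswith("SECRET") else value)
        for name, value in env.items()
    }


shown = redact_for_display({"SECRET_KEY": "hunter2", "CUDA_VISIBLE_DEVICES": "1"})
print(shown)
```

Only the displayed copy is altered in this sketch; the original mapping is left untouched.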
Mounting resources
Resources are groups of files (like datasets) stored in object storage (AWS S3, GCP GCS, or Azure Blob Storage) which may be mounted into a run using the --mount option. To use the mount option, specify the resource path to your dataset and the path to mount the dataset to on Spell. For example, the following code snippet mounts a demo audio dataset to the /mnt/audio-data path and prints out its contents from inside of the run.
$ spell run --mount public/audio/css10:/mnt/audio-data "ls /mnt/audio-data"
💫 Casting spell #9…
✨ Stop viewing logs with ^C
✨ Machine_Requested… done
✨ Building… done
✨ Mounting… done
✨ Run is running
chinese-single-speaker-speech-dataset
dutch-single-speaker-speech-dataset
finnish-single-speaker-speech-dataset
french-single-speaker-speech-dataset
german-single-speaker-speech-dataset
greek-single-speaker-speech-dataset
hungarian-single-speaker-speech-dataset
japanese-single-speaker-speech-dataset
korean-single-speaker-speech-dataset
russian-single-speaker-speech-dataset
spanish-single-speaker-speech-dataset
As a best practice, we recommend using the /mnt/ directory as the root directory for your data.
Alternatively, you can choose to omit all or part of the mount path, in which case the default path and/or default name will be used:
$ spell run --mount public/audio/css10:audio-data "python main.py"
$ spell run --mount public/audio/css10 "python main.py"
Resources may be uploaded to Spell, outputted by previous Spell runs, mounted from public buckets, or mounted from private buckets you have configured access to. To learn more about this feature refer to the Resources page.
Saving resources
When you execute a run from within a git repository, Spell will create a /spell/$REPO_NAME/ folder and unpack the contents of your repository into that folder. If you use the --github-url flag instead, those files will land in /spell/. In either case, this folder will be set as your current working directory.
Any files that a Spell run saves within your current working directory will automatically be saved to the runs/$RUNID directory in our virtual filesystem, SpellFS, at the end of the run. These files will then be available to browse, download, or mount into a different run. For example:
$ spell run "echo hello world! > foo.txt"
$ spell run --mount runs/435/foo.txt:foo.txt "cat foo.txt"
hello world!
Additionally, files saved to a mounted directory inside of a Spell run also get backed up.
To learn more about resources see the page Resources.
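Because files written under the current working directory are saved with the run, a training script can persist its results using a relative path. A minimal sketch follows; save_metrics and the file name metrics.json are hypothetical choices, not Spell conventions.

```python
import json
from pathlib import Path
from typing import Optional


def save_metrics(metrics: dict, directory: Optional[Path] = None,
                 filename: str = "metrics.json") -> Path:
    """Write metrics as JSON; defaults to the current working directory."""
    out_dir = Path.cwd() if directory is None else Path(directory)
    out_path = out_dir / filename
    out_path.write_text(json.dumps(metrics, indent=2))
    return out_path
```

Calling save_metrics({"val_acc": 0.91}) at the end of a training script would leave metrics.json in the working directory, where, per the description above, it would be saved alongside the run's other outputs.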
Interrupting a run
A run can be stopped or killed at any time.
$ spell stop $RUN_ID
$ spell kill $RUN_ID
Stopping a run will send a graceful shutdown signal: the run will save all of its contents to disk before exiting, and all of the files that the run has generated so far (model checkpoints, for example) will be available in the SpellFS.
Killing a run will send a hard shutdown signal. The machine will be interrupted and the run will be killed without being given the opportunity to save to disk.
We recommend only killing runs whose output you are sure you will not need, for example, runs whose code has a bug. Machines whose runs get killed are recycled faster because they do not have to upload any data to object storage, which is useful when you are debugging. For long-lived runs, stopping the run is usually much more appropriate.
Viewing run logs
Spell automatically captures anything your run outputs to the STDOUT stream (for example, print statements in Python) or the STDERR stream (for example, unhandled exceptions and warnings messages in Python). These logs are persisted to object storage for long-term storage.
To view the logs for a run using the Spell CLI, use the spell logs $RUN_ID command. To view logs using the web console, visit the run details page and scroll to the bottom.
Note
Logs are rate-limited: you may log no more than 1000 loglines within a 10 second rolling interval per run. Additional loglines above this limit will be dropped. If you need to log a lot of lines at once, we recommend outputting that data to a file instead.
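One way to stay under this limit is to throttle logging inside your own code, so that you know how many lines were held back instead of having them dropped silently. The sketch below uses a rolling window with the limits quoted above as defaults. ThrottledLogger is a hypothetical class, and how Spell itself accounts for the window is not specified here, so leaving some headroom is prudent.

```python
import time
from collections import deque
from typing import Callable, Deque


class ThrottledLogger:
    """Print at most max_lines lines per rolling window_s seconds; count the rest."""

    def __init__(self, max_lines: int = 1000, window_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_lines = max_lines
        self.window_s = window_s
        self.clock = clock
        self.sent: Deque[float] = deque()  # timestamps of recently printed lines
        self.suppressed = 0

    def log(self, line: str) -> bool:
        now = self.clock()
        # Forget timestamps that have left the rolling window.
        while self.sent and now - self.sent[0] >= self.window_s:
            self.sent.popleft()
        if len(self.sent) >= self.max_lines:
            self.suppressed += 1  # over budget: count the line instead of printing it
            return False
        self.sent.append(now)
        print(line)
        return True
```

The clock parameter exists so the limiter can be tested with a fake clock; in a real script the default time.monotonic is sufficient.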
Diagnosing run failures
When a run fails, the reason why is usually obvious: just check the run logs for an error message.
However, runs can sometimes fail in ways that do not print errors to logs.
If a run uses a custom Docker image (via the --docker-image flag), and the Docker image does not satisfy Spell's requirements for such images, the run will fail during the "build" step, possibly without error logs. For details, refer to the blog post "Using custom Docker images in Spell runs and workspaces". To fix this error, modify your custom Docker image to satisfy these requirements.
If a run runs out of resources (disk, memory, or CPU) during execution, the OS kernel is very likely to kill the run process without logging an explicit error. The Docker container running your code will exit with exit code 137 (if the kernel sends a SIGKILL) or 143 (if the kernel sends a SIGTERM). The exit code Spell received from the container is displayed in the "Status" field on the run details card on the web console. If this value is Complete (137) or Complete (143), you know that the run container was killed by the OS kernel, almost assuredly due to resource exhaustion. To fix this error, either reduce the resource utilization of your code or move your run to more powerful hardware.
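The codes 137 and 143 follow the common convention that a process terminated by signal N exits with code 128 + N (SIGKILL is 9, SIGTERM is 15). A small sketch for decoding such codes on a POSIX system follows; describe_exit_code is a hypothetical helper name.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a human-readable description."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128  # convention: 128 + signal number
        try:
            name = signal.Signals(signum).name
        except ValueError:
            return f"exit code {code} (unrecognized signal number {signum})"
        return f"terminated by {name} (128 + {signum})"
    return f"exited with error code {code}"


for code in (0, 1, 137, 143):
    print(code, "->", describe_exit_code(code))
```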
Very rarely, the underlying cloud machine will experience a hardware failure that puts the instance in a degraded state. For customers running on GCP, GPU instances may additionally be reclaimed by Google during host maintenance events. When this happens, the cloud provider will almost immediately reclaim the machine. Unfortunately, there is currently no easy way to tell that this has happened using Spell alone; you will need to visit your cloud provider's console and check on the instance there.
(Advanced) Using custom public Docker images
If you have a use case that requires environment configuration deeper than what the --apt, --pip, and --conda-file flags can provide, you can choose to provide Spell a custom Docker image instead using the --docker-image flag. This parameter takes a container image URL as input, e.g. a URL in the form <domain>/<repository>/<image_name>:<tag>. The domain and repository parts of the path are both optional; if you omit them, public DockerHub (https://hub.docker.com/_/) will be used. For example:
$ spell run -f --docker-image 'python:latest' \
'python -c "print(\"Hello World\")"'
The --docker-image flag can be combined with other environment configuration flags as needed.
Custom Docker images must satisfy certain requirements imposed by the Spell image build process. For details, refer to the blog post "Using custom Docker images in Spell runs and workspaces".
(Advanced) Using custom private Docker images
Note
For a demonstration of how you can use this feature to customize your workspace, refer to our blog post: "Using custom Docker images in Spell runs and workspaces".
Users on the community edition must pull from a public registry. Users on Spell for Teams may pull from private Amazon Elastic Container Registry (ECR), Google Container Registry (GCR), or Azure Container Registry (ACR) instead.
The --docker-image flag on the spell run command can be used to initialize a run environment using an image pulled from a Docker registry. Teams users that have configured their own cluster can use the private registry of the cluster's cloud provider using the spell cluster add-docker-registry command.
If your cluster is on AWS, for example, you can use the command to add one of the ECR repos currently available in that AWS account. This will update the IAM permissions that Spell uses to control your cluster, allowing it access to the repo. For Azure clusters, specify the name of the blob container in your storage account using the --repo option.
$ spell cluster add-docker-registry
This command will
- Allow Spell to get authorization tokens to access your docker registry
- If no repository is specified, list your repositories in the registry
- Add read permissions for that repository to the IAM role associated with the cluster
All of this will be done with your AWS profile 'default' which has Access Key ID 'ABCDEFGHIJKL' and region 'us-west-2' - continue? [y/N]: y
Spell does not yet have access to the following repos found in your AWS account:
- image-processing
- text-generation
- image-gan
Please choose a repository: text-generation
...
Successfully added read permission to text-generation
Once set up, you can use the --docker-image argument to spell run to specify a Docker image pushed to the private ECR repository.
The registry permissions can be removed at any time with spell cluster delete-docker-registry.
Private custom Docker images, like public custom Docker images, are subject to certain requirements imposed by the Spell image build process. For details, refer to the blog post "Using custom Docker images in Spell runs and workspaces".
(Advanced) Installing packages from the local environment
You may have Spell inherit its packages from the local virtual environment you execute the spell run command within. This is done using the --deps-from-env flag:
$ spell run --deps-from-env "python train.py"
If you run this command from within a Virtualenv, Pipenv, or Poetry environment, the pip-chill module will be used to marshal an environment definition. If you run this command from within a Conda environment, the output of the conda env export --from-history command will be used instead.
Packages installed this way will be overwritten by packages installed using the --pip, --pip-req, and --conda-file flags.
(Advanced) Specifying an early stopping condition
Early stopping is the technique of ending a training run early, before the model training process has reached its maximum number of epochs, based on lack of improvement in the model accuracy.
We support a form of early stopping directly in the CLI using the --stop-condition flag. For example, --stop-condition "keras/val_acc < 0.5 : 10" will stop model training if the run doesn't reach a validation accuracy of 50 percent by the tenth time this metric is logged (e.g. by the tenth epoch, if you are logging once per epoch), or if the validation accuracy metric dips below 50 percent at any subsequent check.
We support the <, >, <=, and >= operators. The second parameter is optional. If left unspecified, the early stopping check will be performed every single time the training metric is logged.
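To make the semantics concrete, here is a sketch of how a condition string of this shape could be parsed and evaluated. It reflects one reading of the prose above (the check starts once the metric has been logged the given number of times, and applies on every log if that number is omitted) and is not Spell's implementation. StopCondition, parse_stop_condition, and should_stop are hypothetical names.

```python
import operator
import re
from typing import NamedTuple, Sequence

_OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le, ">=": operator.ge}
# "<metric> <op> <threshold>" optionally followed by ": <count>".
_PATTERN = re.compile(r"\s*([^\s<>=:]+)\s*(<=|>=|<|>)\s*([^\s:]+)\s*(?::\s*(\d+)\s*)?")


class StopCondition(NamedTuple):
    metric: str
    op: str
    threshold: float
    min_logs: int  # number of logged values before the check starts applying


def parse_stop_condition(text: str) -> StopCondition:
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized stop condition: {text!r}")
    metric, op, threshold, min_logs = match.groups()
    # Assumption: omitting the second parameter means "check on every log".
    return StopCondition(metric, op, float(threshold), int(min_logs) if min_logs else 1)


def should_stop(cond: StopCondition, logged_values: Sequence[float]) -> bool:
    """Return True if the latest logged value triggers the condition."""
    if len(logged_values) < cond.min_logs:
        return False  # too early: the check has not started yet
    return _OPS[cond.op](logged_values[-1], cond.threshold)


cond = parse_stop_condition("keras/val_acc < 0.5 : 10")
print(cond)
```

With this reading, nine logged values never trigger a stop, while a tenth value below 0.5 does.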
(Advanced) Run states
Runs transition through the following states:
- Requested: the run has been created and is queued for execution. Typically, a run will transition through this state very quickly. However, a run could stay Requested for an extended period of time for two reasons:
  - You already have the maximum number of concurrent runs executing that is allowed per your plan. If this is the case, your run will transition to Building when the number of concurrent runs you have falls below the limit.
  - Spell received a number of runs at the same time and is starting up more machines to execute your run. If this is the case, your run will transition to Building as soon as a machine is ready.
- Building: the environment for your run is created. This includes installing any dependencies that are specified (e.g. --pip or --apt parameters provided on the command line to create the run) and copying the code from your Git repository into the run. First we check to see if the run uses an environment from a previous run, and if so we use the cached environment to minimize build time in lieu of building the environment again.
- Mounting: the resources (if any) that were specified are mounted into the run. (See the Resources section for more information).
- Running: your command is executing!
- Saving: any new or modified files from your command are saved as a resource into runs/
- Pushing: your environment is pushed to our cache of environments to expedite a future run with the same environment. Usually, your run will transition through this step very quickly since this push happens asynchronously in the background throughout every other state. The push is usually completed by the time this state is reached unless your run is very quick.
All runs eventually transition to one of the following final states:
- Complete: this is the normal state for a run to transition to upon completion.
- Killed: the run was killed using the spell kill command. When a run is killed, it does not transition through any subsequent steps and is immediately terminated.
- Stopped: the run was stopped using the spell stop command. A stopped run still transitions through the Saving and Pushing states, however its Running state is immediately exited whenever the spell stop command is issued.
- Failed: the run did not complete due to an error in some other state.
- Interrupted: the run did not complete because it was executing on a spot instance that got reclaimed (this state is only possible when using spot instances on Spell for Teams).
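The states listed above can be summarized as a small state machine. The sketch below models only the normal progression and the set of final states as described on this page; RunState and next_state are hypothetical names, and the Spell client may represent states differently.

```python
from enum import Enum


class RunState(Enum):
    REQUESTED = "Requested"
    BUILDING = "Building"
    MOUNTING = "Mounting"
    RUNNING = "Running"
    SAVING = "Saving"
    PUSHING = "Pushing"
    COMPLETE = "Complete"
    KILLED = "Killed"
    STOPPED = "Stopped"
    FAILED = "Failed"
    INTERRUPTED = "Interrupted"


# Order of the in-progress states for a run that completes normally.
ACTIVE_ORDER = [
    RunState.REQUESTED, RunState.BUILDING, RunState.MOUNTING,
    RunState.RUNNING, RunState.SAVING, RunState.PUSHING,
]
FINAL_STATES = frozenset({
    RunState.COMPLETE, RunState.KILLED, RunState.STOPPED,
    RunState.FAILED, RunState.INTERRUPTED,
})


def next_state(state: RunState) -> RunState:
    """Return the next state of a run that proceeds without interruption."""
    if state in FINAL_STATES:
        raise ValueError(f"{state.value} is a final state")
    index = ACTIVE_ORDER.index(state)
    if index + 1 < len(ACTIVE_ORDER):
        return ACTIVE_ORDER[index + 1]
    return RunState.COMPLETE  # Pushing is followed by Complete


# Walk the normal path from Requested to Complete.
path = [RunState.REQUESTED]
while path[-1] not in FINAL_STATES:
    path.append(next_state(path[-1]))
print(" -> ".join(s.value for s in path))
```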