Runs are computational model training jobs executed on the cloud using Spell.
Runs are created using the spell run command. The simplest possible run is spell run "echo Hello World", which prints "Hello World" in the terminal:
$ spell run "echo Hello World"
Enumerating objects: 45, done.
Counting objects: 100% (45/45), done.
Delta compression using up to 12 threads
Compressing objects: 100% (38/38), done.
Writing objects: 100% (44/44), 2.52 MiB | 5.60 MiB/s, done.
Total 44 (delta 11), reused 0 (delta 0)
To git.spell.ml:spell-org/15af21ee5b9f3e47e817f886fb8d262a0cf6a57b.git
 * [new branch] HEAD -> br_1b451cbb90e1eb1833a8938f37ec16ac4fef2d60
💫 Casting spell #1577…
✨ Stop viewing logs with ^C
✨ Machine_Requested… done
✨ Building… done
✨ Run is running
Hello World
✨ Saving… done
✨ Pushing… done
🎉 Total run time: 6.870837s
🎉 Run 1577 complete
This run went through several well-defined steps:
- This command was run from within a git repository. The client uploaded the contents of this repository to Spell.
- The run orchestrator assigned this run an ID.
- The run orchestrator provisioned a cloud machine and assigned it to this run.
- The machine built the run environment (in a Docker container) and copied over the code repository.
- The machine executed the run entrypoint, which printed Hello World and then exited.
- The machine saved any files written to disk during the run to object storage.
- The machine pushed the run environment to the Spell Docker cache, then went into an idle state.
Spell handles all of the work of provisioning, orchestrating, and releasing cloud machines so that you don't have to. By caching both code and environment image we minimize build times, and by briefly holding machines in an idle state after the run is complete we minimize machine startup time upon repeated use.
Package and environment dependencies are passed using the
--apt, --pip, --pip-req, and --conda-file flags. Datasets are passed using the
--mount flag, which takes an object storage path as input. The machine type used for the run is set using the
--machine-type flag. Here's a more realistic example showing all of these things in action:
$ spell run \
    --machine-type T4 \
    --mount s3://data-bucket/dataset/:/mnt/train/ \
    --apt ffmpeg \
    --pip moviepy \
    "python style.py --train-path /mnt/train/"
In this example we:
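- Specified that we want a T4 GPU machine with the --machine-type flag.
- Mounted the contents of s3://data-bucket/dataset/ to /mnt/train/ on disk using the --mount flag.
- Added the ffmpeg apt package using the --apt flag.
- Added the moviepy pip package using the --pip flag.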
These are just some of the features that Spell runs support. The rest of this page will discuss runs in more detail.
Selecting a machine type
Runs on Spell are executed on virtual machines running in the cloud (specifically, an EC2 instance on AWS, a GCE VM on GCP, or an Azure Compute VM on Azure). The
--machine-type flag on the
spell run command allows you to select the machine type the run will execute on. For example:
$ spell run --machine-type T4 'python example.py'
If no --machine-type is provided, the command will default to the basic cpu machine type.
The list of machine types you have access to depends on your plan. Users on the Spell for Teams or Spell for Enterprise plans can define and use their own machine types. Spell supports machines ranging from simple CPU instances all the way up to latest-gen NVIDIA A100s. See the page Cluster Machine Type Management for more details and a list of supported hardware.
Users of Spell for Developers are limited to the following instance types.
CPUs:
| Machine Type | vCPUs | RAM (GB) | Price ($/hour) |
GPUs:
| Machine Type | NVIDIA GPU | VRAM (GB) | TFlops | vCPUs | RAM (GB) | Price ($/hour) |
| | 1 x Tesla K80 | 12.0 | 4.3 | 4 | 61 | $0.90 |
| | 1 x Tesla T4 | 16.0 | 8.1 | 4 | 16 | $0.526 |
| | 1 x Tesla P100 | 16.0 | 9.3 | 4 | 15 | $1.66 |
| | 1 x Tesla V100 | 16.0 | 15.7 | 8 | 61 | $3.06 |
Mounting code from a local repository
The spell run command is usually executed from within a
git repository on your local machine. Spell will automatically
git push that code to a GitLab cache. Once a cloud machine is warmed up and ready to execute your code, Spell will
git pull the code out of that cache and onto the machine.
You can easily turn any folder on your local machine into a
git repository by running the
git init command. For example:
$ mkdir project
$ cd project
$ git init
$ echo 'Running on Spell!' > printme.txt
$ git add printme.txt && git commit -m "Initial commit"
$ spell run cat printme.txt
Spell will push all committed and staged changes to the run.
Spell's behavior for uncommitted changes is configurable. By default, Spell will push uncommitted changes to files that have been checked into the repository. If your repository state includes changes to files that have not yet been checked in, you will be prompted interactively to either check those changes in or proceed without them.
Runs making use of uncommitted changes will include an additional
Uncommitted Changes field in the run summary card in the web console which will allow you to download the changes as a diff file (generated using
To disable pushing uncommitted changes, edit your local machine's Spell
config file (located in
~/.spell/ on macOS and Linux, or
AppData/Roaming/spell on Windows) and set
include_uncommitted to false. Doing so will cause any
spell run commands executed in repositories with uncommitted changes (staged or unstaged) to raise an error.
Note that, due to a dependency on
rsync, pushing uncommitted changes does not work on Windows or Windows Subsystem for Linux (WSL). As a result, users on Windows machines will always have
include_uncommitted set to
false (the client ignores the
config flag on Windows).
Additionally, note that Spell uses SSH to sync your code. If you require special SSH configuration (e.g. proxy commands for network firewalls), update the
ssh_config file in the Spell config directory.
Mounting code from a GitHub repository
Alternatively, you can instruct Spell to pull your code from a GitHub repository using the --github-url flag:
$ spell run \
    --github-url https://github.com/spellml/cnn-cifar10 \
    "python models/train_basic.py"
This will pull the code from the
HEAD of the default branch of the repository on GitHub (typically
main). To pull a different commit reference, use the
--github-ref flag (example:
--github-url can be used with any public repository out of the box. It can also be used with any private repository you have configured access to using the Spell GitHub integration.
Installing package dependencies
Runs on Spell are executed in a Linux environment with Ubuntu 20.04, Python 3.8, CUDA 11.3, and cuDNN 7.6. The following Python packages are installed by default:
tensorflow==2.6.2
torch==1.10.0+cu113
torchvision==0.11.1+cu113
pytorch-lightning
requests
numpy
pandas
matplotlib
scikit-learn
xgboost
Spell supports the apt, pip, and conda package managers for further environment customization.
Any package reachable from the default set of package indexes shipped with Ubuntu 20.04 can be installed on Spell using the --apt flag:
$ spell run --apt protobuf-compiler "python train.py"
pip dependencies can be installed using the --pip flag:
$ spell run --pip dask "python train.py"
Use --pip-req to install packages from a pip requirements file:
$ spell run --pip-req requirements.txt "python train.py"
Finally, you can install conda packages by supplying a conda requirements file to --conda-file:
$ spell run --conda-file environment.yml "python train.py"
conda environment name in the run on Spell will be
The --conda-file and --pip-req flags are mutually exclusive. All other flags can be combined as needed. For example:
$ spell run \
    --apt libprotobuf-dev --apt protobuf-compiler \
    --pip-req requirements.txt --pip "Pillow>=8.3.2" \
    "python train.py"
Packages passed to the
--pip-req flag are installed first, followed by packages installed using
--pip. As a result, if different versions of the same package are specified in both, the version passed to
--pip will take precedence.
The --pip and --pip-req flags may be used to install private Python packages from GitHub which you have configured access to. To learn more, refer to the page Integrating GitHub.
The --conda-env flag does not currently support private Python packages. Pass any private packages you wish to install in a run using a
conda environment to the
--pip flag instead.
The requirements file passed to
--pip-req can be used to specify additional
pip installation options such as
--index-url which Spell's
--pip flag does not support. See the requirements file format specification for more information.
Setting environment variables
Environment variables can be passed to a run using the --env flag:
$ spell run --env CUDA_VISIBLE_DEVICES=1 "python example.py"
Environment variables that begin with the prefix
SECRET (for example,
SECRET_KEY) are considered secrets. Secrets have their contents replaced with
<REDACTED> wherever they would be displayed in the web console and the CLI.
To prevent secrets from leaking, re-runs of runs containing secret environment variables will not re-populate the secret environment variable automatically—it will be set to
<REDACTED> instead. You will need to set that value yourself again.
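The prefix rule for secrets can be illustrated with a short Python sketch. This is only an illustration of the behavior described above, not Spell's implementation; the `redact_env` helper and the sample values are hypothetical.

```python
REDACTED = "<REDACTED>"


def redact_env(env):
    """Return a copy of env with secret values masked.

    A variable is treated as a secret when its name starts with "SECRET",
    mirroring the prefix rule described above (hypothetical helper).
    """
    return {
        name: REDACTED if name.startswith("SECRET") else value
        for name, value in env.items()
    }


if __name__ == "__main__":
    # Sample values are made up for illustration.
    env = {"CUDA_VISIBLE_DEVICES": "1", "SECRET_KEY": "hunter2"}
    for name, value in redact_env(env).items():
        print(f"{name}={value}")
```

Inside the run itself the real value is still available to your code through the environment; the masking applies to where the value would be displayed.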
Resources are groups of files (like datasets) stored in object storage (AWS S3, GCP GCS, or Azure Blob Storage) which may be mounted into a run using the
--mount option. To use the mount option, specify the resource path to your dataset and the path to mount the dataset to on Spell. For example, the following code snippet mounts a demo audio dataset to the
/mnt/audio-data path and prints out its contents from inside of the run.
$ spell run --mount public/audio/css10:/mnt/audio-data "ls /mnt/audio-data"
💫 Casting spell #9…
✨ Stop viewing logs with ^C
✨ Machine_Requested… done
✨ Building… done
✨ Mounting… done
✨ Run is running
chinese-single-speaker-speech-dataset
dutch-single-speaker-speech-dataset
finnish-single-speaker-speech-dataset
french-single-speaker-speech-dataset
german-single-speaker-speech-dataset
greek-single-speaker-speech-dataset
hungarian-single-speaker-speech-dataset
japanese-single-speaker-speech-dataset
korean-single-speaker-speech-dataset
russian-single-speaker-speech-dataset
spanish-single-speaker-speech-dataset
As a best practice, we recommend using the
/mnt/ directory as the root directory for your data.
Alternatively, you can choose to omit all or part of the mount path, in which case the default path and/or default name will be used:
$ spell run --mount public/audio/css10:audio-data "python main.py"
$ spell run --mount public/audio/css10 "python main.py"
Resources may be uploaded to Spell, outputted by previous Spell runs, mounted from public buckets, or mounted from private buckets you have configured access to. To learn more about this feature refer to the Resources page.
When you execute a run from within a local git repository, Spell will create a
/spell/$REPO_NAME/ folder and unpack the contents of your repository into that folder. If you use the
--github-url flag instead, those files will land in
/spell/. In either case, this folder will be set as your current working directory.
Any files that a Spell run saves within your current working directory will automatically be saved to the
runs/$RUNID directory in our virtual filesystem, SpellFS, at the end of the run. These files will then be available to browse, download, or mount into a different run. For example:
$ spell run "echo hello world! > foo.txt"
$ spell run --mount runs/435/foo.txt:foo.txt "cat foo.txt"
hello world!
Additionally, files saved to a mounted directory inside of a Spell run also get backed up.
To learn more about resources see the page Resources.
Interrupting a run
A run can be stopped or killed at any time.
$ spell stop $RUN_ID
$ spell kill $RUN_ID
Stopping a run will send a graceful shutdown signal: the run will save all of its contents to disk before exiting, and all of the files that the run has generated so far (model checkpoints, for example) will be available in the SpellFS.
Killing a run will send a hard shutdown signal. The machine will be interrupted and the run will be killed without being given the opportunity to save to disk.
We recommend only killing runs whose output you are sure you will not need, for example, runs whose code has a bug. Machines whose runs get killed are recycled faster because they do not have to upload any data to object storage, which is useful when you are debugging. For long-lived runs, stopping the run is usually much more appropriate.
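Because a stopped run keeps the files it has generated so far, stopping is most useful when the training script writes checkpoints as it goes. The following is a minimal sketch of that pattern, assuming a toy training loop; the `train` function, the `checkpoints` directory name, and the file layout are all hypothetical.

```python
import json
from pathlib import Path


def train(num_epochs, checkpoint_dir=Path("checkpoints"), every=1):
    """Toy training loop that writes a checkpoint every `every` epochs.

    The default directory is relative, so files land under the current
    working directory of the run.
    """
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for epoch in range(1, num_epochs + 1):
        loss = 1.0 / epoch  # stand-in for a real training step
        if epoch % every == 0:
            path = checkpoint_dir / f"epoch_{epoch:04d}.json"
            path.write_text(json.dumps({"epoch": epoch, "loss": loss}))
            written.append(path)
    return written


if __name__ == "__main__":
    for path in train(num_epochs=5, every=2):
        print("wrote", path)
```

With this structure, whatever checkpoints exist at the moment the run is stopped are the ones that get saved; a killed run would lose them.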
Viewing run logs
Spell automatically captures anything your run writes to the STDOUT and STDERR streams (for example, unhandled exceptions and warnings messages in Python, which are written to STDERR). These logs are persisted to object storage for long-term storage.
To view the logs for a run using the Spell CLI, use the
spell logs $RUN_ID command. To view logs using the web console, visit the run details page and scroll to the bottom.
Logs are rate-limited: you may log no more than 1000 loglines within a 10 second rolling interval per run. Additional loglines above this limit will be dropped. If you need to log a lot of lines at once, we recommend outputting that data to a file instead.
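One way to stay under the limit is to count lines on the client side and divert the overflow to a file. The sketch below assumes a simple sliding-window count; how Spell itself measures the rolling interval is not specified here, and the `LogBudget` class and `log` helper are hypothetical names.

```python
import time
from collections import deque


class LogBudget:
    """Sliding-window counter: at most max_lines per window seconds."""

    def __init__(self, max_lines=1000, window=10.0, clock=time.monotonic):
        self.max_lines = max_lines
        self.window = window
        self.clock = clock
        self._stamps = deque()

    def allow(self):
        """Return True if one more line fits in the current window."""
        now = self.clock()
        # Forget lines that have aged out of the rolling window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) < self.max_lines:
            self._stamps.append(now)
            return True
        return False


def log(line, budget, overflow_file):
    """Print while under budget; otherwise append the line to overflow_file."""
    if budget.allow():
        print(line)
    else:
        overflow_file.write(line + "\n")
```

In practice you would probably set `max_lines` somewhat below 1000 to leave headroom for lines printed by libraries you do not control.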
Diagnosing run failures
When a run fails, the reason is usually obvious: just check the run logs for an error message.
However, runs can sometimes fail in ways that do not print errors to logs.
If a run uses a custom Docker image (via the
--docker-image flag), and the Docker image does not satisfy Spell's requirements for such images, the run will fail during the "build" step, possibly without error logs. For details, refer to the blog post "Using custom Docker images in Spell runs and workspaces". To fix this error, modify your custom Docker image to satisfy these requirements.
If a run runs out of resources (disk, memory, or CPU) during execution, the OS kernel is very likely to kill the run process without logging an explicit error. The Docker container running your code will exit with exit code
137 (if the kernel sends a SIGKILL) or
143 (if the kernel sends a
SIGTERM). The exit code Spell received from the container is displayed in the "Status" field on the run details card on the web console. If this value is
Complete (137) or
Complete (143), you know that the run container was killed by the OS kernel, almost assuredly due to resource exhaustion. To fix this error, either reduce the resource utilization of your code or move your run to more powerful hardware.
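The two exit codes follow the common convention that a process terminated by a signal reports 128 plus the signal number. A small Python sketch of that arithmetic, assuming a POSIX system (the `describe_exit_code` helper is a hypothetical name):

```python
import signal


def describe_exit_code(code):
    """Map a container exit code to the name of the signal that ended it.

    Returns None when the code does not correspond to a known signal.
    """
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return None
    return None


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, describe_exit_code(code))
```

Here 137 is 128 + 9 (SIGKILL) and 143 is 128 + 15 (SIGTERM).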
Very rarely, the underlying cloud machine will experience a hardware failure that puts the instance in a degraded state. For customers running on GCP, GPU instances may additionally be reclaimed by Google during host maintenance events. When this happens, the cloud provider will almost immediately reclaim the machine. Unfortunately, there is currently no easy way to tell that this has happened using Spell alone; you will need to visit your cloud provider's console and check up on the instance there.
(Advanced) Using custom public Docker images
If you have a use case that requires environment configuration deeper than what the --apt, --pip, and --conda-file flags can provide, you can choose to provide Spell a custom Docker image instead using the --docker-image flag. This parameter takes a container image URL as input, e.g. a URL in the form <domain>/<repository>/<image_name>:<tag>. The domain and repository parts of the path are both optional; if you omit them, public DockerHub (https://hub.docker.com/_/) will be used. For example:
$ spell run -f --docker-image 'python:latest' \
    'python -c "print(\"Hello World\")"'
The --docker-image flag can be combined with other environment configuration flags as needed.
Custom Docker images must satisfy certain requirements imposed by the Spell image build process. For details, refer to the blog post "Using custom Docker images in Spell runs and workspaces".
(Advanced) Using custom private Docker images
For a demonstration of how you can use this feature to customize your workspace, refer to our blog post: "Using custom Docker images in Spell runs and workspaces".
Users on the community edition must pull from a public registry. Users on Spell for Teams may pull from private Amazon Elastic Container Registry (ECR), Google Container Registry (GCR), or Azure Container Registry (ACR) instead.
The --docker-image flag on the
spell run command can be used to initialize a run environment using an image pulled from a Docker registry. Teams users that have configured their own cluster can use the private registry of the cluster's cloud provider using the
spell cluster add-docker-registry command.
If your cluster is on AWS for example, you can use the command to add one of the ECR repos currently available in that AWS account. This will update the IAM permissions that Spell uses to control your cluster allowing it access to the repo. For Azure clusters, specify the name of the blob container in your storage account using the
$ spell cluster add-docker-registry
This command will
- Allow Spell to get authorization tokens to access your docker registry
- If no repository is specified, list your repositories in the registry
- Add read permissions for that repository to the IAM role associated with the cluster
All of this will be done with your AWS profile 'default' which has Access Key ID 'ABCDEFGHIJKL' and region 'us-west-2' - continue? [y/N]: y
Spell does not yet have access to the following repos found in your AWS account:
- image-processing
- text-generation
- image-gan
Please choose a repository: text-generation
...
Successfully added read permission to text-generation
Once set up, you can use the
--docker-image argument to
spell run to specify a Docker image pushed to the private ECR repository.
The registry permissions can be removed at any time with
spell cluster delete-docker-registry.
Private custom Docker images, like public custom Docker images, are subject to certain requirements imposed by the Spell image build process. For details, refer to the blog post "Using custom Docker images in Spell runs and workspaces".
(Advanced) Installing packages from the local environment
You may have Spell inherit its packages from the local virtual environment you execute the
spell run command within. This is done using the --deps-from-env flag:
$ spell run --deps-from-env "python train.py"
If you run this command from within a Virtualenv, Pipenv, or Poetry environment, the pip-chill module will be used to marshall an environment definition. If you run this command from within a Conda environment, the output of the
conda env export --from-history command will be used instead.
Packages installed this way will be overwritten by packages installed using the
(Advanced) Specifying an early stopping condition
Early stopping is the technique of ending a training run early, before the model training process has reached its maximum number of epochs, based on lack of improvement in the model accuracy.
We support a form of early stopping directly in the CLI using the
--stop-condition flag. For example,
--stop-condition "keras/val_acc < 0.5 : 10" will stop model training if the run doesn't reach a validation accuracy of 50 percent by the tenth time this metric is logged (e.g. by the tenth epoch, if you are logging once per epoch), or if the validation accuracy metric dips below 50 percent in subsequent breakpoints.
>= operators. The second parameter is optional. If left unspecified, the early stopping check will be performed every single time the training metric is logged.
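To make the semantics of a stop condition concrete, here is a rough Python sketch of how a string such as "keras/val_acc < 0.5 : 10" could be parsed and evaluated. It reflects one reading of the description above and is not Spell's implementation; the function names, the regular expression, and the exact set of four comparison operators are assumptions.

```python
import operator
import re

# Assumption: these four operators. The text above only shows "<" in its
# example and names ">=" explicitly.
OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}

PATTERN = re.compile(
    r"^\s*(?P<metric>[^\s<>=]+)\s*(?P<op><=|>=|<|>)\s*"
    r"(?P<value>[-+]?\d*\.?\d+)(?:\s*:\s*(?P<start>\d+))?\s*$"
)


def parse_stop_condition(text):
    """Split a condition string into (metric, comparison, threshold, start)."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized stop condition: {text!r}")
    # With no second parameter, checking starts at the first logged value.
    start = int(match["start"]) if match["start"] else 1
    return match["metric"], OPERATORS[match["op"]], float(match["value"]), start


def should_stop(text, logged_values):
    """True if the condition holds for any value from the start-th one onward."""
    _, compare, threshold, start = parse_stop_condition(text)
    return any(compare(v, threshold) for v in logged_values[start - 1:])
```

Under this reading, nine low accuracy values in a row do not trigger a stop for the example condition, but a tenth value below 0.5 does.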
(Advanced) Run states
Runs transition through the following states:
- Requested: the run has been created, and is queued for execution. Typically, a run will transition through this state very quickly. However, a run could stay Requested for an extended period of time for two reasons:
  - You already have the maximum number of concurrent runs executing that is allowed per your plan. If this is the case, your run will transition to Building when the number of concurrent runs you have falls below the limit.
  - Spell received a number of runs at the same time and is starting up more machines to execute your run. If this is the case, your run will transition to Building as soon as a machine is ready.
- Building: the environment for your run is created. This includes installing any dependencies that are specified (e.g. --apt parameters provided on the command line to create the run) and copying the code from your Git repository into the run. First we check whether the run uses an environment from a previous run, and if so we use the cached environment instead of building it again, which minimizes build time.
- Mounting: the resources (if any) that were specified are mounted into the run. (See the Resources section for more information).
- Running: your command is executing!
- Saving: any new or modified files from your command are saved as a resource into
- Pushing: your environment is pushed to our cache of environments to expedite a future run with the same environment. Usually, your run will transition through this step very quickly since this push happens asynchronously in the background throughout every other state. The push is usually completed by the time this state is reached unless your run is very quick.
All runs eventually transition to one of the following final states:
- Complete: this is the normal state for a run to transition to upon completion.
- Killed: the run was killed using the spell kill command. When a run is killed, it does not transition through any subsequent steps and is immediately terminated.
- Stopped: the run was stopped using the spell stop command. A stopped run still transitions through the Saving and Pushing states; however, its Running state is exited immediately when the spell stop command is issued.
- Failed: the run did not complete due to an error in some other state.
- Interrupted: the run did not complete because it was executing on a spot instance that got reclaimed (this state is only possible when using spot instances on Spell for Teams).
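The state descriptions above can be summarized as a small state machine. The Python sketch below is an illustrative model only: the `RunState` enum and `next_states` function are hypothetical, and it assumes that a stopped run still passes through Saving and Pushing before reaching Stopped.

```python
from enum import Enum


class RunState(Enum):
    REQUESTED = "Requested"
    BUILDING = "Building"
    MOUNTING = "Mounting"
    RUNNING = "Running"
    SAVING = "Saving"
    PUSHING = "Pushing"
    COMPLETE = "Complete"
    KILLED = "Killed"
    STOPPED = "Stopped"
    FAILED = "Failed"
    INTERRUPTED = "Interrupted"


# Order in which an uninterrupted run moves through its states.
PIPELINE = [
    RunState.REQUESTED,
    RunState.BUILDING,
    RunState.MOUNTING,
    RunState.RUNNING,
    RunState.SAVING,
    RunState.PUSHING,
]

FINAL_STATES = {
    RunState.COMPLETE,
    RunState.KILLED,
    RunState.STOPPED,
    RunState.FAILED,
    RunState.INTERRUPTED,
}


def next_states(current, command=None):
    """Return the states that follow `current`.

    command is None for an uninterrupted run, or "kill" / "stop".
    """
    if current in FINAL_STATES:
        return []
    if command == "kill":
        # A killed run skips every remaining step.
        return [RunState.KILLED]
    position = PIPELINE.index(current)
    if command == "stop":
        # Assumption: Saving and Pushing still happen after a stop.
        tail = [
            state
            for state in (RunState.SAVING, RunState.PUSHING)
            if PIPELINE.index(state) > position
        ]
        return tail + [RunState.STOPPED]
    if command is not None:
        raise ValueError(f"unknown command: {command!r}")
    return PIPELINE[position + 1:] + [RunState.COMPLETE]
```

The Failed and Interrupted states are not modeled as transitions here because they can be reached from an error or a reclaimed spot instance at any point.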