What Is a Run
This page is a deep dive into the core unit of work on Spell: the run.
Anatomy of the run command
The spell run command is used to create runs and is likely the command you'll use most while using Spell. One of the simplest commands you can run is spell run "echo hello world", which will print hello world to the screen. In this case, only spell run is related to the Spell command, while echo hello world is your regular command:
$ spell run "echo hello world"
A more realistic example:
$ spell run --mount runs/1/data:datasets \
--machine-type T4 \
--apt ffmpeg \
--pip moviepy \
"python style.py \
--checkpoint-dir ckpt \
--style images/style/my-style-image.jpg \
--style-weight 1.5e2 \
--train-path datasets/train2014 \
--vgg-path datasets/imagenet-vgg-verydeep-19.mat"
In this example, we've mounted a resource using the --mount flag, specified that we want a T4 GPU with the --machine-type flag, and added both an apt and a pip dependency. These flags add the correct files and dependencies to your Spell environment. Everything in the quotation marks is the regular command from the fast style transfer repository.
As this command demonstrates, spell run has a large number of options, most of which configure the environment. The corresponding sections of the user guide cover them in detail; we briefly mention the most important ones here:
- --mount allows you to mount resources from SpellFS (or potentially other object storage systems) into your run. This is how we recommend loading your datasets into a run.
- --machine-type allows you to specify your instance type. For more details, see the page on Instance types.
- --framework allows you to specify the base machine learning framework image to use. Our default framework contains all of the common machine learning libraries (TensorFlow, PyTorch, etc.), but may be more out of date than a framework-specific image.
- --apt, --pip, and --conda-file allow you to customize the packages installed in the environment.
- --github-url lets you initialize a run with files from a particular GitHub repository. This is handy if you don't already have the code locally.
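To make the relationship between these options concrete, here is a minimal Python sketch that assembles a spell run command line. The helper name build_spell_run and its parameters are hypothetical and exist only for illustration; the flags it emits (--mount, --machine-type, --framework, --apt, --pip) are the ones described above.

```python
import shlex


def build_spell_run(command, mounts=(), machine_type=None, framework=None,
                    apt=(), pip=()):
    """Assemble an argv list for `spell run` (hypothetical helper).

    Only flags documented on this page are emitted. Repeatable options
    such as --pip are passed once per value.
    """
    argv = ["spell", "run"]
    for mount in mounts:
        argv += ["--mount", mount]
    if machine_type:
        argv += ["--machine-type", machine_type]
    if framework:
        argv += ["--framework", framework]
    for package in apt:
        argv += ["--apt", package]
    for package in pip:
        argv += ["--pip", package]
    # The user's regular command goes last, as a single argument.
    argv.append(command)
    return argv


argv = build_spell_run(
    "python style.py --checkpoint-dir ckpt",
    mounts=["runs/1/data:datasets"],
    machine_type="T4",
    apt=["ffmpeg"],
    pip=["moviepy"],
)
# shlex.join (Python 3.8+) quotes the final command so it stays one argument.
print(shlex.join(argv))
```

Keeping the regular command as one list element mirrors the quotation marks in the shell examples: everything Spell-specific comes first, and the quoted command is passed through unchanged.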
Run IDs
All runs get a unique run ID. The run ID can be found in the ✨ Casting spell #<ID>… message that appears after a run is created. Run IDs are assigned in ascending sequence. All runs are given an ID, including workspace runs and TensorBoard runs, although these two types of runs do not show up as entries on your run history page.
Run IDs are useful if you want to:
- Refer back to all the information from your run. Logs, metrics, and outputs from a run are available on the web console.
- Mount the outputs from your run into another run. This pattern of using the output of one run as the input to another run is so common that we've built an entire feature, workflows, based on this idea.
How runs interact with git
A run is typically initiated from a Git repository on the machine where you are using Spell, but it does not have to be. When initiated from a Git repository, Spell uses Git to safely sync the code in your current directory to the remote machine. Turning any directory into a Git repository is easy: just run git init.
If the run is initiated from a Git repository, the run will automatically sync code from the repository and use it within the run. If the current repo has uncommitted changes, these changes will also be synced and used within the run, and can be downloaded as a separate patch via the web console. If there are any untracked files, the run will provide a warning. This is to avoid any ambiguity about the state of your code being transferred to Spell.
Note
- Spell uses SSH to sync your code. If you require special SSH configuration (e.g. proxy commands for network firewalls), update the ssh_config file in your Spell config directory. This will likely be in ~/.spell on macOS or Linux or AppData/Roaming/spell on Windows.
- Spell does not support git-lfs. If you have large files inside your repository, we recommend using the spell upload command.
- Git submodules are not yet supported either. As a workaround, you can directly commit the submodules into the parent repo.
- It is possible to configure Spell to error if uncommitted changes are detected. This can help enforce tracking code changes and improve reproducibility. Configure this in the config file in your Spell config directory.
Everything in the commit that is currently checked out (unless the --commit-ref option is provided to specify a different commit) will be available in the run.
For example, if you go through the steps below, you will see running on Spell! printed during your run, showing that the file in your current Git commit was transferred to Spell for the run:
$ mkdir project
$ cd project
$ git init
$ echo 'running on Spell!' > file
$ git add file && git commit -m 'first commit'
$ spell run cat file
Note that the run will execute in whatever the local current working directory is. For example, if you are working in a nested directory within your Git repository, you can run commands there without worrying about the path relative to the Git repository root.
Dependency management and customization
Customizing frameworks
There are several ways to modify the code packages available in a run. The highest level of customization is your choice of framework, specified using the --framework command line option. E.g.:
$ spell run --framework tensorflow "python train.py"
This provides the following options:
Name | CLI arguments | Notes |
---|---|---|
Default | (none) | tensorflow==1.15.4 keras==2.3.1 torch==1.5.0 torchvision==0.6.0 pytorch-lightning==0.8.4 |
TensorFlow 2.0 | --framework tensorflow2 | tensorflow==2.2.0 |
Spell provides a number of framework environments for you to run your code in. All framework environments have the Ubuntu 18.04 operating system and Python version 3.7. If a machine type is specified that has a GPU, then CUDA and cuDNN are also included. Additionally, the following Python packages always come pre-installed:
- requests
- pandas
- numpy
- matplotlib
- scikit-learn
- xgboost
Installing additional pip packages
You can add pip packages using the --pip option followed by the name of the package:
$ spell run --pip sklearn "python train.py"
To add multiple packages, you can repeat the --pip option:
$ spell run --pip sklearn --pip imageio "python train.py"
If you are installing a lot of packages at once, it is probably more convenient to use the --pip-req option, passing in a path to a requirements.txt file:
$ spell run --pip-req requirements.txt "python train.py"
The requirements file should be a valid pip requirements file. For example:
# requirements.txt
sklearn
imageio
Spell also supports installing pip packages from private GitHub repositories via our GitHub integration, available to users on the Teams plan. See the Integrating GitHub docs for details.
Installing additional conda packages
You can add conda packages by supplying a conda requirements file via the --conda-file option:
$ spell run --conda-file ./environment.yml "python train.py"
The conda environment name in the run on Spell will be spell.
Installing additional apt packages
To modify the system packages included in the environment, use the --apt option.
apt packages are added in much the same way as pip packages, via a command line option: --apt <package>. Spell-maintained frameworks are built on the Ubuntu 18.04 Linux distribution, so any package reachable from the default set of package indexes shipped with Ubuntu 18.04 can be specified.
$ spell run --apt libprotobuf-dev --apt protobuf-compiler "python train.py"
Combining package managers
You can even mix and match package manager options in a single command:
$ spell run --framework fastai --apt ffmpeg --pip cudatoolkit\>=9.0 "echo 'hello world'"
Notice that these options allow you to request specific versions of packages using the >, >=, ==, <, and <= operators. Less-than and greater-than are reserved characters in the shell (they denote redirection), so you will need to escape them with a backslash, as above, or quote the whole argument.
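If you generate commands from a script rather than typing them, quoting can be delegated to Python's standard shlex module. The sketch below is only an illustration: the helper name pip_flags is hypothetical, and the version number in the example requirement is made up.

```python
import shlex


def pip_flags(requirements):
    """Return shell-safe `--pip` flags for a list of requirement strings."""
    parts = []
    for requirement in requirements:
        # shlex.quote wraps the value in single quotes when it contains
        # shell metacharacters such as < or >, and leaves plain names alone.
        parts += ["--pip", shlex.quote(requirement)]
    return " ".join(parts)


# "moviepy>=1.0" is an illustrative specifier, not a recommended pin.
flags = pip_flags(["imageio", "moviepy>=1.0"])
print(flags)
```

Single-quoting the argument has the same effect as the backslash escape: the shell passes the comparison operator through to spell run instead of treating it as a redirection.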
Mounting resources
To add resources (such as datasets) to a run, mount them using the --mount option. To use the mount option, specify the resource path to your dataset and the path to mount the dataset to on Spell. For example, the following code snippet mounts an audio dataset to the /mnt/audio-data path and prints out its contents from inside of the run.
$ spell run --mount public/audio/css10:/mnt/audio-data "ls /mnt/audio-data"
💫 Casting spell #9…
✨ Stop viewing logs with ^C
✨ Machine_Requested… done
✨ Building… done
✨ Mounting… done
✨ Run is running
chinese-single-speaker-speech-dataset
dutch-single-speaker-speech-dataset
finnish-single-speaker-speech-dataset
french-single-speaker-speech-dataset
german-single-speaker-speech-dataset
greek-single-speaker-speech-dataset
hungarian-single-speaker-speech-dataset
japanese-single-speaker-speech-dataset
korean-single-speaker-speech-dataset
russian-single-speaker-speech-dataset
spanish-single-speaker-speech-dataset
We strongly recommend using absolute mount paths for legibility. As a best practice, we recommend using the /mnt directory as the root directory for your data.
Alternatively, you can choose to omit all or part of the mount path, in which case the default path and/or default name will be used:
$ spell run --mount public/audio/css10:audio-data "python main.py"
$ spell run --mount public/audio/css10 "python main.py"
In this case the files will land in the current working directory inside of the run (typically /spell/$REPO_NAME, assuming you started the run from the root of a Git repository). To learn more about resources see the page "What is a resource?".
Saving resources
Any files that a Spell run saves within the /spell/ working directory are automatically saved to the runs/$RUNID directory in our virtual filesystem, SpellFS, at the end of the run. These files will then be available to browse, download, or mount into a different run. For example:
$ spell run "echo hello world! > foo.txt"
$ spell run --mount runs/435/foo.txt:/spell/foo.txt cat foo.txt
hello world!
Additionally, files saved to a mounted directory inside of a Spell run also get backed up.
Again, to learn more about resources see the page "What is a resource?".
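The mount syntax and the runs/$RUNID output location combine naturally when one run consumes another's output. The following Python sketch formats --mount values from their parts; the helper names mount_spec and run_output are hypothetical, and the sketch only builds strings in the forms shown on this page.

```python
def mount_spec(resource, target=None):
    """Format a --mount value: 'resource' or 'resource:target'."""
    if not resource:
        raise ValueError("resource path must not be empty")
    # Omitting the target leaves the default mount path to Spell.
    return f"{resource}:{target}" if target else resource


def run_output(run_id, path=""):
    """Resource path for something saved by an earlier run, under runs/<id>."""
    base = f"runs/{run_id}"
    return f"{base}/{path.lstrip('/')}" if path else base


# Mount a public dataset at an absolute path, as recommended above.
print(mount_spec("public/audio/css10", "/mnt/audio-data"))

# Mount a file written by run 435 into a new run's working directory.
print(mount_spec(run_output(435, "foo.txt"), "/spell/foo.txt"))
```

The resulting strings can be passed directly after --mount on the spell run command line.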
Setting environment variables
Environment variables can be passed to a run using the --env flag.
$ spell run --env CUDA_VISIBLE_DEVICES=1 "python example.py"
Using code from a GitHub repo
As explained in the section "How runs interact with git" above, the default workflow for using a run involves initializing it from a git repository. That git repository is loaded onto the machine, and its files are usable by the run.
You can specify an alternative GitHub code repository using the --github-url flag. This will use the master branch by default. You can specify an alternative branch, commit hash, or git ref using the --github-ref flag. For example:
$ spell run \
--github-url 'https://github.com/spellml/examples.git' \
--github-ref '622d64' \
'git log -n 1 | cat'
Note that on the community edition of Spell the code repository has to be public. Spell for Teams allows you to pull code from private repositories using our GitHub integration. For more info on this see the page "Integrating GitHub".
Interrupting a run
A run can be stopped or killed at any time.
$ spell stop $RUN_ID
$ spell kill $RUN_ID
Stopping a run will send a graceful shutdown signal: the run will save all of its contents to disk before exiting, and all of the files that the run has generated so far (model checkpoints, for example) will be available in the SpellFS.
Killing a run will send a hard shutdown signal. The machine will be interrupted and the run will be killed without being given the opportunity to save to disk.
We recommend only killing runs whose output you are sure you will not need, for example, runs whose code has a bug. Machines whose runs are killed are recycled faster because they do not have to upload any data to object storage, which is useful when you are debugging. For long-lived runs, stopping the run is usually much more appropriate.
Note that the stop and kill commands are also available in the web console.
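The choice between the two commands comes down to whether you still need the run's output. As a small sketch, a script that manages runs could encode that rule as follows; the helper name interrupt_command and its keep_outputs parameter are hypothetical.

```python
def interrupt_command(run_id, keep_outputs=True):
    """Return the argv for interrupting a run (hypothetical helper).

    keep_outputs=True  -> `spell stop`: graceful, files written so far are saved.
    keep_outputs=False -> `spell kill`: immediate, nothing is saved.
    """
    action = "stop" if keep_outputs else "kill"
    return ["spell", action, str(run_id)]


print(interrupt_command(42))                      # graceful stop
print(interrupt_command(42, keep_outputs=False))  # hard kill
```

Defaulting to stop keeps the safer behavior as the fallback, so outputs are only discarded when that is requested explicitly.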
(Advanced) Using custom public Docker images
If you have a use case that requires environment configuration deeper than what the apt, pip, and conda-file flags can provide, you can choose to provide Spell a custom Docker image instead using the --docker-image flag. This parameter takes a container image URL as input, e.g. a URL in the form <domain>/<repository>/<image_name>:<tag>. The domain and repository parts of the path are both optional; if you omit them, public DockerHub (https://hub.docker.com/_/) will be used. For example:
$ spell run -f --docker-image 'python:latest' \
'python -c "print(\"Hello World\")"'
Note that the --docker-image flag is exclusive: it cannot be combined with any of the other environment configuration flags, e.g. pip, conda-file, etc.
(Advanced) Using custom private Docker images
Note
This feature is only available on Spell for Teams.
For a demonstration of how you can use this feature to customize your workspace, refer to our blog post: "Using custom Docker images in Spell runs and workspaces".
Users on the community edition must pull from a public registry. Users on Spell for Teams may pull from private AWS ECR or GCP Container Registry instead.
The --docker-image flag on the spell run command can be used to initialize a run environment using an image pulled from a Docker registry. Teams users that have configured their own cluster can use the private registry of the cluster's cloud provider (ECR for AWS, GCR for GCP) using the spell cluster add-docker-registry command.
If your cluster is on AWS for example, you can use the command to add one of the ECR repos currently available in that AWS account. This will update the IAM permissions that Spell uses to control your cluster allowing it access to the repo.
$ spell cluster add-docker-registry
This command will
- Allow Spell to get authorization tokens to access your docker registry
- If no repository is specified, list your repositories in the registry
- Add read permissions for that repository to the IAM role associated with the cluster
All of this will be done with your AWS profile 'default' which has Access Key ID 'ABCDEFGHIJKL' and region 'us-west-2' - continue? [y/N]: y
Spell does not yet have access to the following repos found in your AWS account:
- image-processing
- text-generation
- image-gan
Please choose a repository: text-generation
...
Successfully added read permission to text-generation
Once set up, you can use the --docker-image argument to spell run to specify a Docker image pushed to the private ECR repository.
The ECR or GCR permissions can be removed at any time with spell cluster delete-docker-registry.
(Advanced) Mounting public buckets
You can mount data from public AWS S3 or GCP GS buckets into a run. For example, the following command mounts monthly rainfall figures from the public SILO climate dataset and gives them the alias data.
$ spell run \
-m 's3://silo-open-data/annual/monthly_rain:data' \
'python main.py'
Users on Spell for Teams can mount data from private S3 and/or GS buckets using our AWS and GCP integrations. See Cluster Bucket Management for more details.
(Advanced) Specifying an early stopping condition
Early stopping is the technique of ending a training run early, before the model training process has reached its maximum number of epochs, based on lack of improvement in the model accuracy.
We support a form of early stopping directly in the CLI using the --stop-condition flag. For example, --stop-condition "keras/val_acc < 0.5 : 10" will stop model training if the run doesn't reach a validation accuracy of 50 percent within 10 epochs, or if the validation accuracy metric dips below 50 percent in subsequent epochs. Early stopping conditions can be defined using either automatic or custom user metrics.
We support the <, >, <=, and >= operators. The second parameter (the epoch count after the colon) is optional; if left unspecified, the early stopping check will be applied to every epoch of training.
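To build intuition for how such a condition behaves, the sketch below parses a condition string and replays it against a list of per-epoch metric values. It is a local simulation under stated assumptions, not Spell's implementation: the function names are hypothetical, epochs are assumed to be counted from 1, and the check is assumed to begin at the epoch given after the colon.

```python
import operator

# The four comparison operators listed above.
OPERATORS = {"<": operator.lt, ">": operator.gt,
             "<=": operator.le, ">=": operator.ge}


def parse_stop_condition(text):
    """Split 'metric OP value [: min_epoch]' into its four parts."""
    condition, _, epoch_part = text.partition(":")
    metric, op, value = condition.split()
    if op not in OPERATORS:
        raise ValueError(f"unsupported operator: {op}")
    # No epoch given means the check applies from the first epoch on.
    min_epoch = int(epoch_part) if epoch_part.strip() else 0
    return metric, op, float(value), min_epoch


def first_stop_epoch(values, op, threshold, min_epoch=0):
    """Return the first 1-based epoch at which the condition fires, or None."""
    compare = OPERATORS[op]
    for epoch, value in enumerate(values, start=1):
        if epoch >= min_epoch and compare(value, threshold):
            return epoch
    return None


metric, op, threshold, min_epoch = parse_stop_condition("keras/val_acc < 0.5 : 10")
# Accuracy is low for nine epochs, recovers, then dips again at epoch 12.
history = [0.1] * 9 + [0.55, 0.6, 0.45]
print(first_stop_epoch(history, op, threshold, min_epoch))
```

Under these assumptions the low values in the first nine epochs are ignored, and the run would only be stopped once the metric falls below the threshold at epoch 10 or later.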
(Advanced) Run states
Runs transition through the following states:
- Requested: the run has been created, and is queued for execution. Typically, a run will transition through this state very quickly. However, a run could stay Requested for an extended period of time for two reasons:
  - You already have the maximum number of concurrent runs executing that is allowed per your plan. If this is the case, your run will transition to Building when the number of concurrent runs you have falls below the limit.
  - Spell received a number of runs at the same time and is starting up more machines to execute your run. If this is the case, your run will transition to Building as soon as a machine is ready.
- Building: the environment for your run is created. This includes installing any dependencies that are specified (e.g. --pip or --apt parameters provided on the command line to create the run) and copying the code from your Git repository into the run. First we check to see if the run uses an environment from a previous run, and if so we use the cached environment instead of building the environment again, to minimize build time.
- Mounting: the resources (if any) that were specified are mounted into the run. (See the Resources section for more information).
- Running: your command is executing!
- Saving: any new or modified files from your command are saved as a resource into runs/
- Pushing: your environment is pushed to our cache of environments to expedite a future run with the same environment. Usually, your run will transition through this step very quickly since this push happens asynchronously in the background throughout every other state. The push is usually completed by the time this state is reached unless your run is very quick.
All runs eventually transition to one of the following final states:
- Complete: this is the normal state for a run to transition to upon completion.
- Killed: the run was killed using the spell kill command. When a run is killed, it does not transition through any subsequent steps and is immediately terminated.
- Stopped: the run was stopped using the spell stop command. A stopped run still transitions through the Saving and Pushing states; however, its Running state is exited immediately when the spell stop command is issued.
- Failed: the run did not complete due to an error in some other state.
- Interrupted: the run did not complete because it was executing on a spot instance that got reclaimed (this state is only possible when using spot instances on Spell for Teams).
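The states above can be summarized as a small state model. The Python sketch below is one way to write it down for scripts that track runs; the class and function names are hypothetical, and the handling of a stop issued before the Running state is an assumption, since this page only describes stopping a run that is Running.

```python
from enum import Enum


class RunState(Enum):
    REQUESTED = "Requested"
    BUILDING = "Building"
    MOUNTING = "Mounting"
    RUNNING = "Running"
    SAVING = "Saving"
    PUSHING = "Pushing"
    COMPLETE = "Complete"
    KILLED = "Killed"
    STOPPED = "Stopped"
    FAILED = "Failed"
    INTERRUPTED = "Interrupted"


# Order of the non-final states for a run that finishes normally.
PIPELINE = [RunState.REQUESTED, RunState.BUILDING, RunState.MOUNTING,
            RunState.RUNNING, RunState.SAVING, RunState.PUSHING]

FINAL_STATES = {RunState.COMPLETE, RunState.KILLED, RunState.STOPPED,
                RunState.FAILED, RunState.INTERRUPTED}


def is_final(state):
    """True once a run can no longer change state."""
    return state in FINAL_STATES


def remaining_states(current, action=None):
    """States a run still passes through from a non-final `current` state.

    action is None (run to completion), "stop", or "kill".
    """
    if action == "kill":
        # Killed runs skip every subsequent step.
        return [RunState.KILLED]
    later = PIPELINE[PIPELINE.index(current) + 1:]
    if action == "stop":
        # A stopped run leaves Running at once but still saves and pushes.
        # Assumption: a stop issued earlier also jumps straight to Saving.
        later = [s for s in later if s in (RunState.SAVING, RunState.PUSHING)]
        return later + [RunState.STOPPED]
    return later + [RunState.COMPLETE]


print([s.value for s in remaining_states(RunState.RUNNING, "stop")])
```

Failed and Interrupted are modeled only as final states here, because they can be reached from several points in the pipeline.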