What Is a Run
This page is a deep dive into the core unit of work on Spell: the run.
Anatomy of the run command
The spell run command creates runs, and is likely the command you'll use most on Spell. One of the simplest commands you can run is
spell run "echo hello world", which will print
hello world to the screen. In this case, only
spell run is related to the Spell command, while
echo hello world is your regular command:
$ spell run "echo hello world"
A more realistic example:
$ spell run --mount runs/1/data:datasets \
    --machine-type T4 \
    --apt ffmpeg \
    --pip moviepy \
    "python style.py \
      --checkpoint-dir ckpt \
      --style images/style/my-style-image.jpg \
      --style-weight 1.5e2 \
      --train-path datasets/train2014 \
      --vgg-path datasets/imagenet-vgg-verydeep-19.mat"
In this example, we've mounted a file using the spell
--mount flag, specified that we want a T4 GPU with the
--machine-type flag, and added both an
apt and a
pip dependency. These flags add the correct files and dependencies to your Spell environment. Everything in the quotation marks is the regular command from the fast style transfer repository.
As this command demonstrates,
spell run has a large number of options, most of which configure the environment. The corresponding sections of the user guide cover them in detail; we briefly mention the most important ones here:
- mount allows you to mount resources from SpellFS (or potentially other object storage systems) into your run. This is how we recommend loading your datasets into a run.
- machine-type allows you to specify your instance type. For more details, see the page on Instance types.
- framework allows you to specify the base machine learning framework image to use. Our default framework contains all of the common machine learning libraries (TensorFlow, PyTorch, etc.), but may be more out of date than a framework-specific image.
- conda-file allows you to customize the packages installed in the environment.
- github-url lets you initialize a run with files from a particular GitHub repository. This is handy if you don't already have the code locally.
All runs get a unique run ID. The run ID can be found in the
✨ Casting spell #<ID>… message that appears after a run is created. Run IDs are assigned in ascending sequence. All runs are given an ID, including workspace runs and TensorBoard runs, although these two types of runs do not show up as entries on your run history page.
Run IDs are useful if you want to:
- Refer back to all the information from your run. Logs, metrics, and outputs from a run are available on the web console.
- Mount the outputs from your run into another run. This pattern of using the output of one run as the input to another run is so common that we've built an entire feature, workflows, based on this idea.
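If you script around the CLI, one way to capture the run ID is to pull it out of the Casting spell message. The following is a minimal sketch, not part of the Spell CLI: extract_run_id is a hypothetical helper, and it assumes the message format shown above does not change.

```shell
# Hypothetical helper (not part of the Spell CLI): read log lines on stdin and
# print the run ID found in a "Casting spell #<ID>" line, if there is one.
extract_run_id() {
  sed -n 's/.*Casting spell #\([0-9][0-9]*\).*/\1/p'
}

# Sample log lines stand in for real `spell run` output here.
printf '%s\n' 'Casting spell #9' 'Run is running' | extract_run_id
```

In practice you would feed the function the log output of a run instead of the sample lines, assuming the message is written to standard output.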
How runs interact with git
A run is typically initiated from a Git repository on the machine where you are using Spell, but it does not have to be. When initiated from a Git repository, Spell uses Git to safely sync the code in your current directory to the remote machine. Turning any directory into a Git repository is easy: just run git init.
If the run is initiated from a Git repository, the run will automatically sync code from the repository and use it within the run. If the current repo has uncommitted changes, these changes will also be synced and used within the run, and can be downloaded as a separate patch via the web console. If there are any untracked files, the run will provide a warning. This is to avoid any ambiguity about the state of your code being transferred to Spell.
- Spell uses SSH to sync your code. If you require special SSH configuration (e.g. proxy commands for network firewalls), update the
ssh_config file in your Spell config directory. This will likely be in
~/.spell on macOS or Linux or
- Spell does not support
git-lfs. If you have large files inside your repository, we recommend using the
- Git submodules are not yet supported either. As a workaround you can directly commit the submodules into the parent repo.
- It is possible to configure Spell to error if uncommitted changes are detected. This can help enforce tracking code changes and improve reproducibility. Configure this in the
config file in your Spell config directory.
Everything in the commit that is currently checked out (unless the
--commit-ref option is provided to specify a different commit) will be available in the run.
For example, if you go through the steps below, you will see
running on Spell! printed during your run, showing that the file in your current Git commit was transferred to Spell for the run:
$ mkdir project
$ cd project
$ git init
$ echo 'running on Spell!' > file
$ git add file && git commit -m 'first commit'
$ spell run cat file
Note that the run will execute in whatever the local current working directory is. For example, if you are working in a nested directory within your Git repository, you can run commands there without worrying about the path relative to the Git repository root.
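Because untracked files trigger a warning and uncommitted changes are synced as a patch, it can help to check the state of your repository before creating a run. The sketch below is one way to do that with standard Git tooling. list_untracked and list_uncommitted are hypothetical helper names, and they assume the short git status --porcelain format (two status characters, a space, then the path).

```shell
# Hypothetical helpers that read `git status --porcelain` output on stdin.

# Print paths Git does not track yet (status "??").
list_untracked() {
  sed -n 's/^?? //p'
}

# Print tracked paths with staged or unstaged changes (any other status).
list_uncommitted() {
  sed -n '/^??/!s/^...//p'
}

# Sample porcelain output stands in for a real repository here.
sample=$(printf '?? notes.txt\n M train.py\nA  new.py')
printf '%s\n' "$sample" | list_untracked     # prints notes.txt
printf '%s\n' "$sample" | list_uncommitted   # prints train.py and new.py
```

In a real repository, replace the sample with the live command, for example git status --porcelain | list_untracked.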
Dependency management and customization
There are several ways to modify the code packages available in a run. The highest level of customization is your choice of framework, specified using the
--framework command line option. E.g.:
$ spell run --framework tensorflow2 "python train.py"
This provides the following options:
|Name||CLI arguments||Operating system||Python frameworks|
Spell provides a number of framework environments for you to run your code in. All framework environments have the Ubuntu operating system, CUDA, cuDNN, and the following Python packages:
requests pandas numpy matplotlib scikit-learn xgboost
The versions of these packages can be overridden using the
--pip flag or using a conda environment.
Using the current virtual environment
$ spell run --deps-from-env "python train.py"
This flag will detect which environment is activated and extract the current dependencies using
conda env export --from-history for conda environments and pip-chill for other environments. If inside both a conda environment and another environment, the conda environment will be selected. Any packages which already exist in the framework will be ignored, and the default versions installed on the framework will be used instead.
Using requirements.txt files
If you are installing a lot of packages at once, it is probably more convenient to use
--pip-req, passing in a path to a requirements.txt file:
$ spell run --pip-req requirements.txt "python train.py"
The requirements file should be a valid pip requirements file. For example:
# requirements.txt
sklearn
imageio
Spell also supports installing pip packages from private GitHub repositories via our GitHub integration, available to users on the Teams plan. See the Integrating GitHub docs for details.
If a package is specified in both the requirements.txt file and in the environment determined from the
--deps-from-env flag, the requirements.txt file will override the package in the environment.
Using conda environment files
You can add conda packages by supplying a conda environment file via the --conda-file flag:
$ spell run --conda-file ./environment.yml "python train.py"
The conda environment name in the run on Spell will be
Installing additional pip packages
You can add pip packages using the
--pip option followed by the name of the package.
$ spell run --pip sklearn "python train.py"
To add multiple packages, you can repeat the --pip flag:
$ spell run --pip sklearn --pip imageio "python train.py"
This flag can also be used to override the existing framework package versions and any versions specified in either the requirements.txt file or the environment determined from the --deps-from-env flag.
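When the package list lives in a script, repeating the flag by hand gets tedious. As a rough sketch, the hypothetical pip_flags helper below expands a list of package names into repeated --pip options and prints the resulting command for review instead of executing it.

```shell
# Hypothetical helper: turn each package name into its own "--pip <name>" pair.
pip_flags() {
  for pkg in "$@"; do
    printf '%s %s ' --pip "$pkg"
  done
}

# Print (rather than execute) the command that would be run.
echo "spell run $(pip_flags sklearn imageio)\"python train.py\""
# prints: spell run --pip sklearn --pip imageio "python train.py"
```

Package names that carry version constraints still need shell quoting when the printed command is actually run.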
Installing additional apt packages
To modify the system packages included in the environment, use the --apt flag.
apt packages are added much in the same way as
pip packages, via a command line option:
--apt <package>. Spell-maintained frameworks are built on the Ubuntu 18.04 Linux distribution, so any package reachable from the default set of package indexes shipped with Ubuntu 18.04 can be specified.
$ spell run --apt libprotobuf-dev --apt protobuf-compiler "python train.py"
Combining package managers
You can even mix and match package manager commands in a single command:
$ spell run --apt ffmpeg --pip cudatoolkit\>=9.0 "echo 'hello world'"
Notice that the commands allow you to specify specific versions of packages using
<= operators. The less-than and greater-than signs are reserved characters in the shell, so you will probably need to escape them with a backslash, as above.
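The escaping rule is a property of the shell rather than of Spell, so you can check it without creating a run. In this sketch, show_arg is a hypothetical stand-in for any command that receives the argument. A backslash, single quotes, and double quotes all deliver the same text.

```shell
# Hypothetical stand-in for a command that receives one argument.
show_arg() {
  printf '%s\n' "$1"
}

show_arg cudatoolkit\>=9.0     # backslash escape, as in the example above
show_arg 'cudatoolkit>=9.0'    # single quotes
show_arg "cudatoolkit>=9.0"    # double quotes
# Each call prints: cudatoolkit>=9.0
```

Without escaping or quoting, the shell treats > as an output redirection and writes to a file named =9.0 instead of passing the version constraint along.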
To add resources (such as datasets) to a run, mount them using the
--mount option. To use the mount option, specify the resource path to your dataset and the path to mount the dataset to on Spell. For example, the following code snippet mounts an audio dataset to the
/mnt/audio-data path and prints out its contents from inside of the run.
$ spell run --mount public/audio/css10:/mnt/audio-data "ls /mnt/audio-data"
💫 Casting spell #9…
✨ Stop viewing logs with ^C
✨ Machine_Requested… done
✨ Building… done
✨ Mounting… done
✨ Run is running
chinese-single-speaker-speech-dataset
dutch-single-speaker-speech-dataset
finnish-single-speaker-speech-dataset
french-single-speaker-speech-dataset
german-single-speaker-speech-dataset
greek-single-speaker-speech-dataset
hungarian-single-speaker-speech-dataset
japanese-single-speaker-speech-dataset
korean-single-speaker-speech-dataset
russian-single-speaker-speech-dataset
spanish-single-speaker-speech-dataset
We strongly recommend using absolute mount paths for legibility. As a best practice, we recommend using the
/mnt directory as the root directory for your data.
Alternatively, you can choose to omit all or part of the mount path, in which case the default path and/or default name will be used:
$ spell run --mount public/audio/css10:audio-data "python main.py"
$ spell run --mount public/audio/css10 "python main.py"
In this case the files will land in the current working directory inside of the run (typically
/spell/$REPO_NAME, assuming you started the run from the root of a Git repository). To learn more about resources see the page "What is a resource?".
Any files that a Spell run saves within the
/spell/ working directory are automatically saved to the
runs/$RUNID directory in our virtual filesystem, SpellFS, at the end of the run. These files will then be available to browse, download, or mount into a different run. For example:
$ spell run "echo hello world! > foo.txt"
$ spell run --mount runs/435/foo.txt:/spell/foo.txt cat foo.txt
hello world!
Additionally, files saved to a mounted directory inside of a Spell run also get backed up.
Again, to learn more about resources see the page "What is a resource?".
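The runs/$RUNID convention makes it straightforward to build the mount argument from a run ID in a script. The sketch below is illustrative only: run_output_mount is a hypothetical helper, and it assumes the file should appear under /spell/ in the new run, as in the example above. It prints the command for review instead of executing it.

```shell
# Hypothetical helper: build a --mount argument that exposes one output file
# of an earlier run inside the working directory of a new run.
run_output_mount() {
  run_id=$1      # numeric ID of the run that produced the file
  file=$2        # path of the file relative to that run's outputs
  printf 'runs/%s/%s:/spell/%s\n' "$run_id" "$file" "$file"
}

# Print (rather than execute) the resulting command.
echo "spell run --mount $(run_output_mount 435 foo.txt) cat foo.txt"
# prints: spell run --mount runs/435/foo.txt:/spell/foo.txt cat foo.txt
```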
Setting environment variables
Environment variables can be passed to a run using the --env flag:
$ spell run --env CUDA_VISIBLE_DEVICES=1 "python example.py"
Using code from a GitHub repo
As explained in the section "How runs interact with git" above, the default workflow for using a run involves initializing it from a git repository. That git repository is loaded onto the machine, and its files are usable by the run.
You can specify an alternative GitHub code repository using the
--github-url flag. This will use the
master branch by default. You can specify an alternative branch, commit hash, or git ref using the
--github-ref flag. For example:
$ spell run \
    --github-url 'https://github.com/spellml/examples.git' \
    --github-ref '622d64' \
    'git log -n 1 | cat'
Note that on the community edition of Spell the code repository has to be public. Spell for Teams allows you to pull code from private repositories using our GitHub integration. For more info on this see the page "Integrating GitHub".
Interrupting a run
A run can be stopped or killed at any time.
$ spell stop $RUN_ID
$ spell kill $RUN_ID
Stopping a run will send a graceful shutdown signal: the run will save all of its contents to disk before exiting, and all of the files that the run has generated so far (model checkpoints, for example) will be available in the SpellFS.
Killing a run will send a hard shutdown signal. The machine will be interrupted and the run will be killed without being given the opportunity to save to disk.
We recommend only killing runs whose output you are sure you will not need, for example, runs whose code has a bug. Machines whose runs get killed get recycled faster because they do not have to upload any data to object storage, useful when you are debugging. For runs which are long-lived, stopping the run is usually much more appropriate.
Note that the stop and kill commands are also available in the web console.
(Advanced) Using custom public Docker images
If you have a use case that requires environment configuration deeper than what the
conda-file flags can provide, you can choose to provide Spell with a custom Docker image instead using the
--docker-image flag. This parameter takes a container image URL as input, e.g. a URL in the form
<domain>/<repository>/<image_name>:<tag>. The domain and repository parts of the path are both optional; if you omit them, public DockerHub (https://hub.docker.com/_/) will be used. For example:
$ spell run -f --docker-image 'python:latest' \
    'python -c "print(\"Hello World\")"'
Note that the
--docker-image flag is exclusive: it cannot be combined with any of the other environment configuration flags, e.g.
(Advanced) Using custom private Docker images
This feature is only available on Spell for Teams.
For a demonstration of how you can use this feature to customize your workspace, refer to our blog post: "Using custom Docker images in Spell runs and workspaces".
Users on the community edition must pull from a public registry. Users on Spell for Teams may pull from a private AWS ECR or GCP Container Registry instead.
The --docker-image flag on the
spell run command can be used to initialize a run environment using an image pulled from a Docker registry. Teams users that have configured their own cluster can use the private registry of the cluster's cloud provider (ECR for AWS, GCR for GCP) using the
spell cluster add-docker-registry command.
If your cluster is on AWS for example, you can use the command to add one of the ECR repos currently available in that AWS account. This will update the IAM permissions that Spell uses to control your cluster allowing it access to the repo.
$ spell cluster add-docker-registry
This command will
- Allow Spell to get authorization tokens to access your docker registry
- If no repository is specified, list your repositories in the registry
- Add read permissions for that repository to the IAM role associated with the cluster
All of this will be done with your AWS profile 'default' which has Access Key ID 'ABCDEFGHIJKL' and region 'us-west-2' - continue? [y/N]: y
Spell does not yet have access to the following repos found in your AWS account:
- image-processing
- text-generation
- image-gan
Please choose a repository: text-generation
...
Successfully added read permission to text-generation
Once set up, you can use the
--docker-image argument to
spell run to specify a docker image
pushed to the private ECR repository.
The ECR or GCR permissions can be removed at any time with
spell cluster delete-docker-registry.
(Advanced) Mounting public buckets
You can mount data from public AWS S3 or GCP GS buckets into a run. In the next example, we mount monthly rainfall figures from the public SILO climate dataset and give them the alias data:
$ spell run \
    -m 's3://silo-open-data/annual/monthly_rain:data' \
    'python main.py'
Users on Spell for Teams can mount data from private S3 and/or GS buckets using our AWS and GCP integrations. See Cluster Bucket Management for more details.
(Advanced) Specifying an early stopping condition
Early stopping is the technique of ending a training run early, before the model training process has reached its maximum number of epochs, based on lack of improvement in the model accuracy.
We support a form of early stopping directly in the CLI using the
stop-condition flag. For example,
--stop-condition "keras/val_acc < 0.5 : 10" will stop model training if the run doesn't reach a validation accuracy of 50 percent within 10 epochs, or if the validation accuracy metric dips below 50 percent in subsequent epochs. Early stopping conditions can be defined using either automatic or custom user metrics.
>= operators. The second parameter is optional; if left unspecified, the early stopping check will be applied to every epoch of training.
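To make the shape of a stop condition concrete, the sketch below splits one into its parts. It illustrates the format described above and is not Spell's own parser: parse_stop_condition and its output labels are hypothetical, and it assumes only the metric, operator, threshold, and optional epoch count seen in the example.

```shell
# Hypothetical helper: split "<metric> <operator> <threshold> [: <epochs>]".
parse_stop_condition() {
  sc_cond=$1
  case $sc_cond in
    *:*) sc_epochs=${sc_cond##*:}; sc_expr=${sc_cond%:*} ;;
    *)   sc_epochs=; sc_expr=$sc_cond ;;
  esac
  sc_epochs=$(printf '%s' "$sc_epochs" | tr -d ' ')
  set -f               # keep the shell from globbing while word-splitting
  set -- $sc_expr      # $1=metric, $2=operator, $3=threshold
  set +f
  printf 'metric=%s operator=%s threshold=%s epochs=%s\n' \
    "$1" "$2" "$3" "${sc_epochs:-every}"
}

parse_stop_condition "keras/val_acc < 0.5 : 10"
# prints: metric=keras/val_acc operator=< threshold=0.5 epochs=10
parse_stop_condition "keras/val_acc < 0.5"
# prints: metric=keras/val_acc operator=< threshold=0.5 epochs=every
```

The label every in the second call reflects the rule that an omitted second parameter applies the check to every epoch.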
(Advanced) Run states
Runs transition through the following states:
- Requested: the run has been created, and is queued for execution. Typically, a run will transition through this state very quickly. However, a run could stay
Requested for an extended period of time for two reasons:
- You already have the maximum number of concurrent runs executing that is allowed per your plan. If this is the case, your run will transition to
Building when the number of concurrent runs you have falls below the limit.
- Spell received a number of runs at the same time and is starting up more machines to execute your run. If this is the case, your run will transition to
Building as soon as a machine is ready.
- Building: the environment for your run is created. This includes installing any dependencies that are specified (e.g.
--apt parameters provided on the command line to create the run) and copying the code from your Git repository into the run. First we check to see if the run uses an environment from a previous run, and if so we use the cached environment to minimize build time in lieu of building the environment again.
- Mounting: the resources (if any) that were specified are mounted into the run. (See the Resources section for more information).
- Running: your command is executing!
- Saving: any new or modified files from your command are saved as a resource into
- Pushing: your environment is pushed to our cache of environments to expedite a future run with the same environment. Usually, your run will transition through this step very quickly since this push happens asynchronously in the background throughout every other state. The push is usually completed by the time this state is reached unless your run is very quick.
All runs eventually transition to one of the following final states:
- Complete: this is the normal state for a run to transition to upon completion.
- Killed: the run was killed using the
spell kill command. When a run is killed, it does not transition through any subsequent steps and is immediately terminated.
- Stopped: the run was stopped using the
spell stop command. A stopped run still transitions through the
Saving and Pushing states; however, its
Running state is immediately exited whenever the
spell stop command is issued.
- Failed: the run did not complete due to an error in some other state.
- Interrupted: the run did not complete because it was executing on a spot instance that got reclaimed (this state is only possible when using spot instances on Spell for Teams).
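The normal path through these states can be written down as a small lookup, which is handy if you poll run status from a script. This is only a sketch of the ordering listed above: next_state is a hypothetical helper, and it models the path to Complete without the Killed, Stopped, Failed, or Interrupted exits.

```shell
# Hypothetical helper: print the state that follows $1 on the normal path.
next_state() {
  case $1 in
    Requested) echo Building ;;
    Building)  echo Mounting ;;
    Mounting)  echo Running ;;
    Running)   echo Saving ;;
    Saving)    echo Pushing ;;
    Pushing)   echo Complete ;;
    *)         return 1 ;;   # final or unknown state: nothing follows
  esac
}

# Walk the normal path from creation to completion.
state=Requested
path=$state
while next=$(next_state "$state"); do
  state=$next
  path="$path -> $state"
done
echo "$path"
# prints: Requested -> Building -> Mounting -> Running -> Saving -> Pushing -> Complete
```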