This page is a high-level overview of Spell's most important features. For more detailed discussions of specific features, refer to the corresponding sections of the User Guide.
A run is a single instance of a computational job executed on Spell.
Runs take an environment definition, a set of resources (datasets), and code as input; execute that code in the cloud; and provide the files created over the course of the run as output. Runs are extensively configurable. The most important options are:
- Compute instance: options range from basic CPU to powerful GPU servers.
- Frameworks: we provide base environments for TensorFlow, PyTorch, Fast.ai, and others. Or, roll your own.
- Packages: you can install any code packages you need using pip, conda, and apt.
- Resources: datasets to be mounted onto the container.
Runs are the atomic unit of work in Spell. We do a lot of work behind the scenes to make runs easy to use and a natural fit for your workflow. To learn more, refer to the Run Overview.
A workspace is an instance of a Jupyter Notebook or JupyterLab environment running on the cloud. Under the hood, workspaces are still just runs, and can be configured in all the same ways.
Workspaces provide a flexible work environment on your choice of CPU and GPU hardware that you can easily spin up and spin down as needed. We manage the data storage and compute environment for you so that you can focus on the code.
To learn more about workspaces, refer to the Workspace Overview.
Resources is the generic name for the datasets, models, or any other files made available to a run. Spell keeps these organized for you in a remote filesystem called SpellFS.
The resources associated with your account fall into three groups:
- public getting-started assets that we provide (e.g. example data),
- uploads that you push to SpellFS, and
- run outputs that you create during your runs.

Mounting S3 and GCS buckets is also possible. To learn more about resources, see What is a Resource.
Spell automatically collects all of the hardware metrics and some of the model metrics generated as part of a run, outputting them to the run summary page.
You can extend this system to log your own custom model metrics on Spell. To learn more about how metrics work, refer to the guide on Metrics.
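Custom metrics are typically emitted as named, timestamped values from inside the training loop. As an illustration of the pattern only (the `MetricLogger` class and the loss values below are stand-ins, not Spell's actual metrics API), a training script might collect metrics like this:

```python
import time

class MetricLogger:
    """Stand-in for a metrics client: records (name, value, timestamp) tuples.

    In a real Spell run, metrics emitted by your script are collected
    automatically and surfaced on the run summary page.
    """

    def __init__(self):
        self.records = []

    def send_metric(self, name, value):
        # Record the metric with the time it was observed.
        self.records.append((name, float(value), time.time()))

# A mock training loop emitting one metric per epoch.
logger = MetricLogger()
for epoch in range(3):
    train_loss = 1.0 / (epoch + 1)  # placeholder for a real loss value
    logger.send_metric("train_loss", train_loss)

print([name for name, _, _ in logger.records])
```

The key idea is that each metric is just a named time series; the platform handles storage and plotting.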
Runs can be linked together using workflows. Workflows allow you to create an arbitrary DAG of runs, where each run is kicked off as soon as the runs it depends on complete.
Workflows are particularly convenient for long-running pipelines or scripts, since you don't have to worry about keeping your personal computer up and running for the duration of the script. To learn more about workflows, see the What is a Workflow? page.
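Conceptually, a workflow is a DAG whose nodes are runs and whose edges are dependencies: each run launches once everything it depends on has finished. A minimal sketch of that scheduling logic, using Python's standard-library `graphlib` and made-up run names:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each run maps to the set of runs it depends on.
dag = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "export": {"train"},
}

ts = TopologicalSorter(dag)
ts.prepare()
waves = []
while ts.is_active():
    ready = ts.get_ready()  # runs whose dependencies have all completed
    waves.append(sorted(ready))
    for run in ready:
        # A real scheduler would launch each ready run here (possibly in
        # parallel) and call done() only when it finishes.
        ts.done(run)

print(waves)  # → [['preprocess'], ['train'], ['evaluate', 'export']]
```

Note that `evaluate` and `export` become ready in the same wave: independent runs can execute concurrently as soon as their shared dependency completes.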
Model hyperparameters control different aspects of the learning process of a machine learning model. Hyperparameter search is the process of finding the hyperparameter values that produce the most performant model.
You can launch a hyperparameter search on Spell directly from the command line. We spin up and manage a pool of worker machines, and handle partitioning your search across a set of runs for you.
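For intuition, a grid search expands the cross product of the candidate values into one configuration per combination, each of which becomes its own run. A sketch of that partitioning step (the parameter names and values here are hypothetical, not a Spell API):

```python
from itertools import product

# Hypothetical search space; on Spell the grid would be specified on the
# command line, and each combination would be dispatched as its own run.
grid = {
    "learning_rate": [0.01, 0.001],
    "batch_size": [32, 64],
}

names = list(grid)
runs = [dict(zip(names, values)) for values in product(*grid.values())]
print(len(runs))  # → 4 (one run per combination)
```

With 2 values for each of 2 parameters, the grid yields 2 × 2 = 4 runs; the worker pool then executes these in parallel.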
To learn more about hyperparameter search, see the Hyperparameter Searches guide.
Hyperparameter searches are currently only available on Spell for Teams.
Distributed runs speed up the training of (typically very large) models by partitioning the training process across multiple machines. The Horovod distributed training framework is a first-class citizen in Spell. Spell supports training Horovod-enabled models across as many machine instances as you need.
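In data-parallel training of the kind Horovod implements, each worker processes a distinct shard of the data and gradients are averaged across workers at each step. A stripped-down sketch of just the sharding idea, with no Horovod dependency (the `shard` helper is illustrative, not part of any library):

```python
def shard(dataset, rank, num_workers):
    """Give worker `rank` every num_workers-th example, in the style of a
    distributed sampler used for data-parallel training."""
    return dataset[rank::num_workers]

dataset = list(range(10))
shards = [shard(dataset, r, 4) for r in range(4)]
print(shards)  # → [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Every example lands in exactly one shard, so the four workers collectively cover the dataset once per epoch while each touches only a quarter of it.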
To learn more about distributed training check out our guide to Distributed Runs.
Distributed runs are currently only available on Spell for Teams.
Model servers allow you to serve machine learning models on a Kubernetes cluster managed for you by Spell. Model servers are designed to make it trivially easy to productionize models trained on Spell, allowing you to use one tool for both your model training and model serving.
Model servers are currently only available on Spell for Teams.
To learn more about model servers see the Model Servers guide.
That concludes our high-level tour of Spell! This list is not exhaustive; we support many other features, such as:
- TensorBoard support
- Integration with Weights & Biases
- Cluster management
- Private machine types
- Early stopping
- And more...
To learn more about Spell, check out our blog. For details on specific Spell features, refer to the corresponding section of the User Guide.