On the Spell for Teams plan, you can add private machines that you own, whether at your office, in a datacenter, or at home, as workers in your Spell cluster.
Each private machine must meet the following requirements:
- Ubuntu 18.04 operating system or newer (for DGX users, this means DGX OS 4 or newer).
- Docker installed. Example: sudo apt install docker.io
- At least 50 GB of free hard drive space, although we recommend significantly more for any complex workloads.
For GPU-enabled training you will also need:
- An NVIDIA GPU with recent drivers installed (we test on driver version 440, which you can install with: sudo apt install nvidia-driver-440).
- The NVIDIA Container Toolkit. See NVIDIA's installation guide for instructions.
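The requirements above can be checked before registering a machine. The following is a minimal sketch, not a Spell-provided tool: the function names are illustrative, it assumes GNU coreutils (for sort -V), and it checks free space on / even though Docker may store images on a different filesystem on your machine. It does not check for the NVIDIA Container Toolkit.

```shell
#!/usr/bin/env bash
# Preflight sketch for a machine you plan to register as a Spell worker.
# Thresholds mirror the requirements list; function names are illustrative.

MIN_UBUNTU="18.04"
MIN_FREE_GB=50

# Succeeds if dotted version $1 is >= dotted version $2 (relies on GNU sort -V).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints the free space, in whole GB, of the filesystem containing path $1.
free_gb() {
    df -P -k "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Prints one line per requirement; always returns 0.
preflight() {
    local ubuntu_version=""
    if [ -r /etc/os-release ]; then
        # The subshell keeps the os-release variables out of this shell.
        ubuntu_version="$(. /etc/os-release
            if [ "${ID:-}" = "ubuntu" ]; then echo "${VERSION_ID:-}"; fi)"
    fi
    if [ -n "$ubuntu_version" ] && version_at_least "$ubuntu_version" "$MIN_UBUNTU"; then
        echo "Ubuntu >= $MIN_UBUNTU: ok ($ubuntu_version)"
    else
        echo "Ubuntu >= $MIN_UBUNTU: FAIL (found '${ubuntu_version:-unknown}')"
    fi

    if command -v docker >/dev/null 2>&1; then
        echo "Docker installed: ok"
    else
        echo "Docker installed: FAIL"
    fi

    local gb
    gb="$(free_gb /)"
    if [ "${gb:-0}" -ge "$MIN_FREE_GB" ]; then
        echo "Free disk >= ${MIN_FREE_GB} GB: ok (${gb} GB)"
    else
        echo "Free disk >= ${MIN_FREE_GB} GB: FAIL (${gb:-0} GB)"
    fi

    # Only relevant for GPU-enabled training.
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "NVIDIA driver: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    else
        echo "NVIDIA driver: not found (fine for CPU-only machines)"
    fi
    return 0
}

preflight
```

The script only reports; it never installs or changes anything, so it is safe to run repeatedly while you fix each failing item.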
Creating a new private machine type
From the Clusters page, click on your cluster, then click on "Add New Machine Type". At the top of the modal that pops up, there will be an option to select either "Cloud Instance" or "Private Machine". Selecting "Private Machine" narrows the form to the options relevant to private machines:
- Name. This is the name you pass to the --machine_type parameter when you create runs.
- Additional Images. All private machine workers have the default framework image (TensorFlow 2, PyTorch, and Conda) installed when they connect to Spell. Use these checkboxes to attach additional framework images. Consult the section "Available frameworks" for details.
After clicking "Create" you will be shown an API key created for this new machine type. Keep this API key safe and don't share it with anyone. Copy it down for the next step: you will use it to register your private machines with Spell. If you lose it, you can always return to the cluster page and get it again.
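One way to keep the key at hand on the machine without exposing it to other users is to store it in a file that only your account can read. This is a sketch under assumptions: the helper name and the default file path are hypothetical and nothing in Spell requires or reads this file.

```shell
#!/usr/bin/env bash
# Stores an API key in a file readable and writable only by the current user.
# The default path is a hypothetical choice, not something Spell looks for.
save_api_key() {  # usage: save_api_key <key> [file]
    local key="$1"
    local file="${2:-$HOME/.spell-machine-type-key}"
    # umask covers newly created files; chmod covers files that already existed.
    ( umask 077 && printf '%s\n' "$key" > "$file" ) && chmod 600 "$file"
}
```

To keep the key out of your shell history, read it interactively instead of typing it as an argument, for example: read -rs -p "API key: " key && save_api_key "$key"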
Install the Spell worker service
Log in to your machine. In the terminal, download the Debian package for the Spell worker service:
$ wget https://apt.spell.ml/spell-worker-service.deb
Install the Spell worker service by running:
$ sudo apt install ./spell-worker-service.deb
Enter the API key at the installation wizard prompt. After a successful installation, your machine should be visible on the Clusters page. It may stay in the "Starting" state for a while as the machine connects to Spell and Spell downloads the framework images to the machine's Docker.
Debugging the Spell worker service
The service should start automatically when the package is installed. If it's not working as expected, you can use the following commands to help debug.
To show some status information and log snippets:
$ systemctl status spell-worker.service
To show all Spell worker service logs:
$ journalctl -u spell-worker.service
To control the Spell worker service's execution:
$ sudo systemctl [start | stop | restart] spell-worker.service
Running these commands shouldn't normally be necessary, but they can help identify the issue when something is not behaving as expected.
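When reporting a problem, it can help to capture the output of the status and log commands above in a single file. The sketch below does that; the function name and the output path are illustrative assumptions, and the only unit name used is spell-worker.service from the commands above.

```shell
#!/usr/bin/env bash
# Collects status and recent logs for the Spell worker service into one file.
UNIT="spell-worker.service"

collect_diagnostics() {  # usage: collect_diagnostics <output-file>
    local out="$1"
    {
        echo "== diagnostics for $UNIT =="
        if command -v systemctl >/dev/null 2>&1; then
            echo "active state: $(systemctl is-active "$UNIT" 2>/dev/null)"
            # status exits non-zero for stopped units, so ignore its exit code.
            systemctl --no-pager status "$UNIT" 2>&1 || true
        else
            echo "systemctl not found"
        fi
        if command -v journalctl >/dev/null 2>&1; then
            # Last 200 log lines for the unit, without invoking a pager.
            journalctl --no-pager -n 200 -u "$UNIT" 2>&1 || true
        else
            echo "journalctl not found"
        fi
    } > "$out"
    return 0
}

collect_diagnostics "${1:-/tmp/spell-worker-diagnostics.txt}"
```

To watch the logs live while restarting the service, journalctl's -f flag follows new entries as they arrive: journalctl -u spell-worker.service -f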
Moving and deleting machines
To move a private machine from one machine type to another, first delete the machine from its current machine type by clicking the blue "x" in the rightmost column of the machine details table. Once it is removed, you can register the machine with a new machine type by getting the new machine type's API key and running the following command on the machine itself:
$ sudo dpkg-reconfigure spell-worker-service
Or you can remove the machine from Spell altogether (don't worry, you can always add it back later) by running the following command:
$ sudo apt purge spell-worker-service
Deleting a private machine type has the same effect as deleting a cloud machine type, except that the machines are removed from Spell rather than terminated. Read more about deleting machine types here.