Integrating with Horovod

Users on the Spell for Teams plan can easily distribute deep learning training workloads across machines using our native horovod integration. Horovod is a distributed training SDK that can be used to distribute TensorFlow, Keras, or PyTorch models across multiple GPUs, which can drastically reduce training time.


For an interactive, runnable tutorial on distributed training using Horovod refer to our blog post: "Distributed model training using Horovod".

How it works

The Horovod API introduces new functions that you can wrap around and hook into your existing model training code, enabling Horovod control over the model training loop.

At launch time, Horovod creates a separate copy of the model's weights on every GPU. Every training step, each GPU draws a disjoint data batch and calculates a gradient update based on that batch (e.g. performs forward and backwards propogation) locally. The gradient updates are then synchronized across machines using a ring all-reduce algorithm, averaged, and applied to the local model weights simultaneously.

Thus while gradient calculations are performed asynchronously, gradient updates are made synchronously, ensuring that each copy of the weights matrix (one per GPU) is kept in agreement.

Horovod's ring reduce algorithm has performance characteristic superior to those of earlier implementations, such as the TensorFlow ParameterServerStrategy. At the same time Horovod is framework-independent and works with all of the major deep learning libraries.

Under the hook, Horovod uses the Open MPI and NVIDIA NCCL libraries to perform inter-GPU and inter-machine communication.

To learn more, check out "Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow".

Creating a distributed run


Before creating a distributed run, ensure that your training script has been updated to support it!

The spell run command has a --distributed N option that you can specify to distribute your run across N machines. N must be an integer greater than or equal to 1. For example, the following command will actually run the <command> on two separate T4 machine types simultaneously:

$ spell run --machine-type T4 --distributed 2 <command>

Spell automatically infers the number of GPUs per machine type, if applicable, and will spin up 1 process per GPU per machine. For example, the following command will spin up 8 processes on 4 separate machines to utilize all 32 GPUs:

$ spell run --machine-type T4x8 --distributed 4 <command>

Note that Horovod is not free, it comes with significant code complexity and performance overhead. For this reason we recommend scaling your machine learning training jobs vertically first. E.g. moving up from a V100 to a V100x2, a V100x2 to a V100x4, or a V100x4 to a V100x8, and using the same-machine data parallelization features built into the major frameworks.

In the common case, it only makes sense to begin scaling horizontally (enabling Horovod and training across machines) once you've exhausted your vertical scaling options.

Viewing a distributed run in the web console

Screenshot of a distributed run

The distributed run overview differs from regular Spell run overviews in a few key ways:

  • Hardware and model metrics are aggregated in a dropdown menu, and you can choose which machines to display metrics for by ID.
  • Only run outputs from the run designated the distributed primary are saved. This is in keeping with Horovod recommendations, which strongly advises persisting state (like model checkpoints) to disk on the first GPU in the cluster (the one with rank 0). This helps avoid data corruption. To learn more, refer to our distributed training tutorial.
  • Logs are now cluster-wide, with an updated new format to help clarify who's doing what:

ShellSession May 18, 2020, 11:58:21: running: [1,0]<stdout>:Train Epoch: 10 [7680/30000 (26%)] Loss: 0.126176 May 18, 2020, 11:58:21: running: [1,1]<stdout>:Train Epoch: 10 [7680/30000 (26%)] Loss: 0.156368 May 18, 2020, 11:58:21: running: [1,0]<stdout>:Train Epoch: 10 [8320/30000 (28%)] Loss: 0.300320 May 18, 2020, 11:58:21: running: [1,1]<stdout>:Train Epoch: 10 [8320/30000 (28%)] Loss: 0.208852 May 18, 2020, 11:58:21: running: [1,0]<stdout>:Train Epoch: 10 [8960/30000 (30%)] Loss: 0.231847 May 18, 2020, 11:58:21: running: [1,1]<stdout>:Train Epoch: 10 [8960/30000 (30%)] Loss: 0.270508

Note the introduction of <stdout> (or <stderr>), the stream type, and [n,k], the GPU ID, into the log prefix.

(Advanced) Creating a locally distributed run

Horovod supports multi-GPU training a single machine. The --distributed 1 option can be specified to leverage Open MPI and NCCL to start multiple processes on a single machine if there are multiple GPUs on the machine.

This allows you to use the same API for both single-machine and multi-machine training.

Using the MNIST example in the Horovod repository as an example:

$ git clone
$ cd horovod/examples
$ spell run \
    --machine-type T4x2 --distributed 1 \

(Advanced) Using the Horovod timeline

We automatically enable the Horovod timeline feature when you're doing a distributed run. This can help you analyze the performance of your training. The file is written to the run outputs as horovod_timeline.json. You can open that file in Chrome with chrome://tracing.

Screenshot of Horovod timeline tracing in action.

To learn more about the Horovod timeline refer to the page "Analyze Performance" in the Horovod docs.