Integrating with TensorBoard

TensorBoard is a popular model visualization tool supported by both TensorFlow and PyTorch.

With Spell, you can easily use TensorBoard to visually examine your model training jobs.

Enabling TensorBoard in a run

TensorBoard works by reading event files that your training script creates and appends to over the course of a run. To enable the TensorBoard integration, set the --tensorboard-dir flag in your spell run or spell hyper command to the directory containing these files. For example:

$ spell run --machine-type t4 \
    --github-url \
    --tensorboard-dir /spell/tensorboard/ \
    python models/

Your training script will need to use the TensorBoard API to write to this directory over the course of the run (see our demo training script for an example of how this works).
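As a rough illustration of what "writing to this directory" can look like, here is a minimal sketch that assumes a PyTorch training script. The helper names log_training and make_writer, the "train/loss" tag, and the placeholder loss values are hypothetical and are not taken from the demo script; the directory matches the --tensorboard-dir value used in the command above.

```python
import math


def log_training(writer, num_epochs):
    """Write one scalar per epoch through a SummaryWriter-like object."""
    for epoch in range(num_epochs):
        # Placeholder metric (assumption); a real script would log its actual loss.
        loss = math.exp(-epoch)
        writer.add_scalar("train/loss", loss, epoch)
    # Flush so the event file on disk reflects everything logged so far.
    writer.flush()


def make_writer(log_dir="/spell/tensorboard/"):
    """Create a PyTorch SummaryWriter pointed at the --tensorboard-dir path."""
    # Imported lazily so the logging logic above does not require PyTorch to load.
    from torch.utils.tensorboard import SummaryWriter

    return SummaryWriter(log_dir=log_dir)
```

In a training script this would be used roughly as writer = make_writer(), then log_training(writer, 20), then writer.close(). TensorFlow users would write to the same directory with TensorFlow's own summary API instead.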

Viewing the TensorBoard for a run

You can access the TensorBoard for an in-progress run by visiting the run details page in the web console and clicking the "Open" button in the TensorBoard section.

You can access the TensorBoard for runs that have finished executing in the same place. In this case, the run details card will contain a "Start" button instead.

In-progress runs with TensorBoard enabled will keep a TensorBoard instance running in the background for the duration of the run. As a result, if the run finishes and Spell cleans the machine up while you are still on the TensorBoard page, you will lose access to TensorBoard.

For finished runs, TensorBoard is run on a brand-new machine unrelated to the machine that executed the original run. The TensorBoard log files from the original machine, which were backed up to SpellFS, will be moved onto the new machine, a TensorBoard instance will be spun up, and you will be connected to it in the web console.

You may execute this new TensorBoard run on any machine type you have defined in your organization (we recommend just using the basic CPU instance). The run will queue alongside any other runs requesting that machine type and has the same priority as any other run. It will not appear in the uncategorized runs list page, but it will appear among the runs queued on the cluster management page.

Hyperparameter search jobs launched using spell hyper similarly support the TensorBoard integration using the same --tensorboard-dir flag. For example:

$ spell hyper grid \
    --machine-type t4 \
    --param batch_size=16,32,64 \
    --param conv2_filters=32,64 \
    --github-url \
    --tensorboard-dir /spell/tensorboard/ \
    -- python models/ \
        --epochs 20 \
        --batch_size :batch_size: \
        --conv2_filters :conv2_filters:
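
To make the size of this search concrete, the sketch below enumerates the combinations that the two --param flags above describe and fills in the :name: placeholders for each one. It is only an illustration of what a grid expands to, not Spell's implementation; the expand_grid helper and the PARAMS and TEMPLATE names are hypothetical.

```python
from itertools import product

# Values copied from the --param flags in the command above.
PARAMS = {"batch_size": [16, 32, 64], "conv2_filters": [32, 64]}

# Script arguments from the command above, with :name: placeholders left in.
TEMPLATE = "--epochs 20 --batch_size :batch_size: --conv2_filters :conv2_filters:"


def expand_grid(params, template):
    """Return one rendered argument string per point in the parameter grid."""
    names = list(params)
    rendered_commands = []
    # product() yields every combination: 3 batch sizes x 2 filter counts.
    for values in product(*(params[name] for name in names)):
        rendered = template
        for name, value in zip(names, values):
            rendered = rendered.replace(f":{name}:", str(value))
        rendered_commands.append(rendered)
    return rendered_commands


TRIALS = expand_grid(PARAMS, TEMPLATE)
```

Here len(TRIALS) is 6, one entry per combination of the 3 batch sizes and 2 filter counts.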

As with runs, the TensorBoard for a hyperparameter search can be accessed via the TensorBoard section in the overview card on the hyperparameter search details page.

Stopping a TensorBoard

The TensorBoard for an in-progress run cannot be stopped, as its lifetime is tied to the lifetime of the run. If you are on a run's TensorBoard page in the web console, and the run finishes and shuts down, you will lose access to TensorBoard automatically.

The TensorBoard for a finished run can be stopped by clicking the "Stop" icon at the top right of the TensorBoard screen. All such TensorBoard runs are also stopped automatically after 12 hours.

(Advanced) Comparing multiple runs using TensorBoard

Spell supports using TensorBoard to compare multiple runs.

To use this feature, navigate to the uncategorized runs or project runs page in the web console and select the runs you wish to compare. Then click "Actions", followed by "Start TensorBoard".

The resulting TensorBoard will contain data from all of the selected runs.