Integrating with TensorBoard
With Spell, you can easily use TensorBoard to visually examine your model training jobs.
Enabling TensorBoard in a run
TensorBoard works by reading event files that your code generates and writes to over the course of a run. To enable the TensorBoard integration, set the --tensorboard-dir flag in your spell run or spell hyper command to the directory containing these files. For example:
$ spell run --machine-type t4 \
    --github-url https://github.com/spellml/cnn-cifar10.git \
    --tensorboard-dir /spell/tensorboard/ \
    python models/train.py
Your training script will need to use the TensorBoard API to write to this directory over the course of the run (see our demo training script for an example of how this works).
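As a rough illustration of what "writing to this directory" can look like, here is a minimal sketch that assumes a PyTorch-based script with the tensorboard package installed. It is not the demo script itself: the function names simulated_loss and write_training_logs and the train/loss tag are hypothetical, and the loss values are synthetic stand-ins for a real training loop.

```python
import math
import os


def simulated_loss(epoch: int) -> float:
    """Stand-in for a real training loss: decays smoothly as epochs increase."""
    return math.exp(-0.5 * epoch)


def write_training_logs(log_dir: str, epochs: int = 5) -> None:
    """Write one scalar per epoch to TensorBoard event files under log_dir."""
    # Imported lazily so this module still loads where PyTorch is not installed.
    from torch.utils.tensorboard import SummaryWriter

    os.makedirs(log_dir, exist_ok=True)
    writer = SummaryWriter(log_dir=log_dir)
    try:
        for epoch in range(epochs):
            # In a real script this value would come from the training loop.
            writer.add_scalar("train/loss", simulated_loss(epoch), global_step=epoch)
            # Flush each epoch so the event file is updated while the run is in progress.
            writer.flush()
    finally:
        writer.close()
```

In a run launched with the command above, the script would call write_training_logs("/spell/tensorboard/"), so that the directory it writes to matches the value passed to --tensorboard-dir.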
Viewing the TensorBoard for a run
You can access the TensorBoard for in-progress runs by visiting the run details page in the web console and clicking on the "Open" button in the TensorBoard section.
You can access the TensorBoard for runs that have finished executing in the same place. In this case, the run details card will contain a "Start" button instead.
In-progress runs with TensorBoard enabled will keep a TensorBoard instance running in the background for the duration of the run. As a result, if the run finishes and Spell cleans the machine up while you are still on the TensorBoard page, you will lose access to TensorBoard.
For runs that have finished, TensorBoard runs on a brand-new machine unrelated to the machine that executed the original run. The TensorBoard log files from the original machine, which were backed up to SpellFS, are moved onto the new machine, a TensorBoard instance is spun up, and you are connected to it in the web console.
You may execute this new TensorBoard run on any machine type you have defined in your organization (we recommend just using the basic cpu instance). The run will queue alongside any other runs requesting that machine type, and it has the same priority as any other run. It will not appear in the uncategorized runs list page, but it will appear among the runs queued on the cluster management page.
Enabling TensorBoard in a hyperparameter search
Hyperparameter search jobs launched using spell hyper similarly support the TensorBoard integration using the same --tensorboard-dir flag. For example:
$ spell hyper grid \
    --machine-type t4 \
    --param batch_size=16,32,64 \
    --param conv2_filters=32,64 \
    --github-url https://github.com/spellml/cnn-cifar10.git \
    --tensorboard-dir /spell/tensorboard/ \
    -- python models/train.py \
    --epochs 20 \
    --batch_size :batch_size: \
    --conv2_filters :conv2_filters:
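On the script side, each trial needs to accept the flags that the command passes after the -- separator. The sketch below assumes the :batch_size: and :conv2_filters: placeholders are replaced with concrete values before the script starts. The helper names build_parser and describe_trial, the defaults, and the label format are hypothetical and not taken from the demo repository.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Accept the flags passed to the training script by the search command."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument("--epochs", type=int, default=20)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--conv2_filters", type=int, default=32)
    return parser


def describe_trial(args: argparse.Namespace) -> str:
    """Build a short label for a trial, useful in log messages or scalar tags."""
    return f"bs{args.batch_size}-f{args.conv2_filters}-e{args.epochs}"


# Simulate the arguments one trial of the grid might receive.
example_args = build_parser().parse_args(
    ["--epochs", "20", "--batch_size", "16", "--conv2_filters", "64"]
)
print(describe_trial(example_args))
```

With three batch sizes and two filter counts, a grid search like the one above would cover six parameter combinations, each invoking the script with its own pair of values.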
Viewing the TensorBoard for a hyperparameter search
As with runs, you can access the TensorBoard for a hyperparameter search launched with the --tensorboard-dir flag via the TensorBoard section in the overview card on the hyperparameter search details page.
Stopping a TensorBoard
The TensorBoard for an in-progress run cannot be stopped, as its lifetime is tied to the lifetime of the run. If you are on a run's TensorBoard page in the web console, and the run finishes and shuts down, you will lose access to TensorBoard automatically.
The TensorBoard for a finished run can be stopped by clicking on the "Stop" icon on the top right of the screen in TensorBoard. All such TensorBoard runs are also stopped automatically after 12 hours.
(Advanced) Comparing multiple runs using TensorBoard
Spell supports using TensorBoard to compare multiple runs.
To use this feature, navigate to the uncategorized runs or project runs page in the web console and select the runs you wish to compare. Then click on "Actions", then "Start TensorBoard".
The resulting TensorBoard will contain data from all of the selected runs.