What Is a Workflow

Complex machine learning applications often require multi-stage pipelines (e.g., data loading, transforming, training, testing, iterating). Workflows allow you to manage these pipelines as a sequence of Spell runs, offering a lightweight alternative to tools like Airflow and Luigi for managing your model training pipelines.

Note

For an interactive, runnable tutorial showcasing workflows, refer to the workflow tutorial in the spell/examples repo.

Anatomy of a workflow

Every workflow consists of one master run and one or more worker runs. The master run executes a workflow script responsible for control flow: that is, determining which worker runs should be executed, when, and why. The worker runs then do all of the actual work.

Consider the following state diagram:

In this example there is a single long-lived master run and a set of short-lived worker runs. The worker runs are arranged sequentially, with each subsequent run waiting for the previous run to finish before proceeding.

This is a typical arrangement for a simple machine learning pipeline. For example, the three steps might be "download data", "train a model", and "score the model on test data". Although you could do all three parts in a single run, isolating each step in its own worker run makes the individual steps of your pipeline much easier to manage, edit, and reuse.

Furthermore, because the workflow script is typically written in Python, you have the full expressiveness of code for managing the dependencies between steps in your pipeline.
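Because the control flow is ordinary Python, step dependencies can be declared as data and ordered with the standard library. The sketch below is plain Python, not part of the Spell API. It uses `graphlib.TopologicalSorter` (Python 3.9+) to order the three example steps described above. The step names and the `PIPELINE` mapping are hypothetical. In a real workflow script, each step would correspond to a `client.runs.new(...)` call.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "download_data": set(),
    "train_model": {"download_data"},
    "score_model": {"train_model"},
}


def execution_order(pipeline):
    """Return the step names in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


if __name__ == "__main__":
    for step in execution_order(PIPELINE):
        # A real workflow script would launch a worker run here
        # and wait for it before moving on to dependent steps.
        print("launching", step)
```

With this approach, adding a branch such as a second evaluation step only means adding an entry to the mapping. The ordering logic stays the same.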

Creating a workflow script

The workflow script is the script that will be executed by the master run. The script can be written using either the Spell CLI or the Spell Python API. Here is an example of a basic workflow script written in Python:

# spellrun/examples/workflows/simple.py
import spell.client
client = spell.client.from_environment()

print(client.active_workflow)

r1 = client.runs.new(command="echo Hello World! > foo.txt")
r1.wait_status(*client.runs.FINAL)
r1.refresh()
if r1.status != client.runs.COMPLETE:
    raise OSError(f"failed at run {r1.id}")

r2 = client.runs.new(
    command="cat /mnt/foo.txt",
    attached_resources={f"runs/{r1.id}/foo.txt": "/mnt/foo.txt"}
)
r2.wait_status(*client.runs.FINAL)
r2.refresh()
if r2.status != client.runs.COMPLETE:
    raise OSError(f"failed at run {r2.id}")

print("Finished workflow!")
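The launch, wait, refresh, and check sequence above is repeated for every worker run, so it can be factored into a helper. The sketch below is hypothetical and not part of the Spell API. `run_and_check` only uses the calls that already appear in simple.py: `client.runs.new`, `wait_status`, `refresh`, `status`, `client.runs.FINAL`, and `client.runs.COMPLETE`.

```python
def run_and_check(client, **run_kwargs):
    """Launch a worker run, block until it finishes, and raise if it did not complete.

    Hypothetical helper: keyword arguments are passed straight through to
    client.runs.new (e.g. command=..., attached_resources=...).
    """
    run = client.runs.new(**run_kwargs)
    run.wait_status(*client.runs.FINAL)  # block until the run reaches a final state
    run.refresh()  # re-fetch the run so .status reflects that final state
    if run.status != client.runs.COMPLETE:
        raise OSError(f"failed at run {run.id}")
    return run


# Usage inside a workflow script (requires the spell package):
#   client = spell.client.from_environment()
#   r1 = run_and_check(client, command="echo Hello World! > foo.txt")
#   r2 = run_and_check(
#       client,
#       command="cat /mnt/foo.txt",
#       attached_resources={f"runs/{r1.id}/foo.txt": "/mnt/foo.txt"},
#   )
```

Because the helper returns the finished run, its ID remains available for mounting outputs into the next step, as `r2` does above.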

To learn more about workflow script best practices and the Python API, check out the workflows tutorial in our examples repository.

Running a workflow

Use the spell workflow command to create and run a workflow. To execute the example workflow above you would run:

$ spell workflow \
    --github-repo https://github.com/spellrun/examples.git \
    "python workflows/simple.py"

The master run is still just a run under the hood, so the spell workflow command looks and feels like the spell run command. There is one important difference: the addition of the --repo flag.

You can use the --repo flag to pass named local git repositories present inside of the master run to the worker runs it manages. This allows you to easily parameterize the code environment in your worker runs right from the command line. To see this feature in action, check out the workflows tutorial in our examples repository.

Viewing a workflow

You can view the master run's logs from the command line with spell logs RUN_ID, where RUN_ID is the master run's ID. spell ps, which lists in-progress and recently exited runs, is also helpful for tracking the progress of your workflows.

However, the easiest place to view and manage your workflows is the workflows page in the web console:

You can use the web console to click through to constituent runs, view their logs, and stop or terminate them.

Interrupting a workflow

To interrupt a workflow, stop or kill its constituent runs. Note that terminating the master run will not stop any worker runs that are already in progress; you will need to terminate these yourself. To learn more about the run termination APIs, refer to "Interrupting a run" in the run docs.
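Because worker runs outlive a failed master, a workflow script can clean up after itself when it fails on its own, for example when a worker run does not complete. The sketch below is hypothetical. It assumes Spell `Run` objects expose `refresh()`, `status`, and a `kill()` method, and that `client.runs.FINAL` lists the terminal statuses. Verify these against the run docs before relying on it. It cannot help when the master run is killed abruptly from outside, because the script then gets no chance to run its cleanup.

```python
from contextlib import contextmanager


@contextmanager
def kill_workers_on_error(client):
    """Track worker runs launched in the block; kill unfinished ones if the block raises.

    Hypothetical helper: assumes Run.refresh(), Run.status, Run.kill(),
    and client.runs.FINAL behave as described in the lead-in.
    """
    launched = []
    try:
        yield launched
    except BaseException:
        for run in launched:
            run.refresh()  # get the latest status before deciding to kill
            if run.status not in client.runs.FINAL:
                run.kill()  # assumed API: terminate a still-running worker
        raise  # re-raise so the master run still reports the failure


# Usage inside a workflow script:
#   with kill_workers_on_error(client) as launched:
#       train = client.runs.new(command="python train.py")
#       launched.append(train)
#       train.wait_status(*client.runs.FINAL)
```

Re-raising after cleanup keeps the master run's own status accurate, so a failed workflow still shows up as failed in spell ps and the web console.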