Model Servers

Model servers allow you to stand up and serve any model you create and train on Spell as a realtime HTTPS endpoint. Spell model servers are deployed using Kubernetes, providing self-healing, autoscaling, and zero-downtime rollouts.

Note

In order to launch model servers you will first need to initialize your serving cluster. To learn more refer to Serving Cluster Management.

Model servers are available on the Spell for Teams plan.

Model servers quickstart

Note

For an introduction to Spell model servers in code check out our CIFAR quickstart notebook.

Before you can create a model server, you first need to create a model. Models are first-class primitives in Spell. The "Models" tab in the sidebar lists all of the models currently registered to your organization, and you can click through to a model from that list to view its details:

Image of the Model details page.

A model has, at a minimum, a name (cifar in this example), a version (v1), and a set of resources (../keras_cifar10_trained_model.h5) from a run (runs/16) it is associated with. In this example, the associated run is from a hyperparameter search, so the details card links back to the search job it came from.

The easiest way to create a model is to use the spell model create CLI command:

$ spell model create cifar10:v1 runs/16

To deploy this model, you will also need to provide a model server script. A model server script is a Python script in the following format:

from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        pass

    def predict(self, payload):
        pass

Any server instantiation code in the __init__ method will be run before the server begins accepting requests. Typically, this is where you would load the model from disk, using e.g. keras.models.load_model.

The predict method is what is called at runtime. Spell handles serializing model input (in the simple case, from JSON to a dict passed to the payload parameter) and deserializing model output (in the simple case, back to JSON).
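For example, a filled-in Predictor for this quickstart might look something like the sketch below. It assumes the trained Keras model file is available at /model/cifar10/keras_cifar10_trained_model.h5 (model files are mounted under /model/<model-name>, as described in the next section) and that requests arrive as JSON with an "image" key holding 32x32x3 pixel values; both of these are assumptions you would adjust to match your own model and payload.

import numpy as np
from tensorflow.keras.models import load_model

from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        # Load the trained model once, before the server starts accepting requests.
        self.model = load_model("/model/cifar10/keras_cifar10_trained_model.h5")

    def predict(self, payload):
        # payload is the deserialized JSON request body.
        image = np.asarray(payload["image"], dtype="float32").reshape(1, 32, 32, 3)
        scores = self.model.predict(image)
        # Return a JSON-serializable dict.
        return {"class": int(np.argmax(scores))}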

Once you have a model and a model server script, the final step is combining the two and actually deploying the model server by using spell server serve:

$ spell server serve \
    cifar10:v1 predictor.py

Here, cifar10 is the model, v1 is the version, and predictor.py is the path to the model server script on local disk. (Note that this command assumes you've already stood up a default serving cluster; we're omitting that step here for simplicity.)

Successfully running this command will stand up a new model server instance, viewable through the "Model Servers" page in the Spell web console:

Image of the Model Server details page.

This page provides model server details, hardware metrics, user metrics, and logs. These details are also available in the Spell CLI.

You can test that the model server is healthy and works the way you expect by hitting it with e.g. curl:

$ curl -d '{"hello": "world"}' \
     -X POST \
     https://$REGION.$SPELL_ORG.spell.services/$SPELL_ORG/$MODEL_NAME/predict

Replace the $REGION, $SPELL_ORG, and $MODEL_NAME parameters with the values specific to your model server instance, and '{"hello": "world"}' with an example payload your model server understands. You can copy these values to your clipboard from the cURL Command field on the model server details page.

Creating model servers

Model servers are created using the spell server serve command. This command takes a model version and an entrypoint script (on your local machine).

$ spell server serve cifar:v1 predictor.py

Once started, this will create and schedule a Kubernetes deployment that hosts the model server instances. The files for the model will be available in the /model/cifar directory within the model server.

The model server entrypoint should have the following format:

from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        pass

    def predict(self, payload):
        pass

The __init__ method will be run at the start of every server, and will complete before the server begins accepting incoming requests. This method should be used to load the model and to run any expensive preprocessing that can be done ahead of time, so that the prediction itself is as fast as possible.

Note

Model servers are multi-processed and distributed amongst multiple Kubernetes pods. Any modifications to the state in your Predictor will not be propagated to other instances of your Predictor.

The predict method will be called every time a new request is received, and should be used to run inference on the model.

Model servers take JSON input by default; the body of the request will be deserialized and passed as a dict to the payload argument (but see the section "Handling non-JSON requests" for an alternative code path). The return value can be of any of the following types:

  • Python str, bytes, float, int or UUID
  • A Numpy scalar or ndarray
  • A JSON-serializable dict which can contain UUIDs and Numpy scalars and ndarrays
  • A Starlette Response object

Note that Starlette is the Pythonic ASGI framework that Spell uses as our model server middleware.
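As an illustration of these return types, the sketch below returns a dict containing a Numpy array, which Spell serializes to JSON for the client; the field names here are just placeholders.

import numpy as np

from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload):
        scores = np.array([0.1, 0.7, 0.2])
        # Numpy scalars and ndarrays inside the dict are serialized to JSON automatically.
        return {"scores": scores, "predicted_class": int(scores.argmax())}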

Model servers have the following naming rules:

  1. All characters are alphanumeric, dot (.), dash (-), or underscore (_)
  2. The first character is alphanumeric
  3. Symbols do not repeat (e.g. --)

Assigning model servers to a node group

Model servers will be assigned to the default node group by default. To assign the model server to a different node group, pass that node group's name to the --node-group parameter:

$ spell server serve \
    --node-group t4 \
    cifar:v1 predictor.py

Refer to the page Serving Cluster Management for more information on node group creation and management.

Updating model servers

All components of a model server except its name can be updated using the spell server update command.

Spell model servers are tied to a specific git repository and commit. For example, suppose you are using the file serve.py as your entrypoint. To update the model server to the newest version of the repo, use:

$ spell server update --entrypoint serve.py

To update the model server to use the file serve_better.py instead, use:

$ spell server update --entrypoint serve_better.py

Refer to "Mounting code from a local repository" in the run overview for more information on how our git integration works (model servers work the same way, and support the same commands, that runs do).

Because Spell performs a Kubernetes rolling update under the hood, model server updates are zero-downtime. This means you can update your model server to a newer version—by, say, switching to a newer version of the model being served—without taking your service offline.
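For example, once you have registered a v2 of the cifar model, you could roll the running server over to it with something like the following (this assumes the server is named cifar; the --model flag is the same one used when updating multi-model servers, described below):

$ spell server update \
    --model cifar:v2 \
    cifar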

Stopping or deleting model servers

Model servers can be stopped using spell server stop. This will take the model server offline (unschedule it from the serving cluster) without deleting it entirely. The model server can be restarted later using the spell server start command.

To delete the model server forever, use the spell server rm command. If the server is currently running, you will need to either run spell server stop first or use the --force flag.
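For example, assuming a server named cifar and that each command takes the server name as its argument, these commands would look like:

$ spell server stop cifar
$ spell server start cifar
$ spell server rm cifar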

These actions are also possible in the Spell web console.

Managing model server dependencies

Like Spell runs and workspaces, Spell model servers support the installation of additional code packages using the --pip, --pip-req, --conda-file, and --apt flags. Environment variables can be configured using --env. To learn more about these flags, refer to the page What Is a Run.
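For example, a serve command that installs an extra pip package, an apt package, and sets an environment variable might look something like this sketch (the package names and the KEY=VALUE form for --env are illustrative):

$ spell server serve \
    --pip pillow \
    --apt libsndfile1 \
    --env LOG_LEVEL=debug \
    cifar:v1 predictor.py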

Model servers always use the default Spell framework as their base image. Custom Docker images are not currently supported.

Managing model server mounts

Like Spell runs and workspaces, Spell model servers support resource mounts using the --mount flag:

$ spell server serve \
    --mount uploads/config \
    --mount uploads/text:vocab \
    cifar10:example predictor.py

Note that unlike runs, mounts in model servers are restricted to the /mounts directory, and any specified destination is interpreted as being relative to /mounts. In this example, the contents of uploads/config would be available within the server from /mounts/config. The contents of uploads/text would be available as /mounts/vocab.

Model server mounts can be managed after creation using either the spell server update or the spell server mounts commands. To update all the mounts of a model server, you can use the spell server update command. For example, to update the above server to have both uploads/config and uploads/info:some_info, you would run

$ spell server update \
    --mount uploads/config \
    --mount uploads/info:some_info \
    cifar10

To add mounts to a server, you can use the spell server mounts add command, and to remove mounts from a server, you can use the spell server mounts rm command.

Accessing model metadata within a model server

Some metadata about the models can be found on the self.model_info attribute of the Predictor. This is a list of namedtuples containing fields for the name and version of the model.
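For example, you could use this attribute in __init__ to locate each model's files without hard-coding paths; this sketch assumes the namedtuple fields are accessible as .name and .version, and that each model's files live under /model/<name> as described above.

import os

from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        for model in self.model_info:
            model_dir = os.path.join("/model", model.name)
            print(f"loading {model.name}:{model.version} from {model_dir}")
            # ... load the files found in model_dir ...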

Managing server autoscaling

spell server serve allows tuning model server autoscaling and scheduling behavior. The following options are provided (most of these should be familiar to anyone who has worked with Kubernetes before):

  • --min-pods, --max-pods: Set the minimum and maximum number of pods that autoscaling should schedule.
  • --target-cpu-utilization: Set the average CPU usage at which to signal the autoscaler to schedule a new pod.
  • --target-requests-per-second: Set the average HTTP(S) requests per second, per pod, at which to signal the autoscaler to schedule a new pod. Can be used in combination with --target-cpu-utilization.
  • --cpu-request, --ram-request: Set the CPU/memory request values of the pods to adjust how they are scheduled on the cluster.
  • --cpu-limit, --ram-limit: Set CPU/memory limits on the pods to limit the amount of resources they consume on the node.
  • --gpu-limit: Number of GPUs allocated to each pod. Fractional GPU limits are not supported.

These values can be tuned after the fact using spell server update or the Spell web console.
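For example, a serve command that keeps the deployment between 2 and 10 pods and allocates one GPU per pod might look something like this sketch (the values are illustrative, not recommendations):

$ spell server serve \
    --min-pods 2 \
    --max-pods 10 \
    --gpu-limit 1 \
    --node-group t4 \
    cifar:v1 predictor.py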

Model server scaling parameters view

Viewing model server logs

Any output to stdout or stderr within the model server will be logged. To retrieve the logs of a running model server, use either the spell server logs command or the server details page. Logs are broken up by pod.
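For example, assuming a server named cifar:

$ spell server logs cifar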

Model server logs view

Viewing and writing model server metrics

Certain hardware metrics (CPU usage and memory per pod) and request metrics (request rate, request failures, and latency) are logged for you automatically. These are viewable in the metrics section of the model server details page.

Model server hardware metrics view

You can also log user metrics from your model server. User metrics can be logged using the spell.serving.metrics.send_metric function in the Spell Python API:

from spell.serving import BasePredictor
import spell.serving.metrics as spell_metrics

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload):
        ...
        spell_metrics.send_metric("name", value, tag="tag")

name and value are required fields containing the metric name and (integer or floating point) value, respectively. The optional tag field may be used to create multiple different time series of data under the same metric name. A complete example is available on GitHub.

Metrics are logged to the "User Metrics" section of the model server details page:

Model server user metrics view

(Advanced) Viewing the Grafana instance

Note

For an in-depth guide to using Prometheus and Grafana with Spell, refer to our blog posts "Model observability in Spell model servers with Prometheus and Grafana" and "Using Grafana for dashboarding and alerting".

It is possible to explore model server metrics further using Grafana, a data visualization and monitoring engine installed in the model serving cluster. For convenience, a link to Grafana is provided on the model server summary page. Log in using your cluster name as the username and the output of spell kube-cluster get-grafana-password as the password.

Model server grafana dashboard

Grafana can be queried using PromQL, the Prometheus query language.

To select all user metrics for a model server, use

{service="<model-serving-id>", spell_metric_name=~".+"}

To select a single user metric with the name <metric name>, use the query

{service="<model-serving-id>", spell_metric_name="<metric name>"}

PromQL also supports summing, averaging, and taking rates. For more information, consult the Prometheus documentation.
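For example, queries along these lines would average a user metric over five-minute windows, or take the per-second rate of a counter-style metric (the metric name is a placeholder):

avg_over_time({service="<model-serving-id>", spell_metric_name="<metric name>"}[5m])

rate({service="<model-serving-id>", spell_metric_name="<metric name>"}[5m])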

Grafana can also be used to monitor cluster-wide resource utilization. See the Serving Cluster docs.

(Advanced) Serving multiple models or no models

Model servers can serve multiple models. To launch a new model server with multiple models, you can use the spell server serve command and separate your models by spaces. For example:

$ spell server serve \
    --name multimodels \
    modelA:v1 modelB:v3 predictor.py

These models will be available in the model server at /model/modelA and /model/modelB. Spell currently does not support using multiple versions of the same model, and a --name must be provided for servers with multiple models.

To update all the models of a model server, you can use the spell server update command. For example, to update the above server to have both modelA:v1 and modelC:v2, you would run

$ spell server update \
    --model modelA:v1 \
    --model modelC:v2 \
    multimodels

To add models to a server, you can use the spell server models add command, and to remove models from a server, you can use the spell server models rm command.

Similarly, model servers can choose to serve no models:

$ spell server serve \
    --name nomodels \
    predictor.py

(Advanced) Using health checks

Spell model servers expose an additional health check endpoint at /health. By default, this endpoint responds with {"status": "ok"} if the server is able to respond at all:

$ curl https://$REGION.$SPELL_ORG.spell.services/$SPELL_ORG/$MODEL_NAME/health
{"status":"ok"}

You can customize this behavior by providing a health method on your Predictor class:

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload):
        ...

    def health(self):
        ...

To indicate that the endpoint is unhealthy, either raise an error (we recommend using the Starlette HTTPException class), or return a Starlette response with a non-2XX status code.

Any other valid response from this endpoint will be interpreted as a 200 OK. The health method can return any types the predict method can.
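For example, a health check that verifies the model finished loading might look something like the sketch below; the self.model attribute is an assumption about how your Predictor stores its model.

from starlette.exceptions import HTTPException

from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload):
        ...

    def health(self):
        # Report unhealthy if the model never finished loading.
        if getattr(self, "model", None) is None:
            raise HTTPException(status_code=503, detail="model not loaded")
        return {"status": "ok", "model_loaded": True}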

(Advanced) Handling non-JSON requests

As the section Creating model servers explains, Spell model servers default to JSON input. However, you can switch to using any request type supported by Starlette, our serving middleware, by adding a starlette.requests.Request type annotation to the payload field in predict:

from starlette.requests import Request

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload: Request):
        ...

    def health(self, request: Request):
        ...

Note that this feature is also available on the health endpoint.

Using this syntax, any parameter given the Request type annotation will receive the full Request.
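For example, combined with async support (described in the Async support section below), a predict method could read the raw request body directly, say to accept image bytes rather than JSON:

from starlette.requests import Request

from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        ...

    async def predict(self, payload: Request):
        # Bypass JSON deserialization and read the raw request body.
        raw_bytes = await payload.body()
        content_type = payload.headers.get("content-type")
        ...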

Decorator syntax is also supported:

from spell.serving import with_full_request

class Predictor(BasePredictor):
    def __init__(self):
        ...

    @with_full_request(name="payload")
    def predict(self, payload):
        ...

    @with_full_request()
    def health(self, request):
        ...

By default, this decorator will pass the Request into a parameter named "request", but this can be overridden by providing a "name" argument to the decorator.

(Advanced) Launching background tasks

Because Spell model servers are designed to be queried online, request latency is typically important. However, some of the work associated with handling a prediction request, such as logging to a model monitoring solution, isn't directly required for returning a prediction.

To remove this work from the critical path of the request, thereby decreasing the request latency, Spell model servers support Starlette background tasks. Background tasks allow you to schedule functions to be executed after the prediction response has been returned from the server. These tasks' functions can be either asynchronous or synchronous.

Note

For an example of this feature in action, see the serve_async.py entrypoint from our blog post "ML Observability with Spell and Arize".

This can be done using either type annotations or decorators.

Using type annotations:

from starlette.background import BackgroundTasks

async def some_task(foo, bar=1):
    ...

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload, tasks: BackgroundTasks):
        ...
        tasks.add_task(some_task, "my_foo", bar=3)

    def health(self, background: BackgroundTasks):
        ...
        background.add_task(some_task, "other_foo")

Using this syntax, any parameter given the BackgroundTasks type annotation will receive the BackgroundTasks object.

Alternatively, using decorators:

from spell.serving import with_background_tasks


async def some_task(foo, bar=1):
    ...

class Predictor(BasePredictor):
    def __init__(self):
        ...

    @with_background_tasks(name="bg_tasks")
    def predict(self, payload, bg_tasks):
        ...
        bg_tasks.add_task(some_task, "my_foo", bar=3)

    @with_background_tasks()
    def health(self, tasks):
        ...
        tasks.add_task(some_task, "my_foo", bar=3)

By default, this decorator will pass the BackgroundTasks into a parameter named tasks, but this can be overridden by providing a name argument to the decorator.

(Advanced) Async support

Spell model servers support asynchronous methods out of the box:

class Predictor(BasePredictor):
    def __init__(self):
        ...

    async def predict(self, payload):
        ...

    async def health(self, request):
        ...

Note

For an example of this feature in action, see the serve_async.py entrypoint from our blog post "ML Observability with Spell and Arize".

(Advanced) Adding Predictor configuration YAML

Configuration information can be provided to the Predictor as either a JSON or YAML file using the --config flag. For example, if you had a configuration file named predict-config.yaml such as

use_feature_x: true
additional_info:
  output_type: bytes

Then you could modify the __init__ of the Predictor to accept this configuration by adding additional arguments

class Predictor(BasePredictor):
    def __init__(self, use_feature_x, additional_info=None):
        self.use_feature_x = use_feature_x # True
        self.additional_info = additional_info or {}
        # {"output_type": "bytes"}
        ...

The serve command would now be

$ spell server serve \
    --config /path/to/predict-config.yaml \
    cifar:v1 predictor.py

(Advanced) Enabling server-side batching

Note

For a detailed explanation of this feature's appeal, including performance benchmarks, refer to our blog post "Online batching with Spell serving".

Model servers can use online batching to unlock significant improvements in throughput performance. With batching enabled, a proxy is put in front of your model server on each pod. The proxy batches incoming requests, dispatching them to your model server once either a configurable amount of time has elapsed or the batch reaches a configurable maximum size.

Batching can be enabled for a model using the --enable-batching flag in the spell server serve command. This will use a maximum batch size of 100 and a request timeout of 500ms. These can be configured using the --max-batch-size and --request-timeout flags respectively.
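For example, to enable batching with a smaller maximum batch size and a tighter timeout (the values are illustrative, and the timeout is assumed to be given in milliseconds, matching the 500ms default):

$ spell server serve \
    --enable-batching \
    --max-batch-size 32 \
    --request-timeout 100 \
    cifar:v1 predictor.py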

Model servers without batching take a dict as input. Model servers with batching enabled receive a list of dict objects instead. To use a model server with batching enabled, update the signature of the predict function appropriately. For example:

from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload):
        return [self.do_predict(item) for item in payload]

The return type must be an iterable of types supported by the unbatched model server. To ensure that the predictions are routed to the correct requests, the order of the predictions must match the order of the requests in the batch.

(Advanced) Using server-side batching with Starlette requests

The predict method in batch-enabled Predictors cannot use the full Starlette Request object. To access this object, you can add a prepare method to your predictor.

The prepare method has the same signature as the unbatched Predictor's predict method, including the ability to access the full Request object and spawn background tasks.

from starlette.responses import PlainTextResponse
from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def prepare(self, payload):
        if not self.validate_payload(payload):
            return PlainTextResponse("Bad data sent!", status_code=400)
        return self.extract_data_from_payload(payload)

    def predict(self, payload):
        data = self.transform(payload)
        return self.do_predict(data)

The prepare method should be used to do computationally inexpensive tasks such as reading and validating the request. If a Response object with an error status code is returned from prepare, the response will be immediately returned to the client, and the request will not be batched.

The prepare method must return a type serializable as a Python pickle.

Note

The balance of computational work done in the prepare versus the predict methods has a significant impact on overall throughput. It should be considered a tunable parameter.