Model Servers
Model servers allow you to stand up and serve any model you create and train on Spell as a realtime HTTPS endpoint. Spell model servers are deployed using Kubernetes, providing self-healing, autoscaling, and zero-downtime rollovers.
Note that in order to use Spell model servers you will first need to initialize your serving cluster. To learn more, refer to the page Serving Cluster Management.
Model servers are available on the Spell for Teams plan.
Core concepts
Note
For a brisk introduction to Spell model servers in code, check out our CIFAR quickstart notebook.
Before you can create a model server, you first need to create a model. Models are first-class primitives in Spell. The "Models" tab in the sidebar lists all of the models currently registered to your organization, and you can click through to a model from that list to view its details:
A model has, at a minimum, a name (cifar in this example), a version (v1), and a set of resources (../keras_cifar10_trained_model.h5) from a run (runs/16) it is associated with. In this example, the associated run is from a hyperparameter search, so the details card links back to the search job it came from.
The easiest way to create a model is to use the spell model create CLI command:
$ spell model create runs/16
In order to deploy this model, you will also need to provide a model server script. A model server script is a Python script in the following format:
from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        pass

    def predict(self, payload):
        pass
Any server instantiation code in the __init__ method will be run before the server begins accepting requests. Typically, this is where you would load the model from disk, using e.g. keras.models.load_model.
The predict method is what is called at runtime. Spell handles deserializing model input (in the simple case, from JSON to a dict passed to the payload parameter) and serializing model output (in the simple case, back to JSON).
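As a concrete illustration, here is a minimal sketch of a filled-in predictor for the CIFAR example above. The model path, the payload shape, and the preprocessing are assumptions made for illustration; adapt them to your own model files and input format.

import numpy as np
from tensorflow import keras
from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        # Assumed location of the trained model file inside the server.
        self.model = keras.models.load_model("/model/keras_cifar10_trained_model.h5")

    def predict(self, payload):
        # Assumes the client posts JSON like {"image": [[...]]} holding a 32x32x3 array.
        image = np.array(payload["image"], dtype="float32") / 255.0
        probabilities = self.model.predict(image[np.newaxis, ...])[0]
        return {"class": int(np.argmax(probabilities))}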
Once you have a model and a model server script, the final step is combining the two and actually deploying the model server by using spell server serve:
$ spell server serve \
    cifar10 predictor.py
Where cifar10 is the model and predictor.py is the path to the model server script on local disk (note that this command assumes you've already stood up a default serving cluster; we're omitting that step here for simplicity).
Successfully running this command will stand up a new model server instance, viewable through the "Model Servers" page in the Spell web console:
This page provides model server details, hardware metrics, user metrics, and logs. These details are also available in the Spell CLI.
You can test that the model server is healthy and works the way you expect by hitting it with e.g. curl:
$ curl -d '{"hello": "world"}' \
    -X POST \
    https://$REGION.$SPELL_ORG.spell.services/$SPELL_ORG/$MODEL_NAME/predict
Replace the $REGION, $SPELL_ORG, and $MODEL_NAME parameters with the ones specific to your particular model server instance, and '{"hello": "world"}' with an example payload your model server understands. You can copy these values to the clipboard from the cURL Command field on the model server details page.
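You can also exercise the endpoint from Python. The sketch below uses the requests library with a placeholder URL and payload; substitute your own region, organization, model name, and an input your model understands.

import requests

# Placeholder endpoint values; use the ones from your model server details page.
url = "https://us-west1.myorg.spell.services/myorg/cifar/predict"

response = requests.post(url, json={"hello": "world"})
response.raise_for_status()
print(response.json())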
Creating models
A model assigns a name and a version to a set of resources. Once model files have been registered to SpellFS, either as the output of a run or as an upload, they can be bundled into a model using the spell model create command.
$ spell model create mymodel runs/1
By default, Spell will assign the model an incrementing version number (e.g. v1, v2, etc.), but a custom version string can be assigned using the --version flag. Note that an auto-incremented version number is still assigned if a custom version is given, and both the custom version string and the auto-incremented version number are valid identifiers for the model version.
$ spell model create --version proof-of-concept mymodel runs/1
If the resource contains files that are not relevant to the model, you can also specify the desired files contained within the model using the --file flag.
$ spell model create --file out/model.hd5 --file out/context.json mymodel runs/1
Models can also be created from the web. The easiest way to do this is to use the Create Model button on the Models page in the web console:
Alternatively, to create a model from a completed run with output, use the Create Model entry in the drop-down menu while browsing the run. Or, to create a model from an upload, use the drop-down menu in the resource browser.
Listing models
List all models using the spell model command. This will list the model name, latest version, and other metadata about the model.
$ spell model
NAME LATEST VERSION CREATED CREATOR
cifar v1 7 days ago ***
bert_squad demo (v8) 11 days ago ***
roberta_squad v3 7 days ago ***
You can view all the versions of a specific model using spell model describe.
$ spell model describe cifar
VERSION RESOURCE FILES CREATED CREATOR
v1 runs/3 saved_models/keras_cifar10_trained_model.h5:model.h5 7 days ago ***
This information and these actions are also available in the Spell web console.
Deleting models
Models can be deleted using spell model rm, or using the Spell web console. You can choose whether to delete a specific version of the model, or the model as a whole.
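As a sketch, assuming spell model rm accepts the same model and model:version identifiers used elsewhere on this page, removing a single version versus the whole model might look like:

$ spell model rm cifar:v1
$ spell model rm cifar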
Creating model servers
Model servers are created using the spell server serve command. This command takes a model version and an entrypoint script (on your local machine).
$ spell server serve cifar:v1 predictor.py
Once started, this will create and schedule a Kubernetes deployment that hosts the model server instances. The files for the model will be available in the /model/ directory within the model server.
The model server entrypoint should have the following format:
from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        pass

    def predict(self, payload):
        pass
The __init__ method will be run at the start of every server, and will complete before the server begins accepting incoming requests. This method should be used to load the model and run any expensive preprocessing that can be done before the request comes in, to make the prediction itself as fast as possible.
Note
Model servers are multi-processed and distributed amongst multiple Kubernetes pods. Any modifications to the state in your Predictor will not be propagated to other instances of your Predictor.
The predict method will be called every time a new request is received, and should be used to run inference on the model.
Model servers take JSON input by default; the body of the request will be passed as a dict to the payload argument (but see the section "Handling non-JSON requests" for an alternative code path). The return value can be any of the following types:
- A JSON-serializable dict
- bytes
- str
- A Starlette Response object
Note that Starlette is the Pythonic ASGI framework that Spell uses as our model server middleware.
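For example, a predict method can return a plain dict for the common JSON case, or construct a Starlette response directly when it needs control over the status code or content type. The sketch below assumes a hypothetical run_inference helper and a payload with a "features" field:

from starlette.responses import PlainTextResponse
from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload):
        if "features" not in payload:
            # A Starlette Response gives full control over the status code.
            return PlainTextResponse("missing 'features' field", status_code=400)
        # A dict is serialized back to JSON by Spell.
        return {"prediction": self.run_inference(payload["features"])}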
Model servers have the following naming rules:
- All characters are alphanumeric, dot (.), dash (-), or underscore (_)
- The first character is alphanumeric
- Symbols do not repeat (e.g. --)
Assigning model servers to a node group
Model servers will be assigned to the default node group by default. To assign the model server to a different node group, pass that node group's name to the --node-group parameter:
$ spell server serve \
    --node-group t4 \
    cifar:v1 predictor.py
Refer to the page Serving Cluster Management for more information on node group creation and management.
Updating model servers
All components of a model server except its name can be updated using the spell server update command.
Because Spell performs a Kubernetes rolling update under the hood, model server updates are zero-downtime. This means you can update your model server to a newer version—by, say, switching to a newer version of the model being served—without taking your service offline.
Stopping or deleting model servers
Model servers can be stopped using spell server stop. This will take the model server offline (unschedule it from the serving cluster) without deleting it entirely. The model server can be restarted later using the spell server start command.
To delete the model server forever, use the spell server rm command. If the server is currently running, you will need to either run spell server stop first or use the --force flag.
These actions are also possible in the Spell web console.
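As a sketch, assuming these commands take the server name as their argument (the exact argument format may differ from what is shown here):

$ spell server stop cifar10
$ spell server start cifar10
$ spell server rm --force cifar10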
Managing model server dependencies
Like Spell runs and workspaces, Spell model servers support the installation of additional code packages using the --pip, --pip-req, --conda-file, and --apt flags. Environment variables can be configured using --env. To learn more about these flags, refer to the page What Is a Run.
Model servers always use the default Spell framework as their base image. Custom Docker images are not currently supported.
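For example, a serve command that installs an extra pip package and an apt package and sets an environment variable might look like the sketch below; the package names are placeholders, and the --env NAME=value format shown here is an assumption:

$ spell server serve \
    --pip pillow \
    --apt libsndfile1 \
    --env LOG_LEVEL=debug \
    cifar:v1 predictor.py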
Managing model server mounts
Like Spell runs and workspaces, Spell model servers support resource mounts using the --mount flag:
$ spell server serve \
    --mount uploads/config \
    --mount uploads/text:vocab \
    cifar10:example predictor.py
Note that unlike runs, mounts in model servers are restricted to the /mounts directory, and any specified destination is interpreted as being relative to /mounts. In this example, the contents of uploads/config would be available within the server from /mounts/config. The contents of uploads/text would be available as /mounts/vocab.
Accessing model metadata within a model server
Some metadata about the current model can be found on the self.model_info attribute of the Predictor. This is a namedtuple containing fields for the name and version of the model.
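For instance, you could tag every prediction with the model version that produced it; a minimal sketch, where run_inference is a hypothetical helper:

from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload):
        result = self.run_inference(payload)  # hypothetical inference helper
        return {
            "prediction": result,
            "model_name": self.model_info.name,
            "model_version": self.model_info.version,
        }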
Managing server autoscaling
spell server serve allows tuning model server autoscaling and scheduling behavior. The following options are provided (most of these should be familiar to anyone who has worked with Kubernetes before); an example invocation follows the list:
- --min-pods, --max-pods: Set the minimum and maximum number of pods that autoscaling should schedule.
- --target-cpu-utilization: Set the average CPU usage at which to signal the autoscaler to schedule a new pod.
- --target-requests-per-second: Set the average HTTP(S) requests per second, per pod, at which to signal the autoscaler to schedule a new pod. Can be used in combination with --target-cpu-utilization.
- --cpu-request, --ram-request: Set the CPU/memory request values of the pods to adjust how they are scheduled on the cluster.
- --cpu-limit, --ram-limit: Set CPU/memory limits on the pods to limit the amount of resources they consume on the node.
- --gpu-limit: Number of GPUs allocated to each pod. Fractional GPU limits are not supported.
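For example, a server that scales between 1 and 5 pods based on CPU usage might be launched with something like the following sketch; the specific values, and the exact formats expected for the CPU and memory quantities, are illustrative assumptions:

$ spell server serve \
    --min-pods 1 \
    --max-pods 5 \
    --target-cpu-utilization 0.7 \
    --cpu-request 2 \
    --ram-request 4Gi \
    cifar:v1 predictor.py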
These values can be tuned after the fact using spell server update or the Spell web console.
Viewing model server logs
Any output to stdout or stderr within the model server will be logged. To retrieve the logs of a running model server, use either the spell server logs command or the server details page. Logs are broken up by pod.
Viewing and writing model server metrics
Certain hardware metrics (CPU usage and memory per pod) and request metrics (request rate, request failures, and latency) are logged for you automatically. These are viewable in the metrics section of the model details page.
You can also log user metrics from your model server:
from spell.serving import BasePredictor
import spell.serving.metrics as spell_metrics

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload):
        ...
        spell_metrics.send_metric("name", value)
These are logged to the "User Metrics" section of the model server details page.
User metrics in model servers are analogous to user metrics in Spell runs. For more details on how they work, refer to the Metrics page.
(Advanced) Using health checks
Spell model servers expose an additional health check endpoint at /health. By default this endpoint responds with {"status": "ok"} if the server is able to respond at all:
$ curl https://$REGION.$SPELL_ORG.spell.services/$SPELL_ORG/$MODEL_NAME/health
{"status":"ok"}
You can customize this behavior by providing a health method.
class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload):
        ...

    def health(self):
        ...
To indicate that the endpoint is unhealthy, either raise an error (we recommend using the Starlette HTTPException class), or return a Starlette response with a non-2XX status code.
Any other valid response from this endpoint will be interpreted as a 200 OK. The health method can return any of the types the predict method can.
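For example, a health check might verify that the model actually loaded. The check below is a sketch: self.model and load_model are hypothetical, and a real predictor would perform whatever check is meaningful for its own setup.

from starlette.exceptions import HTTPException
from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        self.model = load_model()  # hypothetical loading helper

    def predict(self, payload):
        ...

    def health(self):
        # Report the server as unhealthy if the model is missing.
        if self.model is None:
            raise HTTPException(status_code=503, detail="model not loaded")
        return {"status": "ok", "model_loaded": True}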
(Advanced) Handling non-JSON requests
As the section Creating model servers explains, Spell model servers default to JSON input. However, you can switch to using any request type supported by Starlette, our serving middleware, by adding a starlette.requests.Request type annotation to the payload parameter in predict:
from starlette.requests import Request

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload: Request):
        ...

    def health(self, request: Request):
        ...
Note that this feature is also available on the health endpoint.
Using this syntax, any parameter given the Request type annotation will receive the full Request.
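For example, a predictor that accepts raw bytes (say, an image posted as application/octet-stream) might look like the following sketch; decode_and_predict is a hypothetical helper, and the async method relies on the async support described later on this page:

from starlette.requests import Request
from starlette.responses import PlainTextResponse
from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        ...

    async def predict(self, payload: Request):
        # Read the raw request body instead of relying on JSON deserialization.
        raw = await payload.body()
        if not raw:
            return PlainTextResponse("empty request body", status_code=400)
        return {"prediction": self.decode_and_predict(raw)}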
Decorator syntax is also supported:
from spell.serving import with_full_request

class Predictor(BasePredictor):
    def __init__(self):
        ...

    @with_full_request(name="payload")
    def predict(self, payload):
        ...

    @with_full_request()
    def health(self, request):
        ...
By default, this decorator will pass the Request into a parameter named "request", but this can be overridden by providing a "name" argument to the decorator.
(Advanced) Launching background tasks
Because Spell model servers are designed to be queried online, request latency is typically important. But some of the work associated with handling a prediction request, such as logging to a model monitoring solution, isn't directly required for returning a prediction.
To remove this work from the critical path of the request, thereby decreasing the request latency, Spell model servers support Starlette background tasks. Background tasks allow you to schedule functions to be executed after the prediction response has been returned from the server. These tasks' functions can be either asynchronous or synchronous.
Note
For an example of this feature in action, see the serve_async.py entrypoint from our blog post "ML Observability with Spell and Arize".
This can be done using either type annotations or decorators.
Using type annotations:
from starlette.background import BackgroundTasks

async def some_task(foo, bar=1):
    ...

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload, tasks: BackgroundTasks):
        ...
        tasks.add_task(some_task, "my_foo", bar=3)

    def health(self, background: BackgroundTasks):
        ...
        background.add_task(some_task, "other_foo")
Using this syntax, any parameter given the BackgroundTasks type annotation will receive the BackgroundTasks object.
Alternatively, using decorators:
from spell.serving import with_background_tasks

async def some_task(foo, bar=1):
    ...

class Predictor(BasePredictor):
    def __init__(self):
        ...

    @with_background_tasks(name="bg_tasks")
    def predict(self, payload, bg_tasks):
        ...
        bg_tasks.add_task(some_task, "my_foo", bar=3)

    @with_background_tasks()
    def health(self, tasks):
        ...
        tasks.add_task(some_task, "my_foo", bar=3)
By default, this decorator will pass the BackgroundTasks into a parameter named tasks, but this can be overridden by providing a name argument to the decorator.
(Advanced) Async support
Spell model servers support asynchronous methods out of the box:
class Predictor(BasePredictor):
    def __init__(self):
        ...

    async def predict(self, payload):
        ...

    async def health(self, request):
        ...
Note
For an example of this feature in action, see the serve_async.py entrypoint from our blog post "ML Observability with Spell and Arize".
(Advanced) Adding Predictor configuration YAML
Configuration information can be provided to the Predictor as either a JSON or YAML file using the --config flag. For example, if you had a configuration named predict-config.yaml such as
use_feature_x: true
additional_info:
  output_type: bytes
Then you could modify the __init__ of the Predictor to accept this configuration by adding additional arguments:
class Predictor(BasePredictor):
    def __init__(self, use_feature_x, additional_info=None):
        self.use_feature_x = use_feature_x  # True
        self.additional_info = additional_info or {}  # {"output_type": "bytes"}
        ...
The serve command would now be:
$ spell server serve \
    --config /path/to/predict-config.yaml \
    cifar:v1 predictor.py
(Advanced) Enabling server-side batching
Note
For a detailed explanation of this feature's appeal, including performance benchmarks, refer to our blog post "Online batching with Spell serving".
Model servers can use online batching to unlock significant improvements in throughput. With batching enabled, a proxy is placed in front of your model server on each pod. The proxy batches incoming requests, waiting either for a configurable amount of time or until the batch reaches a configurable maximum size, before dispatching those requests to your model server.
Batching can be enabled for a model using the --enable-batching flag in the spell server serve command. This will use a maximum batch size of 100 and a request timeout of 500ms. These can be configured using the --max-batch-size and --request-timeout flags, respectively.
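For example, to cap batches at 32 requests and wait at most 200 milliseconds before dispatching a partial batch (the value formats shown here are assumptions):

$ spell server serve \
    --enable-batching \
    --max-batch-size 32 \
    --request-timeout 200 \
    cifar:v1 predictor.py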
Model servers without batching take a dict as input. Model servers with batching enabled receive a list of dict objects instead. To use a model server with batching enabled, update the signature of the predict function appropriately. For example:
from starlette.responses import PlainTextResponse
from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload):
        # payload is a list of dicts; return one prediction per request, in order.
        return [do_predict(item) for item in payload]
The return type must be an iterable of types supported by the unbatched model server. To ensure that the predictions are routed to the correct requests, the order of the predictions must match the order of the requests in the batch.
(Advanced) Using server-side batching with Starlette requests
The predict method in batch-enabled Predictors cannot use the full Starlette Request object. To access this object, you can add a prepare method to your predictor.
The prepare method has the same signature as the unbatched Predictor's predict method, including the ability to access the full Request object and spawn background tasks.
from starlette.responses import PlainTextResponse
from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def prepare(self, payload):
        if not self.validate_payload(payload):
            return PlainTextResponse("Bad data sent!", status_code=400)
        return self.extract_data_from_payload(payload)

    def predict(self, payload):
        data = self.transform(payload)
        return self.do_predict(data)

    ...
The prepare method should be used to do computationally inexpensive tasks such as reading and validating the request. If a Response object with an error status code is returned from prepare, the response will be immediately returned to the client, and the request will not be batched.
The prepare method must return a type serializable as a Python pickle.
Note
The balance of computational work done in the prepare versus the predict methods has a significant impact on overall throughput. It should be considered a tunable parameter.