Model observability in Spell model servers with Prometheus and Grafana

One of the first things you think about as you transition a model server from experimentation to production is monitoring. This is complicated by the realities of machine learning: deep learning models are among the most complex algorithms in common use today. Instrumenting your model server so that you can see and understand all aspects of your service—response time, prediction accuracy, prediction confidence, input data trends—is a complicated task.

Luckily Spell model serving users can rely on two tried-and-tested open source tools for these needs: Prometheus and Grafana. Every Spell model serving cluster is provisioned with a Prometheus server and a Grafana dashboard as part of its deployment; leveraging these technologies is key to building observable, instrumented model servers on Spell.

In this blog post we will learn how to use this stack to query model server performance, focusing on understanding how to write, log, and query Prometheus metrics using Grafana on Spell. In a future blog post we will additionally cover Grafana dashboards and Grafana alerts.

Note that this is an advanced topic. This blog post assumes you are familiar with how Spell model servers work!

Prometheus and Grafana—the basics

Before we can leverage these technologies, we must first understand a little bit about how they work. Note that these are very complicated and powerful tools, so we'll be skimming over a lot of details here.

The backend component is Prometheus. Prometheus is an open source NoSQL database. More specifically, it is what is called a time-series database—a database designed for sequential data with timestamps attached. This type of data has a specific compact format (usually a timestamp key and integer or float value) and a set of common query patterns (percentiles, etcetera). The Prometheus backend takes advantage of these characteristics to make storage efficient and querying fast.

Prometheus was developed internally at SoundCloud starting circa 2012, then open sourced in 2015. At present it is a graduated project of the Cloud Native Computing Foundation. Prometheus is optimized for one very specific and widespread use case: storing and querying aggregated deployment metrics over time. Examples of data commonly stored in Prometheus: total requests, requests over time, request latency, CPU utilization, healthcheck status, deployment uptime. This data is used by engineering organizations for service monitoring, alerting (alongside a pager service, e.g. PagerDuty), and debugging (alongside a logs service, a la ELK).

Prometheus is in many ways an iteration on an idea that started with the open-sourcing of StatsD in 2011. StatsD and Prometheus are still the two most commonly-used open source solutions for aggregated metrics logging. This blog post discusses the differences between the two, and for the mathematically inclined there is an excellent StrangeLoop talk from 2013 that explains how metric aggregation works and why it's compelling.

Prometheus works by scraping a so-called metrics endpoint (typically on the /metrics URL path) at set intervals (15s is the default; at Spell we use a default of 1s instead). Each service Prometheus is monitoring is expected to serve data for it to ingest, and the data is expected to be provided in a simple human-readable newline-delimited format:

go_gc_duration_seconds{quantile="0"} 7.48e-05
go_gc_duration_seconds{quantile="0.25"} 0.0001661
go_gc_duration_seconds{quantile="0.5"} 0.000329
go_gc_duration_seconds{quantile="0.75"} 0.000537
go_gc_duration_seconds{quantile="1"} 0.003741
go_gc_duration_seconds_sum 0.0249316
go_gc_duration_seconds_count 51

The front-end component is Grafana. Grafana is a visualization and dashboarding suite, designed to fit a variety of data sources but with especially good built-in support for Prometheus in particular. For readers familiar with logging infrastructure: Grafana is to Prometheus what Kibana is to Elasticsearch.

Grafana (unlike, for the most part, Prometheus) is an interactive tool, designed to be played with in a web browser. The two most fundamental pages its web UI exposes are the Explore page and the Dashboards page. The Explore page lets you write free-form queries for data sources against your Prometheus database—using PromQL, Prometheus's specialized query language—and visualize the results as a graph:

The Dashboards page, meanwhile, is Grafana's "killer app". This page lets you build interactive dashboards, using the various widgets the Grafana SDK provides and data hosted in Prometheus:

To learn more, I recommend trying out the Grafana Fundamentals quickstart. This tutorial walks through standing up an example Prometheus and Grafana instance in your local environment (using Docker) and demonstrates some example workflows using the software.

Logging into Grafana for the first time

The first step to using the model servers feature of Spell is creating a model serving Kubernetes cluster within your Spell cluster. This is done via a spell kube-cluster create command. This command creates a default node group (a node group on AWS EKS, or a node pool on GCP GKE) and launches a few services on it. Among these services are a Prometheus server instance and a Grafana instance.

Model serving clusters can be deployed in private or public mode. Clusters deployed in public mode—the default—can access the Grafana dashboard on the web.

At this point, to follow along in code, make sure that you have (1) created a model serving cluster in your Spell organization and (2) launched a model server on that cluster. For a simple example model server that you can spin up and run yourself, check out the Using Model Servers example in the Spell quickstart.

When you visit the details page for a specific model server, you'll see a link to your Grafana instance under Custom Metric Visualization (Grafana):

This URL can also be fetched using the CLI (using spell server describe or spell server grafana). It will always point to a path in the domain. Clicking on it will take you to the Grafana login pane:

The login username will be your cluster name. The password is a secret specific to your cluster—you can set your own password at model serving cluster create time, or choose to have Spell generate a random one for you. Regardless of which route you go, you can retrieve the password by running spell kube-cluster get-grafana-password (or reset it using spell kube-cluster reset-grafana-password).

It's important to note that public serving clusters expose all of their endpoints—model server endpoints and Grafana dashboard alike—to the public Internet. In theory, anyone with the correct link could attempt to hit these endpoints, so keeping your password secure and your server endpoint authenticated is your responsibility.

Model serving clusters deployed in private mode have the advantage that none of these endpoints are visible on the public Internet, as network access will now be locked down to machines within your cloud VPC. However, this also means that you will no longer be able to access your Grafana instance easily either! If you are using a private model serving cluster and need access to Grafana, you will need to do significant additional work configuring the Spell VPC to enable access from your developer machines.

The blog post "Model server authentication patterns on Spell" discusses this and other tradeoffs between public and private model serving clusters in more detail. Refer to that article to learn more.

Writing some example data to Prometheus

Now that we know how to access our model serving cluster's Grafana instance, we are ready to send some metrics to Prometheus that we can query with Grafana.

The Spell Python SDK has two separate metrics APIs.

The simpler of these two APIs is the spell.metrics module, which exposes a send_metric function. This API is used to log Spell metrics, and has nothing to do with the Spell Grafana integration. This is one of the core features of Spell, but it can only be used in Spell runs (not model servers) for reasons we’ll get to shortly. Refer to the "Metrics" page in the Spell docs to learn more.

The API for interacting with Spell Prometheus is the spell.serving.metrics module. As you might infer from its name, this is a serving-only API. It exposes its own version of the send_metric function, and it's this version of this function that you use to log your model server's data with Prometheus.

Consider the following trivial model server entrypoint script:

from datetime import datetime
import json

from spell.serving import BasePredictor
import spell.serving.metrics as m

class Predictor(BasePredictor):
    """ Simple metrics example. """
    def __init__(self):
        pass
    def predict(self, payload):
        now = datetime.now()
        minutes_since_hour = (
            now - now.replace(minute=0, second=0, microsecond=0)
        ).total_seconds() / 60.0

        # send a metric with a tag
        m.send_metric("minutes_since_hour", minutes_since_hour, tag=str(now.hour))

        return json.dumps({"response": minutes_since_hour})

Every time this model server is hit, it uses the Python datetime module to compute the number of minutes elapsed since the top of the current hour and records that value via send_metric. It then responds with a JSON copy of this data to the end user.

The first time this model server is called, the Spell client will, upon calling send_metric, add this new minutes_since_hour metric to the list of metrics advertised on the model server's /metrics endpoint. This newly created metric will be what Prometheus refers to as a gauge: basically, a metric value that can go up or down (there is an even simpler counter type that must always go up, and much more complicated histogram and summary types that do something completely different). Every subsequent call to the model server will update this entry to a new value.

Recall how Prometheus works: by scraping the /metrics endpoint for fresh data once per second. At scrape time, whatever value was most recently written to memory by send_metric is what will be logged to Prometheus.

It's very important to understand that not every call to spell.serving.metrics.send_metric will show up in Prometheus! If this model server instance receives several requests within that one-second window, only the last write will have its data stored in the database. Similarly, if no requests are made within that window, whatever data already existed in memory will be logged again. Decoupling reading from writing in this manner is what allows Prometheus to be an effective tool when request volume gets to be very high.
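To make this last-write-wins behavior concrete, here is a toy sketch of a gauge. This is purely illustrative—it is not Spell's or the Prometheus client's actual implementation—but it captures the semantics described above:

```python
import threading

class ToyGauge:
    """Toy gauge: a scrape only ever sees the most recently written value."""
    def __init__(self):
        self._value = 0.0
        self._lock = threading.Lock()

    def set(self, value):
        # called by send_metric on every request
        with self._lock:
            self._value = value

    def collect(self):
        # simulates Prometheus scraping the /metrics endpoint
        with self._lock:
            return self._value

g = ToyGauge()

# three requests arrive between two scrapes; only the last write survives
for value in (12.5, 13.1, 14.7):
    g.set(value)
assert g.collect() == 14.7  # 12.5 and 13.1 were never scraped

# no requests arrive before the next scrape: the stale value is reported again
assert g.collect() == 14.7
```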

Compare this with the spell.metrics.send_metric API. spell.metrics.send_metric sends every value you log to Spell. But it has an important caveat: in order to limit network burden, this function is rate-limited to one value per second per metric name per model server (and 50 metric names max).

As a result, if model servers could use the spell.metrics.send_metric API, they would be very likely to lose writes! Prometheus, by contrast, has no such limit. Model servers can serve a large amount of traffic and still rely on the spell.serving.metrics.send_metric API to behave as expected.

Reading that data back from Grafana

Now that we know how to use the Spell Python API to write some data to Prometheus, the next step is reading that data back out using Grafana.

Spell metrics are stored in Prometheus using a spell:custom prefix (to avoid naming collisions). So our example metric, minutes_since_hour, will be stored in Prometheus as a time series with the name spell:custom:minutes_since_hour.

After hitting the model server with curl a few times, go ahead and plug this query into Grafana using the "Explore" tab:

Huh, if we're querying for one metric, why are we getting a table with this many results back?

Every metric in Prometheus accepts an arbitrary number of key-value pairs called labels. When you send a metric to it using the spell.serving.metrics.send_metric API, a set of labels describing the metric's creator are automatically attached to it. Every unique combination of such labels is stored as a separate time series in the Prometheus database service.

It's important to remember that model servers are highly concurrent. Every model server instance is load balanced across 1 or more (n) different Kubernetes pods; every pod runs 1 or more (k) different worker processes. As a result, your one metric will break down into n x k different time series! Actually n x k x m, where m is the number of tags (something we’ll get to in a second).

In order to interpret these results, we start by looking at the service field, which contains the model server ID. The model server ID is a (unique, monotonically increasing) integer ID that each model server instance is assigned at create time. The model server ID is part of the Custom Metric Visualization (Grafana) field on the model server summary page in the web console; you can also get it using the Spell CLI (by running spell server describe) or the Spell Python API (by inspecting the ModelServer object returned by spell.servers.get()).

Our model server of interest has the ID 10480, and hence the service value model-serving-10480. In this example case, there is another server running the same code with the ID 10379. Grafana queries are built using PromQL, Prometheus's specialized query language. To filter by label value, append {label="value"} to the PromQL query:
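Using the metric and server ID from this example, the filtered query looks like this (substitute your own server's ID):

```promql
spell:custom:minutes_since_hour{service="model-serving-10480"}
```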

The next highest level of scaling is the pod. Recall that every Spell model server is executed across one or more (n) Kubernetes pods. Pods are how Kubernetes manages scaling and self-healing. Pods may be created by changes in the model server scaling configuration, autoscaling events, rollover events, pod restarts, or self-healing events. Pods may be destroyed by similar means.

Thus, the exact number of pods Prometheus is aware of will be a detail of the operational history of the model server. Model servers that are updated frequently or autoscale frequently will report significantly more pods than servers that are "static" and don't change. In this trivial example, the model server has had one and only one pod for the entirety of its history: model-serving-10480-6d8cb8888f-sgrmc.

The last level of scaling is the pid. Every model server pod is responsible for one model server container (what ultimately handles the request), plus some additional containers that handle e.g. hardware metrics logging that we won't discuss too much here. That model server container receives load-balanced traffic from Kubernetes, which it itself load-balances that traffic across some number of worker processes. This is a concurrency paradigm known as multiprocessing, and it's a standard optimization for highly scalable systems. Each worker process is a fully-fledged UNIX process in its own right. Every process in UNIX has its own integer-valued process ID (PID), and that value is what is reported by the pid label.

Finally, there is one more label you need to be aware of: spell_tag. The spell_tag is an optional (default empty) label whose contents may be set by using the tag parameter on spell.serving.metrics.send_metric. Looking at the code again:

m.send_metric("minutes_since_hour", minutes_since_hour, tag=str(now.hour))

So in our example case, tag (and thus spell_tag) is set to the current hour, as reported by datetime, at the moment the server is called. We expect this to be a value between 0 and 23—and if we inspect the contents of the spell_tag field in Grafana, we see that is indeed the case!

Slicing and dicing these label parameters is how you select the subset of your logged metrics that you are interested in. As a trivial example, consider the following PromQL query, which returns the largest metric value that got logged for this server (it returns 60):

max by (handler) (
  spell:custom:minutes_since_hour{service="model-serving-10480"}
)
The PromQL query language is expressive but also fairly complex, and takes some getting used to. For a high-level view of how to use it, I recommend reading the Querying Basics page in the Prometheus docs.

Writing more complex data to Prometheus

Prometheus has a concept of metric types. Recall that the spell.serving.metrics.send_metric function assumes that the data you are sending is of a specific type: one called a gauge.

In truth, a gauge is rarely what you want. Most of the interesting attributes of a model you are monitoring, such as model confidence or model server response time, are statistics you want to aggregate across multiple requests, not sample sequentially.

Prometheus supports three other metric types: counters, histograms, and summaries. The latter two of these are particularly useful, as they allow you to calculate, log, and then later query streaming statistics over input data over time.

You can log these metric types within a Spell model server by using the Prometheus Python client library, prometheus/client_python (GitHub link), directly. A properly-parameterized version of this library is exposed within the spell.serving.metrics module namespace under the name prometheus.

The following example script logs a pair of histogram metrics every time the model endpoint is hit: inference_time and prediction_confidence.

import json
import random
import time
import os

from spell.serving import BasePredictor
import spell.serving.metrics as m

class Predictor(BasePredictor):
    """More complex metrics example. """
    def __init__(self):
        self.pid = str(os.getpid())
        self.inference_time_hist = m.prometheus.Histogram(
            'inference_time', 'Model inference time',
            labelnames=['pid'], labelvalues=[self.pid],
        )
        self.prediction_confidence_hist = m.prometheus.Histogram(
            'prediction_confidence',
            'Model (self-reported) prediction confidence',
            buckets=[round(0.1 * x, 1) for x in range(0, 11)],
            labelnames=['pid'], labelvalues=[self.pid],
        )

    def predict(self, payload):
        prediction_time = 0.5 + ((random.random() - 0.5) * 0.2)

        with self.inference_time_hist.time():
            # simulate an inference event, randomly takes 0.4 to 0.6 seconds
            time.sleep(prediction_time)

        # simulate model confidence, randomly in the range 0.7 to 0.9
        prediction_confidence = 0.8 + ((random.random() - 0.5) * 0.2)
        self.prediction_confidence_hist.observe(prediction_confidence)

        return json.dumps({"response": payload})

inference_time is a m.prometheus.Histogram that reports simulated model inference time (always in the range (0.4, 0.6)). Because this metric does not set buckets, the client will use the default buckets (which Prometheus advertises is appropriate for request-type metrics): .005, .01, .025, .05, .075, .1, .25, .5, .75, 1.0, 2.5, 5.0, 7.5, 10.0, INF. This metric uses the timing wrapper function built into this client library, Histogram.time, to do the actual observation.

prediction_confidence is a m.prometheus.Histogram that reports simulated model prediction confidence (always in the range (0.7, 0.9)). This metric does not use the default buckets: instead it passes a decile range to the buckets parameter, e.g. [0.0, 0.1, ..., 1].
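As a sanity check, the bucket expression from the constructor evaluates to exact decile boundaries (the Prometheus client also appends an implicit +Inf bucket on top of whatever you pass):

```python
# evaluate the buckets parameter used in the example above
buckets = [round(0.1 * x, 1) for x in range(0, 11)]
assert buckets == [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
```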

Additional labels may be passed to the metric using the combination of labelnames and labelvalues parameters. Recall that Prometheus metrics created using Spell's spell.serving.metrics.send_metric API will set service, pod, and pid labels. Prometheus metrics created using the Prometheus Python client will still set service and pod values, but they will be missing the pid. Luckily it is easy to find, and set, the process PID yourself, using os.getpid() and these two parameters.

After launching this model server and simulating some traffic to it, we can query Grafana to see some results. Let's look at inference_time. Every histogram yields *_count, *_bucket, and *_sum time series. Querying it via inference_time_bucket{job="model-serving-10487"}:

You should immediately notice that Prometheus histograms are cumulative: each additional value that gets logged to a bucket increments the total value of that bucket by one. This makes sense when you remember that Prometheus is a streaming logger!

Side note: you may notice that some model server processes run much hotter than others. This is expected: load balancing across processes is performed by the OS, which in this case doesn't use naive round-robin but instead prefers one process over the others. See the following GitHub issue for more information.

The buckets themselves are enumerated using the le label; le stands for "less than or equal to". Buckets in Prometheus always start at zero: so in this case the buckets are [0, 0.005], [0, 0.01], etcetera. As a result, each subsequent bucket counts everything in the buckets before it, plus its own observations, up to the last bucket ([0, +inf] in this case), which is simply a total.
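Here is a toy sketch of how individual observations map onto cumulative le buckets. This is illustrative only—the real client library maintains these counters for you:

```python
def to_cumulative_buckets(observations, bounds):
    """Count observations into cumulative 'le' buckets, Prometheus-style."""
    counts = {}
    for le in bounds + [float("inf")]:
        # each bucket counts every observation <= its upper bound
        counts[le] = sum(1 for v in observations if v <= le)
    return counts

# five simulated inference times (seconds)
obs = [0.42, 0.48, 0.51, 0.55, 0.59]
counts = to_cumulative_buckets(obs, [0.25, 0.5, 0.75, 1.0])
assert counts[0.25] == 0           # nothing finished in under 0.25s
assert counts[0.5] == 2            # 0.42 and 0.48
assert counts[0.75] == 5           # every observation
assert counts[float("inf")] == 5   # the +Inf bucket is always the total
```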

To get the data we want—the actual percentage of requests within a bucket quantile within a certain interval—we need to apply a PromQL transform to this data. For histograms, this is typically done using the histogram_quantile function. This is discussed in detail in the Prometheus page on histograms. Here it is applied to our example model (with a 5 minute sliding window):
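For reference, a query of roughly this shape—shown with the metric name from our example, though the exact aggregation you want may differ—computes the 95th-percentile inference time over a 5-minute sliding window, summing across pods and processes:

```promql
histogram_quantile(0.95, sum by (le) (rate(inference_time_bucket[5m])))
```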

As you can see, this model server experienced some downtime (due to a rolling code update), during which no values got logged. When data did get logged, we see that inference_time fell within the [0, 0.5] bucket roughly half the time (which is what we expect—recall that our simulated inference time is a value in the range (0.4, 0.6) centered on 0.5).

Querying prediction confidence (prediction_confidence_bucket) works similarly:


That concludes our exploration of Prometheus and Grafana. Hopefully you now have a firm idea of how these technologies work, both in general and with Spell model servers in particular, and are ready to start using them to bring observability to your own machine learning model deployments.

In future articles we will explore using Grafana for dashboarding and alerting, as well as some of the system-level metrics that Grafana gives you access to and how to use them.

In the meantime, any Spell for Teams users that have any questions about this feature should reach out to us for more information. Happy training!
