We’ve recently released a new feature at Spell: online batching. Online batching is a model serving technique that dynamically batches requests from multiple clients. Using online batching in conjunction with a few properly tuned batching parameters can unlock massive per-pod throughput for your models.
For example, the figure below indicates that an unbatched ResNet50 model server can handle no more than 40 requests per second in our load tests. With batching enabled and an appropriate request timeout parameter set, it can handle about 190 requests per second before the request latency exceeds 100 ms.
Batching multiple inference queries usually results in significantly improved performance because for most models, latency per inference decreases as batch size increases. Below is a figure from a 2017 article "An Analysis of Deep Neural Network Models for Practical Applications" displaying per-image inference time as a function of batch size for a variety of models:
However, if multiple clients are sending individual requests to a model server, it’s not easy to batch requests before sending them to a model, so requests are instead handled serially. This is an inefficient use of hardware, especially if GPUs are used to do inference.
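To build intuition for why per-inference latency falls as batch size grows, consider a simple cost model (the numbers below are purely illustrative, not measurements): each forward pass pays a fixed overhead (kernel launches, data transfer) plus a per-item cost, so the fixed overhead is amortized across the batch.

```python
# Illustrative cost model with hypothetical numbers (not measurements):
# a forward pass costs a fixed overhead plus a per-item cost.
FIXED_OVERHEAD_MS = 8.0   # kernel launches, data transfer, etc.
PER_ITEM_MS = 0.5         # marginal cost of one extra image in the batch

def per_image_latency_ms(batch_size):
    """Per-image latency: the fixed overhead is amortized over the batch."""
    return FIXED_OVERHEAD_MS / batch_size + PER_ITEM_MS

for b in [1, 4, 16, 64]:
    print(f"batch={b:3d}: {per_image_latency_ms(b):.2f} ms/image")
```

Under this model, per-image latency falls from 8.5 ms at batch size 1 toward the 0.5 ms per-item floor as the batch grows, which matches the qualitative shape of the curves in the figure above.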
When batching is enabled for a model server, Spell places a single-process proxy in front of it. If a prepare method has been implemented on the Predictor, the proxy forwards each request to that prepare method; the return value is serialized (using Python's pickle library) and sent back to the proxy. If the response is an error, it is returned to the client immediately. Otherwise, the prepared response is unpickled and placed onto a queue. If no prepare method has been implemented, the proxy deserializes the JSON body of the request and places it onto the queue.
This is where batching kicks in. When either (1) the queue reaches a specified maximum size, or (2) a specified amount of time has passed since the oldest queued request was received by the proxy, the contents of the queue are sent to the model server as a pickled list of objects. This batch is then unpickled and sent to the Predictor's predict method. The predict method returns a list of predictions, and those predictions are returned to their respective clients. The maximum batch size and the request timeout can be configured using the --max-batch-size and --request-timeout parameters, respectively, on the spell server serve command.
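The proxy's flush conditions can be sketched roughly as follows. This is a simplified, synchronous illustration (the real proxy is a separate asynchronous process, and the class and method names here are hypothetical):

```python
import time

class BatchingQueue:
    """Sketch of the proxy's flush logic: a batch is sent when either the
    queue reaches max_batch_size, or request_timeout_s has elapsed since
    the oldest queued request arrived."""

    def __init__(self, max_batch_size, request_timeout_s, send_batch):
        self.max_batch_size = max_batch_size
        self.request_timeout_s = request_timeout_s
        self.send_batch = send_batch  # callable: list of prepared requests -> None
        self.queue = []
        self.oldest_arrival = None

    def enqueue(self, prepared_request, now=None):
        now = time.monotonic() if now is None else now
        if not self.queue:
            self.oldest_arrival = now
        self.queue.append(prepared_request)
        self.maybe_flush(now)

    def maybe_flush(self, now=None):
        now = time.monotonic() if now is None else now
        if not self.queue:
            return
        full = len(self.queue) >= self.max_batch_size
        timed_out = now - self.oldest_arrival >= self.request_timeout_s
        if full or timed_out:
            batch, self.queue = self.queue, []
            self.send_batch(batch)  # in Spell, a pickled list sent to predict()

# Simulate a 3-request max batch with a 25 ms timeout, using fake clock times
batches = []
q = BatchingQueue(max_batch_size=3, request_timeout_s=0.025, send_batch=batches.append)
q.enqueue({"img": "a"}, now=0.000)
q.enqueue({"img": "b"}, now=0.010)
q.enqueue({"img": "c"}, now=0.020)   # queue is full: flushes a batch of 3
q.enqueue({"img": "d"}, now=0.030)
q.maybe_flush(now=0.060)             # 30 ms elapsed > 25 ms timeout: flushes
print(batches)  # [[{'img': 'a'}, {'img': 'b'}, {'img': 'c'}], [{'img': 'd'}]]
```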
We will demonstrate the effectiveness of Spell's online batching feature using a ResNet50 model. The code to follow along with this example is in the spellml/examples repository.
Because we don't need any transfer learning to do the load test, we can simply download the pretrained model from Keras and save the file:
```python
# make_resnet50.py
from keras.applications.resnet50 import ResNet50

model = ResNet50(weights='imagenet')
model.save("model.h5")
```
To create a model, we can first execute a run via spell run python make_resnet50.py. Once the run has completed, we can create a model from it by running spell model create --file model.h5 resnet50 runs/<RUN_ID>. A standard, unbatched Predictor for this model can be written as:
```python
# predictor.py
from base64 import b64decode
from io import BytesIO
import os

from keras.preprocessing.image import img_to_array
from keras import backend as K
from keras.applications.resnet50 import decode_predictions
import numpy as np
from PIL import Image
import tensorflow as tf

from spell.serving import BasePredictor

MODEL_PATH = os.environ.get("MODEL_PATH", "./model.h5")


class Predictor(BasePredictor):
    def __init__(self):
        # Allow GPU memory to be dynamically allocated to allow multiple
        # model server processes
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True
        self.graph = tf.get_default_graph()
        # Construct a TensorFlow session to use for the model server
        with self.graph.as_default():
            self.sess = tf.Session(config=config)
            K.set_session(self.sess)
            # Load model into memory
            self.model = tf.keras.models.load_model(MODEL_PATH)

    # Payload should have an "img" key with a base64-encoded image
    def predict(self, payload):
        image = b64decode(payload["img"])
        x = self.transform_image(image)
        predictions = self.do_predict(x)
        return predictions

    def transform_image(self, image):
        # Resize and scale image
        image = Image.open(BytesIO(image))
        img_resized = image.resize((224, 224), Image.ANTIALIAS)
        scaled_img = img_to_array(img_resized).astype("float32") / 255
        # Wrap image as a np.array
        return np.expand_dims(scaled_img, axis=0)

    def do_predict(self, x):
        with self.graph.as_default():
            pred = self.model.predict(x)
            classes = decode_predictions(pred)
        return [y for y in classes]
```
In this Predictor, the predict method accepts a single argument, payload, which contains a field named img holding a base64-encoded image. To convert this into a batched Predictor, we modify it to accept a list of JSON payloads and return a list of class predictions. We can add another Predictor class in the same file that extends the unbatched Predictor class:
```python
class BatchPredictor(Predictor):
    def predict(self, payload):
        imgs = [self.transform_image(b64decode(data["img"])) for data in payload]
        # Each transformed image has shape (1, 224, 224, 3); stack them
        # into a single (batch_size, 224, 224, 3) array
        x = np.vstack(imgs)
        predictions = self.do_predict(x)
        return [pred for pred in predictions]
```
In this batch-enabled Predictor, a default prepare method is created for you, which does nothing but extract the JSON body of the HTTP request and transform it into a dictionary. We can optimize further by moving a few lines of the base64-to-image transformation into an explicit prepare method. The division of computation between the prepare and predict methods should be treated as a tunable parameter: in our experiments, moving the entire transform_image method into prepare actually performed worse than the unbatched Predictor, while restricting prepare to only decoding the image into bytes performed well.
```python
class BatchPredictor(Predictor):
    def prepare(self, payload):
        return b64decode(payload["img"])

    def predict(self, payload):
        imgs = [self.transform_image(data) for data in payload]
        # Stack the (1, 224, 224, 3) images into a (batch_size, 224, 224, 3) array
        x = np.vstack(imgs)
        predictions = self.do_predict(x)
        return [pred for pred in predictions]
```
We deploy these models onto a Kubernetes node group whose nodes each have a single V100 GPU on AWS. If you haven’t created that node group yet, you can run spell cluster node-group add --name v100 --instance-type p3.2xlarge. To create the model server using the unbatched Predictor, you can run:
```shell
$ spell server serve \
    --pip 'h5py<3' \
    --name resnet50 \
    --node-group v100 \
    --env MODEL_PATH=/model/model.h5 \
    --classname Predictor \
    resnet50:v1 \
    predictor.py
```
And to create a model server using the batched predictor with a request timeout of 25 milliseconds, you can run:
```shell
$ spell server serve \
    --pip 'h5py<3' \
    --name resnet50 \
    --node-group v100 \
    --env MODEL_PATH=/model/model.h5 \
    --classname BatchPredictor \
    --request-timeout 25 \
    resnet50:v1 \
    predictor.py
```
(The h5py version restriction is required because, at the time of writing, a bug introduced in h5py version 3 prevents loading Keras models in this way.)
To load test these model servers, you can run the modelservers/resnet50/loadtest.py script in the spellml/examples repository. This script sends images to the model server at a given request rate, recording the latency of the requests, then slowly increases the rate until the latency exceeds a specified limit. Your local network connection might be a bottleneck at high request rates, so it’s recommended to use a Spell run to get more accurate and consistent results:
```shell
$ spell run \
    --pip aiohttp \
    --pip dataclasses \
    --label loadtest \
    -- \
    python loadtest.py \
      --hold-seconds 10 \
      --procs 10 \
      --url <URL_TO_YOUR_MODEL_SERVER> \
      --name unbatched \
      --out-dir ./loadtest \
      --rates 10:1000:10 \
      --latency-limit 1000 \
      --img-path cat.jpeg

$ spell run \
    --pip aiohttp \
    --pip dataclasses \
    --label loadtest \
    -- \
    python loadtest.py \
      --hold-seconds 10 \
      --procs 10 \
      --url <URL_TO_YOUR_BATCHED_MODEL_SERVER> \
      --name batched-t25 \
      --out-dir ./loadtest \
      --rates 10:1000:10 \
      --latency-limit 1000 \
      --img-path cat.jpeg
```
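The ramp-up logic of the load test can be sketched as follows. This is a simplified synchronous illustration with a stubbed-out request function; the real loadtest.py uses aiohttp across multiple processes, and all names here are hypothetical:

```python
import statistics

def ramp_load_test(send_request, rates, latency_limit_ms, requests_per_rate=50):
    """Try each request rate in order; stop once median latency exceeds the limit.

    send_request(rate) should issue one request at the given offered rate
    and return its latency in milliseconds.
    """
    results = {}
    for rate in rates:
        latencies = [send_request(rate) for _ in range(requests_per_rate)]
        median = statistics.median(latencies)
        results[rate] = median
        if median > latency_limit_ms:
            break  # server saturated at this rate
    return results

# Stub server model: latency is flat until capacity, then blows up.
def fake_server(rate, capacity=50, base_latency_ms=26):
    return base_latency_ms if rate <= capacity else 2000

results = ramp_load_test(fake_server, rates=range(10, 1000, 10), latency_limit_ms=1000)
print(results)  # {10: 26, 20: 26, 30: 26, 40: 26, 50: 26, 60: 2000}
```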
You should see logs like:
```
Warming and Calibrating...
Testing until latency exceeds 1000ms...
Est. Median Latency for 10.0 req/s: 26ms
Est. Median Latency for 20.0 req/s: 26ms
Est. Median Latency for 30.0 req/s: 26ms
Est. Median Latency for 40.0 req/s: 26ms
Est. Median Latency for 50.0 req/s: 748ms
Got bad response! 503, message='Service Unavailable', url=URL('<URL_TO_YOUR_MODEL>')
Got bad response! 503, message='Service Unavailable', url=URL('<URL_TO_YOUR_MODEL>')
Got bad response! 503, message='Service Unavailable', url=URL('<URL_TO_YOUR_MODEL>')
Got bad response! 503, message='Service Unavailable', url=URL('<URL_TO_YOUR_MODEL>')
Est. Median Latency for 60.0 req/s: 2227ms
Median Latency exceeded 1000ms. Ending trial. Got to 60.0 req/s
Consolidating data...
DONE!
```
Once the runs have completed, you can gather your latency logs using a Spell workspace:
```shell
$ spell jupyter \
    --lab \
    --mount runs/<UNBATCHED_RUN_ID>/loadtest/unbatched/consolidated.csv:unbatched.csv \
    --mount runs/<BATCHED_RUN_ID>/loadtest/batched-t25/consolidated.csv:batched-t25.csv \
    --github-url https://github.com/spellml/examples \
    loadtest
```
The notebook to visualize your results is available at modelservers/resnet50/loadtest.ipynb. In the experiment displayed below, we ran both unbatched and batched model servers with request timeouts of 200, 100, 50, and 25 milliseconds.
The shaded region on this plot represents the 20th to 80th percentiles. As you can see, a request timeout of 200 milliseconds is in many ways worse than having no batching at all, but a request timeout of 25 milliseconds performs significantly better.
Additionally, you can use the --num-processes parameter to specify the number of processes running your model server. When a node group has available GPUs and the --gpu-limit parameter is not set, the model server pod defaults to a GPU limit of 1, and the default number of model server worker processes equals that GPU limit. In the examples above, our model server therefore ran one worker process, plus another process for the proxy. If we use --num-processes to bump the number of worker processes to 2, we get even better performance.
Batching can unlock significant per-pod throughput improvements on model servers, but it requires some experimentation with four parameters: the maximum batch size, the request timeout, the number of processes, and the distribution of computation between the prepare and predict methods.