For machine learning teams training and deploying on the cloud, the key to keeping costs reasonable is making smart use of spot instances.
For those unfamiliar, spot instances are otherwise-idle machines that cloud providers rent out at a discount of roughly 66%. This is massively cheaper than the default on-demand instances. We've written about spot instances on this blog before. For example, in Reduce cloud GPU model training costs by 66% using spot instances, I provided gut checks on cost savings, such as the fact that using spot instances drops the cost of a 24-hour training job on a V100x4 (p3.8xlarge) from $293.76 to $88.
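As a quick sanity check on those numbers, the short sketch below converts the two 24-hour totals quoted above into hourly rates and a discount percentage. The variable names are purely illustrative. For this particular example the discount comes out at about 70%, a little better than the rough 66% rule of thumb.

```python
# 24-hour job costs quoted above for a V100x4 (p3.8xlarge).
full_price_24h = 293.76
spot_price_24h = 88.00
hours = 24

full_hourly = full_price_24h / hours
spot_hourly = spot_price_24h / hours
discount = 1 - spot_price_24h / full_price_24h

print(f"full price: ${full_hourly:.2f}/hour")
print(f"spot price: ${spot_hourly:.2f}/hour")
print(f"discount:   {discount:.0%}")
```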
However, spot instances come with an extremely important catch. Unlike on-demand instances, which are yours for as long as you're willing to pay for them (modulo hardware failure), a spot instance can be reclaimed by the cloud provider at any time. Training on spot instances (or doing anything else, really) requires accepting the risk that this may occur and engineering around it (see Making model training scripts robust to spot interruptions for a primer on how this is done).
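To give a flavor of what "engineering around it" can look like, here is a minimal checkpoint-and-resume sketch in plain Python. It is not Spell's implementation or the approach from the linked primer. The function names, the JSON checkpoint format, and the `stop_after` parameter (which simulates an interruption) are all assumptions made for illustration, and the "training" step is a stand-in.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename, so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "loss_history": []}


def train(path, total_epochs, stop_after=None):
    """Toy training loop; stop_after simulates a spot interruption after N epochs."""
    state = load_checkpoint(path)
    done_this_run = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # simulated interruption: progress so far is already on disk
        state["loss_history"].append(1.0 / (state["epoch"] + 1))  # stand-in for real work
        state["epoch"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
    return state
```

Calling `train(path, 5, stop_after=2)` and then `train(path, 5)` with the same path picks up at epoch 2 rather than starting over, which is the property a spot-friendly training script needs.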
But how much risk is it really? The cloud providers don't say.
Since we introduced the spot instance feature in June 2019, Spell users have executed over 12,000 spot runs on AWS machines on our platform. Our unique position as an intermediary platform provider gives us pretty good insight into spot instance behavior, which we'll be sharing in this article!
Note that this article focuses on AWS exclusively. Spell also supports GCP and Azure, both of which have their own (differently named) spot instance implementations, but our support for those platforms is much more recent than it is for AWS.
The data and the model
To begin, I queried our internal read replica to fetch the run time and machine type of every Spell run executed on an AWS spot instance. After some light cleanup, this netted 12,535 total runs.
Run times are roughly exponentially distributed: short runs of a few minutes or even a few seconds (e.g. in the case of user error) are much more common than long-lived runs lasting hours or longer. Additionally, AWS tries very hard not to interrupt spot instances too soon after they've been acquired, deeming it a bad user experience. The combination of these two factors means that, as of time of writing, only 420 Spell runs were interrupted before they could finish executing user code, or roughly 3% of the total.
Two other fun facts. The most common machine types used were base CPUs (c5.large), V100 GPUs (p3.2xlarge), and K80 GPUs (p2.xlarge), in that order. Also, our longest-lived spot instance job ran for 14 days.
I then modeled the data using a Kaplan–Meier estimator. This is a non-parametric statistical model capable of generating a survival function (a function estimating the probability that any individual run would run for a specified time without being interrupted) from right-censored data. Here, a run is right-censored if it completed or failed on its own before any interruption occurred: we know it survived at least that long, but not how much longer it would have lasted. Kaplan–Meier is the best known of a family of techniques which aim to answer this specific statistical question, many of which are implemented in the excellent lifelines Python library I used for this analysis.
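The analysis itself used lifelines. The sketch below instead spells out the Kaplan–Meier estimator by hand in plain Python, so the mechanics are visible. At each time where an interruption occurred, the running survival probability is multiplied by the fraction of still-running jobs that were not interrupted. The function name and the toy data are made up for illustration and are not Spell's data.

```python
from collections import Counter


def kaplan_meier(durations, interrupted):
    """Return [(time, survival_probability)] at each time an interruption occurred.

    durations:   how long each run lasted (hours)
    interrupted: True if the run ended in a spot interruption,
                 False if it ended on its own (right-censored)
    """
    interruptions = Counter(t for t, e in zip(durations, interrupted) if e)
    exits = Counter(durations)  # every run leaves the at-risk set when it ends
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = interruptions.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy data: eight runs, three of which were interrupted.
durations = [1, 2, 2, 3, 5, 8, 8, 12]
interrupted = [False, True, False, False, True, False, True, False]

curve = kaplan_meier(durations, interrupted)
for t, s in curve:
    print(f"S({t}h) = {s:.3f}")
```

On this toy input the estimate steps down at hours 2, 5, and 8, to roughly 0.857, 0.643, and 0.429. Censored runs never cause a step on their own. They only shrink the at-risk count used in later steps.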
Fitting the model to all of the data nets us the following curve:
In this plot, the dark orange line is the estimated survival probability (the y-axis: 0 meaning definitely dead/interrupted, 1 meaning definitely alive), and the light orange lines are the boundaries of the 95 percent confidence interval. The x-axis is the time, in hours, since the run began; this chart ends at the 48-hour mark (i.e. a runtime of two days). The model naturally becomes less confident over time, as the number of runs still executing user code after 24 hours or so is quite small.
According to this chart, the probability of holding onto a spot instance on AWS for 8 hours is approximately 90%. The probability of holding onto a spot instance for 16 hours is approximately 80%. After 24 hours, the probability dips to roughly 70%. Once you go over 24 hours, it's a toss-up.
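One practical way to use these numbers is to ask a conditional question: given that a run has already survived some number of hours, how likely is it to reach a later mark? That is the ratio of the two survival probabilities. The sketch below uses only the approximate values quoted above; the helper name is illustrative.

```python
# Approximate survival probabilities quoted above (hours -> probability).
survival = {8: 0.90, 16: 0.80, 24: 0.70}


def conditional_survival(survival, survived_to, target):
    """P(still running at `target` hours | still running at `survived_to` hours)."""
    return survival[target] / survival[survived_to]


p_8_to_16 = conditional_survival(survival, 8, 16)
p_16_to_24 = conditional_survival(survival, 16, 24)
print(f"8h -> 16h:  {p_8_to_16:.0%}")
print(f"16h -> 24h: {p_16_to_24:.0%}")
```

With these rounded inputs, both conditional probabilities come out a little under 90%.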
However, some qualifications need to be made.
First of all, it's important to note that often the difficulty in working with spot instances isn't interruptions; it's not being able to acquire one at all. During busy times, AWS may pull most or even all spot instances off the market, resulting in very long queue times for acquiring one. Some very rare instance types are just extremely hard to get, period. We have had this experience with the K80x8 (p2.8xlarge) instance type in particular. This chart does not take into account machine request queueing time, which adds further to wall-clock time.
Second, the data used here is an agglomeration of every spot instance ever used on Spell. This means that it is not representative of a single instance type on AWS, or of AWS spot instances generally. Spell is an MLOps platform, and our users are data scientists and machine learning engineers executing compute jobs on hardware ranging from base CPUs to V100x8 GPU super-servers. This curve is representative of spot interrupt risk for an average machine learning project.
Third, the data runs from June 2019 to the present, and doesn't take into account any variation in spot behavior within that time.
Looking at individual instance types
In reality, spot interrupt behavior varies somewhat by instance type—e.g. the survival curve for a c5.large will look different from that of a p3.2xlarge. Let's take a look at that now, for a subset of instance types popular on Spell.
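Splitting the analysis by instance type amounts to grouping the runs and estimating survival separately for each group. The following is a rough plain-Python sketch of that step, not the code used for the charts in this post. The record layout, the `survival_at` helper, and the toy records are all hypothetical.

```python
from collections import defaultdict


def survival_at(runs, horizon):
    """Kaplan-Meier estimate of P(not interrupted within `horizon` hours).

    runs: list of (duration_hours, was_interrupted) tuples.
    """
    by_time = defaultdict(lambda: [0, 0])  # time -> [interruptions, total exits]
    for duration, was_interrupted in runs:
        by_time[duration][0] += int(was_interrupted)
        by_time[duration][1] += 1
    at_risk = len(runs)
    survival = 1.0
    for t in sorted(by_time):
        if t > horizon:
            break
        interruptions, exits = by_time[t]
        survival *= 1.0 - interruptions / at_risk
        at_risk -= exits
    return survival


# Hypothetical records: (machine_type, duration_hours, was_interrupted).
records = [
    ("c5.large", 0.2, False), ("c5.large", 8.0, True),
    ("c5.large", 12.0, False), ("c5.large", 30.0, False),
    ("p3.8xlarge", 1.5, True), ("p3.8xlarge", 4.0, True),
    ("p3.8xlarge", 6.0, False), ("p3.8xlarge", 10.0, False),
]

runs_by_type = defaultdict(list)
for machine_type, duration, was_interrupted in records:
    runs_by_type[machine_type].append((duration, was_interrupted))

estimates = {m: survival_at(r, horizon=6.0) for m, r in runs_by_type.items()}
for machine_type, p in estimates.items():
    print(f"{machine_type}: P(survive 6h) = {p:.2f}")
```

With only a handful of runs per type, the resulting estimates are very noisy, which is one reason the per-type curves deserve more caution than the pooled one.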
First up, our basic CPU instance, c5.large:
There's an interesting cliff right around the 8-hour mark, followed by a long period where we didn't see any interrupts at all.
Next, V100s (p3.2xlarge):
This survival curve is well approximated by a straight line. Your odds of holding onto a V100 instance for 24 hours (our recommended maximum runtime) are good, around 80%.
Next, V100x4 (p3.8xlarge) instances.
This quad-V100 p3.8xlarge instance seems to be much harder to hold onto than the single-V100 p3.2xlarge. After six hours, the survival probability has already dipped to 60%.
Finally, let's look at what was, until November 2020, the most powerful GPU instance type generally available on AWS, the eight-GPU p3.16xlarge:
The behavior for this instance type looks to be very similar to that of the p3.8xlarge.
Overall, there seems to be quite a bit of variation in how individual instance types perform when it comes to spot instance interrupts, with beefy multi-GPU instances seeing interrupts much more often and more quickly than single-card ones do.
That concludes our analysis!
Looking to get the most out of your GPU cloud compute? Spell has spot support and auto-resumption built right into our SDK, which makes it easy to build model training scripts that are robust to spot interruptions. To learn more, refer to our docs.