Spell was founded in January 2017, a lifetime ago in the ML tooling ecosystem. At that time, the now-influential blog post introducing one of the earliest end-to-end machine learning systems, Uber Michelangelo, was still about a year away. AWS SageMaker (by virtue of its AWS built-in status, something of a reference point for ML systems everywhere) was not to be released until November 2017. P2 instances backed by K80 GPUs, the first AWS instance type specifically targeted at deep learning, had only just reached GA.
If you were a small or mid-sized company looking to build a machine learning platform in early 2017, you were basically stuck rolling your own software and hardware. Most companies found doing this in any centralized way untenable, so deep learning at companies not named "Google" or "Facebook" in practice meant tethering yourself to a physical box under your desk. These super-rigs were expensive: Andrej Karpathy’s rig at the time clocked in at ~$6k. As a result, a lot of engineering time that could have been spent developing models went instead towards watching GPU market prices, finding and sharing optimal specs, and dealing with the works-on-my-machine issues that inevitably crop up when every engineer has a custom environment running on a differently-specced machine.
Today, in mid-2020, things are very different. AWS is now three generations deep into its optimized machine learning AMIs; the learnings from three years of running and optimizing GPU compute clusters have been passed down to end users in the form of cost savings and improved stability. There is now a constellation of startup and cloud vendor ML platform offerings (Spell is one of them!) vastly superior to what SageMaker offered circa 2017.
Nevertheless, nearly every company we’ve talked to still largely prototypes their ML models on their local machines.
In fact, some still manage their own GPU servers as well!
Adoption of cloud compute for CPU workloads is basically universal at this point. The argument for moving GPU jobs to the cloud is no less convincing. Developing models on local GPU incurs the devops overhead of machine and/or cluster management, negatively impacts model reproducibility, and implicitly limits your capacity to however many GPUs you have. Developing models on the cloud, on the other hand, turns machine management into a single API call, makes model reproducibility a cinch (your p2.xlarge == my p2.xlarge), and lets you scale GPU usage from zero to wallet-shredder whenever you need it.
So what gives? Why isn’t the entire industry running on cloud GPUs yet?
The answer is simple: cost.
Cloud GPU prices are down, but not by enough for all use cases
To understand the relative costs of cloud and local GPUs, let’s look at a benchmark.
At time of writing, the most commonly used GPU on AWS is the NVIDIA V100. This is a specialty card designed for high-precision HPC workloads in data centers, making it mind-bogglingly expensive: a single V100 goes for ~$8000.
Let’s compare the V100 with the best graphics card currently on the consumer market, the NVIDIA 2080 Ti. According to PCPartPicker, as of time of writing, the 2080 Ti retails for around $1200. A thorough benchmark by the folks at Lambda Labs found that the 2080 Ti gets 80% of the performance of a V100 on typical ML training workloads. Matching a V100's performance with 2080 Tis therefore costs about $1500 ($1200 / 0.8), against ~$8000 for the V100 itself. In other words, the graphics card cloud vendors are using is over five times more expensive pound-for-pound than its consumer equivalent.
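As a quick sanity check on the "over five times" figure, here is a minimal sketch of the price-per-performance arithmetic. It uses only the prices and the 80% benchmark figure quoted above; the function and constant names are made up for illustration.

```python
# Figures quoted in the text; names are illustrative only.
V100_PRICE_USD = 8000
RTX_2080_TI_PRICE_USD = 1200
RTX_2080_TI_RELATIVE_PERF = 0.80  # fraction of V100 training throughput


def price_per_v100_equivalent(price_usd: float, relative_perf: float) -> float:
    """Dollars needed to buy one V100's worth of training throughput."""
    return price_usd / relative_perf


v100_cost = price_per_v100_equivalent(V100_PRICE_USD, 1.0)
consumer_cost = price_per_v100_equivalent(
    RTX_2080_TI_PRICE_USD, RTX_2080_TI_RELATIVE_PERF
)
premium = v100_cost / consumer_cost

print(f"V100:    ${v100_cost:,.0f} per V100-equivalent")
print(f"2080 Ti: ${consumer_cost:,.0f} per V100-equivalent")
print(f"Premium: {premium:.2f}x")
```

The premium works out to a little over 5.3x, which is where the "over five times" claim comes from.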
Of course, AWS cannot use a consumer card in their data center. I am sure they’re getting huge volume discounts from NVIDIA, but they’re still paying a premium (relative to direct-to-consumer cards) somewhere well north of 200%.
These costs are ultimately passed down to the consumer.
At time of writing, the going rate for a p3.2xlarge instance (backed by a single V100 GPU) on AWS is $3.06/hour on-demand and $0.918/hour for spot. Using the 2080 Ti as a benchmark, we can say that a desktop machine with specs and performance in the ballpark of the p3.2xlarge can be had for ~$2000.
If we do the math (2000 / 0.918 * 1.25, where the factor of 1.25 accounts for the 2080 Ti running at 80% of the V100's speed), we find that it takes roughly 2725 hours of training time before the cost of using a p3.2xlarge spot starts to exceed the cost of building your own machine.
This might sound like a lot, but it’s really not. 2725 hours is about 114 days of round-the-clock training. Assuming you’re training a model on the machine 33% of the time, you will amortize the cost of buying your own machine within the year. Even assuming you’re training something just 15% of the time, it would still only take a bit over two years before the local box is "worth it".
And that's using spot instances! If you're using on-demand instances, which are typically 3x as expensive, the cost amortizes in months.
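The break-even arithmetic above can be reproduced with a short script. This is a rough sketch, not a full cost model: it uses the $2000 machine price, the $0.918/hour and $3.06/hour rates, and the 80% relative performance quoted in the text, and it assumes (as the text implicitly does) that electricity, maintenance, and resale value are ignored. The function names are illustrative.

```python
HOURS_PER_DAY = 24


def break_even_local_hours(
    machine_cost_usd: float,
    cloud_rate_usd_per_hour: float,
    local_relative_perf: float,
) -> float:
    """Local training hours after which renting would have cost more than buying.

    The machine price buys machine_cost / cloud_rate hours of cloud time.
    The local card is slower, so doing that same amount of work locally
    takes 1 / local_relative_perf times as many hours.
    """
    cloud_hours = machine_cost_usd / cloud_rate_usd_per_hour
    return cloud_hours / local_relative_perf


def calendar_days_to_break_even(local_hours: float, utilization: float) -> float:
    """Calendar days needed if the machine trains only a fraction of the time."""
    return local_hours / HOURS_PER_DAY / utilization


MACHINE_COST_USD = 2000
LOCAL_RELATIVE_PERF = 0.80  # 2080 Ti versus V100, per the benchmark cited above

for label, rate in [("$0.918/hour", 0.918), ("$3.06/hour", 3.06)]:
    hours = break_even_local_hours(MACHINE_COST_USD, rate, LOCAL_RELATIVE_PERF)
    print(f"At {label}: break-even after {hours:,.0f} training hours")
    for utilization in (1.0, 0.33, 0.15):
        days = calendar_days_to_break_even(hours, utilization)
        print(f"  at {utilization:.0%} utilization: {days:,.0f} calendar days")
```

At the spot rate this gives a break-even a little over 2700 training hours, under a year at 33% utilization and a bit over two years at 15%. At the $3.06/hour rate the break-even drops to a few months at 33% utilization.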
After going through these calculations, most machine learning teams go ahead and build the deep learning box. 📦
How industry best practices have evolved
Most machine learning teams we talk to get the best of both worlds with a hybrid strategy: using local GPUs and cloud GPUs simultaneously.
The greatest weakness of cloud GPU is cost; its greatest advantage is scalability.
Imagine you’re running a hyperparameter search job. Locally, you’ll have to run the training jobs in sequence, one at a time. On the cloud, you spin up dozens of machines and run them in parallel. What used to take you several days to do is now a lunch break. In the data engineering world, this is known as horizontal scaling, and it’s awesome.
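To make the "several days versus a lunch break" comparison concrete, here is a back-of-the-envelope sketch. The search size (48 runs at 1.5 hours each) is a made-up example, not a figure from any real workload, and the model assumes every run takes the same time and ignores machine startup overhead.

```python
import math


def wall_clock_hours(num_runs: int, hours_per_run: float, num_machines: int) -> float:
    """Elapsed time when runs are spread evenly across identical machines."""
    batches = math.ceil(num_runs / num_machines)
    return batches * hours_per_run


def machine_hours(num_runs: int, hours_per_run: float) -> float:
    """Total compute consumed, which does not depend on the machine count."""
    return num_runs * hours_per_run


# Hypothetical hyperparameter search: 48 runs, 1.5 hours each.
NUM_RUNS = 48
HOURS_PER_RUN = 1.5

local = wall_clock_hours(NUM_RUNS, HOURS_PER_RUN, num_machines=1)
cloud = wall_clock_hours(NUM_RUNS, HOURS_PER_RUN, num_machines=NUM_RUNS)

print(f"One local machine: {local:.1f} hours ({local / 24:.1f} days)")
print(f"{NUM_RUNS} cloud machines: {cloud:.1f} hours")
print(f"Compute used either way: {machine_hours(NUM_RUNS, HOURS_PER_RUN):.1f} machine-hours")
```

Under this simple model the total machine-hours are identical in both cases. Parallelism buys you wall-clock time, not a smaller compute bill.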
Equally awesome is vertical scaling. Decide you need an unusual amount of machine power for a specific task? You can scale up to machines with as many as eight V100 GPUs on them. And if you need even more than that, you can start to invest in distributed training schemes, à la Horovod, that come with built-in support on the major cloud vendors.
The machine learning project workflow typically consists of an iterative mixture of interactive model prototyping (usually in a Jupyter notebook), launching/monitoring the progress of one or more model jobs, and the occasional large-scale hyperparameter search or production-level training job.
Interactive model prototyping, which typically has light compute requirements, is usually done on one’s own local machine. Model training jobs are executed on the local machine if the model is small enough, but non-trivially large training jobs go to the cloud. Large-scale model training jobs (a la hyperparameter search) are almost always executed on the cloud. If you’ve never executed 32 simultaneous model training jobs on the cloud, have you even lived? 😎
How Spell is evolving with it
One of the core tenets of our product philosophy at Spell is minimizing user friction. Even though we’re obviously big believers that the future of machine learning is in GPUs running on the cloud, we firmly understand that most machine learning workflows will continue to be local-first, cloud-next for years to come.
That’s why this week we’re excited to launch Spell For Private Machines. This new component of our product lets you execute model training jobs on your local machines that are connected to and manageable from the Spell CLI and web console.
We provide a Debian package, spell-worker-service.deb, that you can download and install on any Debian-based *NIX system (most prominently, Ubuntu). Installing it adds a new spell-worker daemon to machine startup. Once you’ve assigned a NAME to the machine in the Spell web console, you will be able to submit jobs to the machine using the same spell run -m NAME [...] syntax you would use to submit the job to a cloud GPU.
All of your machine learning training jobs across both environments can now be managed using the Spell toolchain, meaning there is no more fragmentation between your local and remote training environments!