Key Concepts for Deploying Machine Learning Models to Mobile

Introduction

The worlds of mobile development and machine learning are quite far apart. From programming languages and system architecture to the amount of specific knowledge needed to truly understand neural networks, the skill sets involved in mobile dev and machine learning are disparate. Training a fast, accurate machine learning model and building a killer user experience are very different problems.

In this blog post, we want to focus our attention on the ML side of things, outlining the key considerations in building a mobile-focused ML pipeline from end-to-end. Though we’ve tried to cover the core concepts and processes you’ll need to focus on, it’s important to keep in mind that you’ll likely need to dig deeper into each step of the lifecycle as you progress.

Part 1: Working with Data

Collecting and Labeling an Initial Dataset

Gathering an initial dataset is the first step in any machine learning project—regardless of where the resulting model(s) will be deployed. When targeting mobile devices, though, it’s particularly important to think carefully about the conditions in which applications will be used and augment training data accordingly.

For example, a pristine set of images scraped from the web leads to models that will likely perform poorly in the real world. The perfectly lit, color-balanced, and well-focused images taken with professional camera equipment that populate the web look very little like the blurry, dark, poorly-framed photos that your users will be taking with their phone’s camera.

As a result, models trained on open-source data alone often see their accuracy plummet once they leave R&D. On top of that, many open-source datasets come with restrictive licenses that prevent their use in commercial applications.

Given these challenges, what’s the path forward? 

First, there’s the old-fashioned, boots-on-the-ground approach: manually collect and label thousands of images of your target data type. This can be made easier by common image augmentation techniques (blurring, dimming, adding noise) that are baked into most modern machine learning frameworks, and by the high-quality sensors on mobile phones themselves. If you wish to proceed with this approach, you’ll need to make sure you have the resources and time to build a sophisticated pipeline for manually collecting and annotating images.
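
As a rough illustration, here’s what a couple of these augmentations might look like in a TensorFlow input pipeline. This is a minimal sketch, and the parameter values are placeholders rather than a tested recipe:

```python
import tensorflow as tf

def mobile_augment(image):
    """Apply camera-like distortions to a float image in [0, 1]."""
    image = tf.image.random_brightness(image, max_delta=0.3)    # dim or overexpose
    image = tf.image.random_contrast(image, lower=0.6, upper=1.4)
    noise = tf.random.normal(tf.shape(image), stddev=0.02)      # sensor noise
    return tf.clip_by_value(image + noise, 0.0, 1.0)

# Typically applied on the fly inside a tf.data pipeline:
# dataset = dataset.map(lambda img, label: (mobile_augment(img), label))
```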

But for the vast majority of mobile machine learning projects, that budget, both in terms of time and money, is simply unviable. One promising alternative (though not a silver bullet) is to leverage what’s known as synthetic data.

Non-realistic synthetic images used to train a robot to pick up objects in the real world (PDF link)

Put simply, synthetic data is data that’s generated programmatically. This could include photorealistic images of objects in arbitrary scenes that are rendered using video game engines, or audio generated by a speech synthesis model from known text. What’s more, the generative process can automatically augment this synthetic data with, for example, crops, flips, rotations, and distortions for visual data.
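
As a toy illustration of the idea, the sketch below generates labeled images programmatically with Pillow. A real pipeline might use a game engine or a speech synthesizer instead, but the principle is the same: the label comes free with the render.

```python
import random
from PIL import Image, ImageDraw, ImageFilter

def synth_sample(size=224):
    """Render one labeled training image: a circle or square on a random background."""
    img = Image.new("RGB", (size, size),
                    tuple(random.randint(0, 255) for _ in range(3)))
    draw = ImageDraw.Draw(img)
    x0, y0 = random.randint(0, size // 2), random.randint(0, size // 2)
    box = [x0, y0, x0 + random.randint(30, size // 2), y0 + random.randint(30, size // 2)]
    label = random.choice(["circle", "square"])
    color = tuple(random.randint(0, 255) for _ in range(3))
    (draw.ellipse if label == "circle" else draw.rectangle)(box, fill=color)
    # Augment at generation time: blur simulates a defocused phone camera.
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2)))
    return img, label
```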

Building a Ground-Truth Data Collection Pipeline

Ideally, your initial dataset will be robust, diverse, and will address all the edge cases your application requires—but this is rarely the case, especially for more mature applications that encounter a wide range of user input data.

To address this issue, and to ensure you’re able to iterate on models in production effectively, you’ll need to set up a system to collect and label ground-truth data, either collected and shared by users (be sure to ask their permission) or through internal testing.

The closer your ground-truth data is to what your model sees in the wild, the more effective retraining will be, and the more likely your retrained models will perform well across devices.

Part 2: Model Training 

Today, most model training happens in the cloud. Datasets are often large, and optimizing the hundreds of millions of parameters in a neural network requires a lot of processing power. In the future, AI-specific mobile processors will enable training directly on mobile devices, keeping user data private and secure. For now, a variety of cloud-based training platforms like Spell support exporting trained models directly to mobile-friendly formats like Apple’s Core ML or Google’s TensorFlow Lite.
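
For a sense of what that export step looks like, here’s a hedged sketch that converts the same Keras model to both formats. It assumes recent versions of TensorFlow and coremltools; exact APIs vary by release:

```python
import tensorflow as tf
import coremltools as ct

model = tf.keras.applications.MobileNetV2(weights="imagenet")

# TensorFlow Lite: serialize a FlatBuffer the TFLite runtime can load on Android.
tflite_bytes = tf.lite.TFLiteConverter.from_keras_model(model).convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_bytes)

# Core ML: convert the same network for iOS deployment.
mlmodel = ct.convert(model, convert_to="mlprogram")
mlmodel.save("Model.mlpackage")
```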

Regardless of where your model is being trained, the best results are achieved when the training environment matches deployment as closely as possible. This means making sure that ML teams apply mobile-specific data augmentation like motion blur and simulate optimizations like quantization (covered in more detail below) inside the training loop. Keeping track of metadata related to each model you train—including the datasets used, hyperparameters, the platform it’s targeting, and any other versioning information—is critical for evaluating all experiments and picking the best one.

This can be a lot to keep track of, but there’s more when it comes to building models specifically for mobile. For example, machine learning models rarely take raw data as input or output predictions in a usable format. Sophisticated pre- and post-processing code is required to bridge that gap, both during training and at inference time. That code generally lives outside of the model and needs to be ported over to mobile platforms as well. Training a single model can take days and potentially cost thousands of dollars, so careful preparation and organization are key to reducing mistakes and shortening development time.

For computer vision applications, pre-processing code can include things like handling input image rotation and normalizing pixel values and other image-dependent variables. Post-processing code often takes large matrices of probabilities and pulls out specific predictions about what objects are in an image. While there are many great libraries for these types of mathematical operations in server-side frameworks, there is very little support on mobile platforms. A single line of post-processing code written in Python might require dozens of lines of complicated Java or Swift code to enable ML models to correctly handle inputs and outputs on mobile.
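
To make the asymmetry concrete, here’s the Python side of a typical classification post-processing step, which is nearly a one-liner with NumPy; replicating it on mobile means hand-writing the sorting and indexing logic:

```python
import numpy as np

labels = ["cat", "dog", "car"]           # hypothetical class names
probs = np.array([[0.1, 0.7, 0.2]])      # stand-in for model.predict(image)

# Top prediction, plus the full ranking, in two short lines.
best = labels[int(np.argmax(probs[0]))]
ranking = [labels[i] for i in np.argsort(probs[0])[::-1]]
```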

Even though there’s an ever-growing number of powerful open source models for a wide range of use cases, preparing mobile-ready variants means more than simply converting them to a mobile-ready format (covered in more detail below).

Training performant, mobile-ready models from scratch, or even from an open source implementation, can take lots of time and effort, especially if your use case requires a high degree of customization in how inputs are fed to the model and how its outputs are handled.

If your use case is broad enough and the ML experience you’re trying to build doesn’t need a high level of customization, then it’s possible you might be able to use models that have already been trained and converted into mobile-ready formats. 

Both TensorFlow Lite and Core ML have a set of pre-trained and already-converted models that are ready to be dropped into mobile projects, as do cross-platform frameworks like Google’s ML Kit and Fritz AI. However, these models typically come as they are—you won’t be able to add new objects to an object detection model or new classes to an image recognition model. While workable in some cases, the limitations here are clear.

Sometimes pre-trained models just won’t do. When training your own model, there are a few things to keep in mind at each step of the training process:

  1. Apply mobile-specific data augmentation. For computer vision models, that means effects like motion blur, camera noise, and exposure changes.
  2. Apply optimizations (more on those below) during training to minimize accuracy losses.
  3. Always hold back a fraction of your most representative data for validation and testing.
  4. Automate the tracking of training configurations for easy analysis later (see the sketch after this list).
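
Here’s a minimal sketch of what that automated tracking might look like. The log_run helper and its fields are hypothetical; tools like MLflow or your training platform can handle this for you:

```python
import hashlib
import json
import pathlib
import time

def log_run(hyperparams, dataset_path, target_platform, metrics):
    """Write a per-run manifest so every trained model can be traced
    back to its data, settings, and target platform."""
    manifest = {
        "timestamp": time.strftime("%Y-%m-%dT%H-%M-%S"),
        "dataset_sha1": hashlib.sha1(
            pathlib.Path(dataset_path).read_bytes()).hexdigest(),
        "hyperparams": hyperparams,
        "target_platform": target_platform,   # e.g. "coreml" or "tflite"
        "metrics": metrics,
    }
    pathlib.Path("runs").mkdir(exist_ok=True)
    out = pathlib.Path("runs") / f"{manifest['timestamp']}.json"
    out.write_text(json.dumps(manifest, indent=2))
```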

To speed up this process, development teams can also use what’s known as transfer learning to short-circuit a sizable amount of the computation resources needed to build a mobile-ready model. Transfer learning works by taking a model that’s already been trained for one task and using it as a starting point for building the model for your use case. 
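
As a hedged sketch of that idea in Keras: freeze a MobileNetV2 backbone pretrained on ImageNet and train only a small classification head on your own data (the five classes here are hypothetical):

```python
import tensorflow as tf

# Reuse features learned on ImageNet...
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pretrained backbone

# ...and train only a small head for your own classes.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```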

Part 3: Optimizing Models for Mobile

Optimizing models for mobile usage is critical to maintaining smooth, reliable user experiences. Milliseconds matter for cameras processing live video. Unfortunately, the vast majority of ML research has taken the availability of large GPUs and lots of memory for granted. The result has been impressively accurate models at the expense of computation and storage costs. Building models that run on battery-efficient mobile processors requires taking advantage of optimization techniques such as selecting the correct model architecture, model pruning, and compression.

Model Architecture

Choosing the right architecture is one of the most important decisions you’ll have to make. Many popular neural networks such as VGG or Mask R-CNN rose to fame thanks to their incredibly accurate predictions. Unfortunately, these models often contain hundreds of millions of parameters and can take up as much as 500MB of disk space. This isn’t going to cut it for mobile devices. Instead, mobile machine learning use cases require smaller, more efficient architectures like MobileNet or SqueezeNet. These models take up a fraction of the space (5-15MB) while sacrificing only a few percentage points of accuracy.
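
For a quick sense of scale, Keras can report these parameter counts directly (approximate figures in the comments; passing weights=None skips the weight download):

```python
import tensorflow as tf

vgg = tf.keras.applications.VGG16(weights=None)
mobilenet = tf.keras.applications.MobileNetV2(weights=None)

print(f"VGG16:       {vgg.count_params():,} parameters")        # ~138 million
print(f"MobileNetV2: {mobilenet.count_params():,} parameters")   # ~3.5 million
```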

It’s also important that architecture selection takes specific layers and mathematical operations into account. State-of-the-art models from the latest papers may have great performance on the latest generation of GPUs, but if mobile hardware doesn’t support the specific calculations made within the model, they may not run at all or will be relegated to the CPU, making them unusable for your app.

Model Pruning & Distillation

To extract even more efficiency from models, techniques such as model pruning and distillation can be applied. It turns out that only a very small fraction of a neural network’s parameters are responsible for accurate predictions. Pruning techniques iteratively remove useless parameters during training, resulting in smaller, faster models, without a loss of accuracy. 

Visualization of model pruning. Left: LeCun et al., NIPS ’89; Right: Han et al., NIPS ’15 (Image Source)
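
As one concrete route (not the only one), the TensorFlow Model Optimization toolkit implements magnitude-based pruning. The sketch below gradually zeroes out the lowest-magnitude weights during fine-tuning; the schedule values are illustrative:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.applications.MobileNetV2(weights=None)

# Ramp sparsity from 0% to 50% of weights over 1,000 training steps.
pruned = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5,
        begin_step=0, end_step=1000))

pruned.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# The callback keeps the pruning step counter in sync with training:
# pruned.fit(train_ds, epochs=2,
#            callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```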

Similarly, a technique known as knowledge distillation can be used to train much smaller networks that mimic the results of larger, more accurate ones. Both of these techniques can be used to shrink the size and improve the speed of models by 20-50%, with minimal impact on accuracy.

Quantization

The last important mobile optimization technique is quantization. By default, most machine learning models are trained with parameters stored as 32-bit floating point numbers. In practice, there is no reason for calculations to be accurate out to the 8th decimal place. Quantizing model parameters to 8-bit integers or smaller can reduce model size by a factor of 4 or more while improving speed. Amazingly, if quantization is simulated during training, this compression results in almost no loss in accuracy.
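
A sketch of the post-training variant using the TensorFlow Lite converter is below; for the simulated-during-training variant, look at quantization-aware training (e.g. tfmot.quantization.keras.quantize_model):

```python
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights="imagenet")
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Post-training quantization: store weights as 8-bit integers,
# cutting model size roughly 4x.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_bytes = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(quantized_bytes)
```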

Part 4: Model Conversion

Models written and trained in server-side frameworks and languages aren’t always compatible with mobile devices. Mobile platforms like iOS and Android require specific formats to take advantage of hardware acceleration. For example, while it is possible to run TensorFlow Lite models on iOS, Apple’s proprietary Core ML format is the only way to run models on the Apple Neural Engine (ANE), which boasts speedups of 10X and power reductions of 9X over models run on the GPU.

Converting models for each target platform can be a tedious, fragile, and time-consuming process that requires repetitive code. Mobile frameworks for machine learning are still in their infancy and standards can take years to adopt. What’s more, updates to either model training frameworks or common conversion tools can cause breaking changes that require workarounds and make certain kinds of customization immensely difficult.

Here are the key things to keep in mind regarding model conversion:

  • Not all neural network architectures will convert to mobile formats: This means you’ll need to test model conversion early and often to ensure that the mathematical operations underlying deep learning architectures are supported by mobile model formats (see the smoke test sketched after this list).
  • Stay up to date with the newest converter releases: The newest versions generally offer the most robust support and performance, but they can be unstable and should be tested regularly.
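
A minimal conversion smoke test might look like the following. Running something like this in CI catches unsupported operations before they reach a user’s device:

```python
import numpy as np
import tensorflow as tf

# Load the converted model and run a dummy input through it.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
dummy = np.random.rand(*inp["shape"]).astype(np.float32)

interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()  # fails here, not in production, if an op is unsupported
print("output shape:", interpreter.get_tensor(out["index"]).shape)
```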

Part 5: Model Deployment

Every mobile platform has a different set of APIs for integrating and executing models in an app. Making these methods and any pre- and post-processing consistent across platforms improves the maintainability of your code and reduces errors. 

You also have to choose whether to bundle your models with your application code or download them at runtime. Bundling makes for a smoother user experience but results in a larger package size. Downloading at runtime offers more flexibility but increases bandwidth usage.

Beyond these technical considerations, there’s an overarching product dynamic to remember when it comes to mobile ML: model rollouts are feature rollouts. Best practices for shipping products still apply. Updates should roll out over the air at times when they won’t disrupt users, be released to a small fraction of devices to make sure performance is acceptable, and A/B tested in situations where user behavior might be impacted.

This is especially true when deploying your models cross-platform. Cross-platform deployment requires special attention to each target platform and to the range of devices your models will run on. In some cases, different model variants might be more or less suitable for different devices, depending on size, specific hardware constraints/opportunities, and more.

Part 6: Model Monitoring

Once you’ve deployed your model into a production app, it might seem like all the work has been done—however, you’ll need to closely monitor your model (or models) in production.

Unlike cloud environments where you have a robust logging infrastructure, deploying onto mobile devices often leaves you flying blind. 

You’ll need a system for measuring runtime performance, memory usage, battery drain, and accuracy—all of which must account for the heterogeneous hardware your app runs on.

Additionally, it’s essential to create a workable system for tracking, tagging, and accessing model versions so that you have a full picture of your mobile ML project throughout the development lifecycle. Traditional cloud-based training platforms include some of this capability, but there’s extra difficulty involved when creating model variants that include both Core ML and TensorFlow Lite versions, as well as original model files, model training checkpoints, and more.

To illustrate this, it might be helpful to look at a hypothetical use case.

For instance, if your model is supposed to analyze video in real time (i.e. runs at 30 FPS on all target mobile devices), there are a number of steps you’ll need to take in terms of monitoring and managing your workflow. Specifically, you’ll need to set up an alert system that, at minimum, does the following (see the sketch after this list):

  • Notifies you when there are significant changes to input data.
  • Generates alerts for predictions that may indicate a failure.
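
Server-side, such checks might look something like the rough sketch below; the statistics and thresholds are placeholders you’d derive from your own training data:

```python
import numpy as np

TRAIN_MEAN, TRAIN_STD = 0.45, 0.22   # input stats recorded at training time
CONF_FLOOR = 0.30                    # below this, predictions look like failures

def check_batch(inputs, confidences):
    """Flag drifted inputs and suspicious predictions aggregated from device logs."""
    alerts = []
    if abs(np.mean(inputs) - TRAIN_MEAN) > 3 * TRAIN_STD:
        alerts.append("input distribution has shifted from training data")
    if np.median(confidences) < CONF_FLOOR:
        alerts.append("model confidence collapsed; possible failure mode")
    return alerts
```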

Additionally, to ensure you’re providing the best experience regardless of device, you’ll need to institute a well-conceived system for tracking and distributing model versions that includes:

  • Tags and metadata that indicate important version details
  • A system for segmenting models by format (e.g. Core ML, TensorFlow Lite)

Part 7: Putting it all together

Model Training and Iteration

Given all of the systems and lifecycle steps we’ve addressed above, the last thing to keep in mind when it comes to mobile ML is that the first version of a model is almost never the final version.

Using the systems you’ve established throughout the lifecycle, you’ll need to decide when, how, and on what data new model versions should be trained—whether to address new edge cases, deploy more performant models, or take advantage of new hardware or runtimes.

The more work you’ve put in upfront to build a reproducible, scalable mobile ML pipeline, the easier the process will become, and you’ll be able to effectively iterate on your ML models as your mobile app scales.

For more info on the burgeoning mobile AI space, download The State of Mobile Machine Learning, co-authored by Fritz AI and Spell, for insights from 500+ industry leaders!

Ready to Get Started?

Create an account in minutes or connect with our team to learn how Spell can accelerate your business.