Spell runs, workspaces, and model servers allow you to seamlessly mount data from object storage (e.g. AWS S3, GCP GCS, Azure Blob Storage) using the --mount flag:
$ spell run --mount s3://my-bucket/foo/:/mnt/foo/ -- python train.py
But how does this --mount magic actually work?
Today we're taking a peek under the covers, examining how mounts on Spell work. We'll discuss some of the optimizations that Spell has made to make mounted file reads performant. At the end I'll share some performance tips for making the most of your resource mounts on Spell.
How FUSE works
Before we can understand mounts on Spell, we need to first talk about a core UNIX technology: FUSE.
Every computer has a filesystem that handles the work of writing bytes to and reading bytes from the machine's attached storage media (usually a hard drive or an SSD). Filesystem management is one of the many important tasks handled by the kernel, the innermost piece of the operating system, which allocates resources to the applications running on the machine.
There is no one standard filesystem implementation. Most Linux distributions use a filesystem implementation called ext4, but there are alternatives like Btrfs and ZFS with different design trade-offs and performance characteristics.
The kernel filesystem on a machine can be extended one step further using FUSE. FUSE, which stands for Filesystem in Userspace, is an interface in the Linux kernel that allows users to implement and mount custom filesystems without writing kernel code. Once such a filesystem is mounted, FUSE automatically captures operations (read, write, ls, etcetera) performed against the mount point and forwards them to event handlers registered by the user's process. Thus the user (not the OS kernel) assumes responsibility for these operations, hence the name, "in userspace".
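The forwarding pattern at the heart of FUSE can be sketched in a few lines of Python. This is a toy model, not the real FUSE API (real FUSE filesystems implement a C interface or use a binding such as fusepy); all class and path names here are illustrative:

```python
# Toy sketch of FUSE's dispatch model: a 'kernel' routes filesystem
# operations to handlers registered by a userspace process. The kernel
# layer owns no data itself; the userspace filesystem does all the work.

class ToyUserspaceFS:
    """Userspace 'filesystem' backed by a plain dict instead of a disk."""

    def __init__(self):
        self.files = {"/hello.txt": b"hello from userspace\n"}

    # Handlers the 'kernel' calls back into:
    def getattr(self, path):
        return {"size": len(self.files[path])} if path in self.files else None

    def read(self, path, size, offset):
        return self.files[path][offset:offset + size]


class ToyKernel:
    """Stands in for the kernel's FUSE layer: it only routes operations
    on a mount point to the registered userspace handlers."""

    def __init__(self):
        self.mounts = {}

    def mount(self, mountpoint, fs):
        self.mounts[mountpoint] = fs

    def read(self, path, size=4096, offset=0):
        for mountpoint, fs in self.mounts.items():
            if path.startswith(mountpoint):
                return fs.read(path[len(mountpoint):] or "/", size, offset)
        raise FileNotFoundError(path)


kernel = ToyKernel()
kernel.mount("/mnt/toy", ToyUserspaceFS())
print(kernel.read("/mnt/toy/hello.txt"))  # handled entirely in 'userspace'
```

The key property this illustrates: the routing layer is generic, while all filesystem behavior lives in the handlers, which is exactly what lets FUSE filesystems do arbitrary things (encryption, compression, network access) behind ordinary file operations.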
FUSE allows you to have one generalist filesystem implementation managing most of your machine resources, but use one or more specialized filesystems for specific resources that you want to manage in a special way. Example use cases include encrypting data at rest (improving security at the cost of performance), compressing data at rest (improving read performance at the cost of write performance), and providing access to networked data not actually located on the device.
As a specific example, consider Google's Cloud Storage FUSE, which uses the FUSE API to mount a GCS bucket to your local filesystem. This allows you to operate on objects in your cloud bucket as if they were located on your local machine (when in actuality they are stored in a Google data center somewhere).
How FUSE works at Spell
Spell runs, workspaces, and model servers allow our users to mount arbitrary bucket data to paths inside their machine's filesystem. Again, consider our example code:
$ spell run --mount s3://my-bucket/foo/:/mnt/foo/ -- python train.py
We use a version of goofys, an open-source FUSE filesystem written in Golang, to achieve this. Our version of goofys includes some modifications we’ve made to better suit our needs, some of which we've backported to the open source version on GitHub.
goofys abstracts cloud storage providers, allowing you to mount cloud storage buckets from any of the three major cloud vendors (AWS, GCP, and Azure) using the same tool. goofys (via the FUSE API) then intercepts any filesystem requests against those paths, transforming them into the corresponding network requests.
For example, suppose that we want to read some contents from s3://my-bucket/foo/bar.txt, mounted to /mnt/foo/bar.txt on our Spell machine, into memory. Using Python, we would begin by running e.g. fp = open("/mnt/foo/bar.txt"). The kernel forwards this request to FUSE, which forwards it to the event handler registered by goofys. goofys will kick off a network request to AWS S3 checking that this resource exists, before returning a file handle to FUSE, which passes it through to the Python caller.
When we later try to read some bytes from this file (fp.read(1024) or similar), goofys again intercepts the request, turns that into one or more network requests against AWS for those bytes, and returns the response payload to the caller.
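S3 supports ranged reads via the HTTP Range header, with inclusive byte positions ("bytes=first-last"). The translation from a filesystem read to such a request can be sketched as follows; goofys's real logic (readahead, buffer pooling) is more involved, and the function name here is illustrative:

```python
# Sketch of how a filesystem read maps onto an S3 ranged GET: the read's
# (offset, size) pair becomes an HTTP Range header with inclusive bounds.

def range_header(offset: int, size: int) -> dict:
    """HTTP header for reading `size` bytes starting at `offset`."""
    return {"Range": f"bytes={offset}-{offset + size - 1}"}

# e.g. fp.read(1024) on a freshly opened file becomes roughly:
#   GET /my-bucket/foo/bar.txt with header {"Range": "bytes=0-1023"}
print(range_header(0, 1024))
```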
goofys is a great starting point, but on its own it's not a good fit for our use case at Spell. The reason for this is caching.
In particular, for performance reasons, we wanted to implement cache prefetching. Cache prefetching is the pattern of fetching networked data the user is likely to need to a local cache ahead of time, e.g. before they actually request it.
Prefetching is possible on Spell because we know what files the user is interested in—it's the set of resources they've mounted to the device. As soon as machine execution starts, we want to use as much idle machine bandwidth as possible to load these files to disk, so that we hopefully have them before the reader requests them, thus short-circuiting additional network round trips at file open time.
Our solution, SpellFS, builds on goofys by adding exactly this functionality. SpellFS is a FUSE filesystem implementation that adds caching and readahead to goofys. At the heart of SpellFS is a simple cache mapping bucket paths to cache files.
At machine instantiation time, one of the services we spin up on our worker machines is a SpellFS daemon. This process reads the list of resources the user has mounted to the machine into a queue. A set of goroutines consume from this queue, downloading the files to a local cache directory (/var/cache/goofyscache).
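The daemon's queue-and-workers structure can be sketched in Python, with threads standing in for goroutines and a fake downloader standing in for goofys. The cache dict, paths, and worker count are illustrative, not SpellFS internals:

```python
# Minimal sketch of the prefetch daemon: mounted resources go into a
# queue, and a pool of workers drains it, downloading each resource
# into a local cache ahead of any read request.
import queue
import threading

CACHE = {}  # stands in for files under /var/cache/goofyscache

def fake_download(key: str) -> bytes:
    """Stand-in for a goofys-backed download of one bucket object."""
    return f"contents of {key}".encode()

def worker(q: "queue.Queue") -> None:
    while True:
        key = q.get()
        if key is None:  # sentinel: no more work for this worker
            q.task_done()
            return
        CACHE[key] = fake_download(key)  # land the resource in the cache
        q.task_done()

# The user's mounted resources seed the queue at machine start.
mounts = ["s3://my-bucket/foo/a.txt", "s3://my-bucket/foo/b.txt"]
q = queue.Queue()
for m in mounts:
    q.put(m)

workers = [threading.Thread(target=worker, args=(q,)) for _ in range(4)]
for t in workers:
    t.start()
for _ in workers:
    q.put(None)  # one sentinel per worker so each one exits
for t in workers:
    t.join()

print(sorted(CACHE))  # both mounted resources are now cached locally
```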
If a user opens a file descriptor into a resource that has not yet been cached, Spell falls back to serving that file directly from S3 using goofys. Once a resource lands in the cache, however, all subsequent reads for that file are redirected to the cache file. This is the ideal scenario, because no network round trips are needed.
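The read path then amounts to a cache check with a network fallback. A minimal sketch, with a dict standing in for cache files on disk and a stub standing in for a goofys-backed read (both names are illustrative):

```python
# Sketch of the read path: serve from the local cache when the prefetcher
# has already landed the file there, otherwise fall back to the network.

CACHE = {"s3://my-bucket/foo/bar.txt": b"cached bytes"}

def fetch_via_network(key: str) -> bytes:
    """Stand-in for a goofys read that costs network round trips."""
    return f"network bytes for {key}".encode()

def read(key: str) -> bytes:
    if key in CACHE:               # cache hit: no network needed
        return CACHE[key]
    return fetch_via_network(key)  # cache miss: go to object storage

print(read("s3://my-bucket/foo/bar.txt"))  # hits the cache
print(read("s3://my-bucket/foo/baz.txt"))  # falls back to the network
```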
Performance tips
Interested in optimizing your model training time on Spell? If you think that network bandwidth might be a limiting factor, here are some tips for making the most of your file cache.
First of all, note that SpellFS has a maximum cache size: half of the total free space on disk at machine initialization time. Additionally, SpellFS has no cache eviction policy, so once the cache is full, requests for uncached data will always fall through to goofys, incurring additional network round trips on each access.
Spell for Teams allows you to create machine types with arbitrarily sized disks. So when working with large datasets, make sure you pick a disk large enough to fit all of the data that you will need into SpellFS, if possible.
Note that some space on the device is reserved for the OS, utilities, and run environment Docker image; this space is not included in the cache size calculation. You can get an accurate assessment of actual total space available, and therefore cache size, using the df system utility.
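The sizing rule above can be checked programmatically as well. Python's shutil.disk_usage reports the same figures df does; the mount point argument here is illustrative:

```python
# Sketch of the stated cache sizing rule: half of the free disk space
# at machine initialization time.
import shutil

def max_cache_bytes(path: str = "/") -> int:
    free = shutil.disk_usage(path).free  # bytes currently free, as df reports
    return free // 2

print(f"{max_cache_bytes() / 1e9:.1f} GB available to the SpellFS cache")
```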
Second, note that SpellFS loads every resource included in the mount. To ensure that it caches only the data you need, mount only what you need. For example, if you are training a model on s3://my-bucket/share/my-dataset, mount that path, not the broader s3://my-bucket/share/. In the latter case SpellFS will also index s3://my-bucket/share/other-dataset, s3://my-bucket/share/pictures-from-the-office-party, etcetera.
Third, note that SpellFS indexes resources in lexical (alphabetical) order. If you don't need to shuffle your data, you may see some performance benefit from having your script read it in lexical order as well.
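Matching the prefetcher's order is straightforward: iterate the mount directory sorted. A small sketch (the directory path is illustrative):

```python
# Read mounted files in lexical (alphabetical) order, matching the order
# SpellFS indexes them in, so reads tend to hit already-cached files.
import os

def lexical_file_order(mount_dir: str) -> list:
    """File names under mount_dir in lexical order."""
    return sorted(os.listdir(mount_dir))

# Typical usage in a training script:
# for name in lexical_file_order("/mnt/foo"):
#     process(os.path.join("/mnt/foo", name))
```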