Using DVC as a lightweight feature store on Spell

The feature store is an idea that has rapidly risen in popularity in the machine learning world. At its core, a feature store is a service for pulling down the data ("features") you need in a well-supported, centralized manner. Feature stores are designed to solve one of the most common pain points of big data organizations everywhere: the tendency for "raw" object storage solutions, like datasets stored in S3, to become a poorly documented, hard-to-understand mess over time.

In this blog post we’ll learn about DVC—an easy-to-use git-like data versioning system popular with users on Spell—and discuss how it can be used as a lightweight feature store on Spell.

You can follow along with the code on GitHub.

DVC basics

DVC purposefully makes extensive use of a git-like API, with commands like dvc pull, dvc fetch, and dvc checkout that work much like their git counterparts. As a result, DVC looks and feels very familiar to anyone who has worked with version control before.

To begin, we'll need some data. For the purposes of this demonstration, we will use a sample from the dataset A Year of Pumpkin Prices on Kaggle (specifically, new-york_9-24-2016_9-30-2017.csv). I downloaded the data to the file new_york_city_pumpkin_prices.csv on my local machine (if you are running this code in a Spell workspace, you can spell upload this file and then mount it into your workspace, as shown below):

import pandas as pd

# preview the first few rows of the pumpkin prices dataset
pd.read_csv("new_york_city_pumpkin_prices.csv").head()

This is a simple dataset of pumpkin prices (Low Price, High Price) by Variety and Item Size sold in New York City in the days prior to Halloween 2016.
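
As an aside, if you are working in a Spell workspace rather than on a local machine, the upload step looks roughly like the following. spell upload is the Spell CLI command for pushing a local file into Spell-managed storage; the exact flags may vary with your CLI version:

$ spell upload new_york_city_pumpkin_prices.csv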

To initialize DVC, run dvc init. To add data to DVC, run dvc add.

$ dvc init
You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

$ dvc add new_york_city_pumpkin_prices.csv
100% Add|██████████████████████████████████████████████|1/1 [00:00,  1.86file/s]

To track the changes with git, run:

    git add new_york_city_pumpkin_prices.csv.dvc .gitignore

Running dvc add creates a new_york_city_pumpkin_prices.csv.dvc file containing the MD5 content hash of the file. This file, which acts as a pointer to the "correct" version of the dataset, will be checked into git version control in place of new_york_city_pumpkin_prices.csv itself:

$ ls
dvc-demo.ipynb                        new_york_city_pumpkin_prices.csv.dvc
new_york_city_pumpkin_prices.csv

$ cat new_york_city_pumpkin_prices.csv.dvc
outs:
- md5: 10ac52bb2b805fe1a9de704d2f5a5be1
  size: 11875
  path: new_york_city_pumpkin_prices.csv

$ cat .gitignore
/new_york_city_pumpkin_prices.csv
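
As the dvc add output suggests, we finish by committing the pointer file and the updated .gitignore to git (the commit message here is just illustrative):

$ git add new_york_city_pumpkin_prices.csv.dvc .gitignore
$ git commit -m "Track new_york_city_pumpkin_prices.csv with DVC"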

Under the hood, dvc add also moves the dataset to a cache file in a content-addressable storage (CAS) filesystem inside the special .dvc directory:

$ ls .dvc/cache/
10/
$ ls .dvc/cache/10/
ac52bb2b805fe1a9de704d2f5a5be1
$ head -n 2 .dvc/cache/10/ac52bb2b805fe1a9de704d2f5a5be1
Commodity Name,City Name,Type,Package,Variety,Sub Variety,Grade,Date,Low Price,High Price,Mostly Low,Mostly High,Origin,Origin District,Item Size,Color,Environment,Unit of Sale,Quality,Condition,Appearance,Storage,Crop,Repack,Trans Mode
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,150,170,150,170,MICHIGAN,,xlge,,,,,,,,,N,

Notice that the version of the file that is stored in this cache directory does not have a human-readable name; instead, its name is its MD5-generated content hash. This allows DVC to de-duplicate datasets with the same contents, protect against corruption, and attach a robust version number to the dataset—all without requiring additional user action or changes in user behavior.
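
You can verify this yourself by hashing the working copy: the digest matches the cache filename. The example below uses GNU md5sum, and its exact output format is assumed; on macOS the equivalent command is md5:

$ md5sum new_york_city_pumpkin_prices.csv
10ac52bb2b805fe1a9de704d2f5a5be1  new_york_city_pumpkin_prices.csv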

Of course, actual user code is unaware of the structure of the DVC cache; e.g. we still read("new_york_city_pumpkin_prices.csv"), not read(".dvc/cache/$SOME_HASH"). How do we do that? By running dvc checkout.

dvc checkout checks all of the *.dvc files for the datasets you need, finds those datasets' cache files, and creates a reflink (a copy-on-write filesystem pointer) from the dataset's original filename to the cache file.

In this way, DVC sets things up such that, at read time, the operating system transparently turns read("new_york_city_pumpkin_prices.csv") into read(".dvc/cache/$SOME_HASH") for us. This way, even though the data is now managed by DVC, your code doesn't need to change to account for it. Clever!
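
For example, if you delete the working copy, re-running dvc checkout restores it from the cache:

$ rm new_york_city_pumpkin_prices.csv
$ dvc checkout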

DVC remotes

So far we've only examined DVC's local workflow. However, DVC's best feature, its "killer app," is that it can be used in a push/pull-based workflow to enable team-wide data reproducibility.

To do this, you configure and use a remote. A remote in DVC is a cloud-based entity—an S3 or GCS bucket, typically—that serves as the central storage space for your project's data (i.e. your DVC cache files).

We can add one using dvc remote add:

$ dvc remote add spell-datasets-share s3://spell-datasets-share/dvc/

This adds a remote with the given name and URL to the project's DVC configuration file, .dvc/config:

['remote "spell-datasets-share"']
    url = s3://spell-datasets-share/dvc/
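
As a convenience, you can also make this the project-wide default remote, so that later commands don't need an explicit --remote flag (dvc remote default is the relevant subcommand):

$ dvc remote default spell-datasets-share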

We can now use dvc push to send that data to S3:

$ dvc push --remote spell-datasets-share

  0% Uploading|                                      |0/1 [00:00<?,     ?file/s]
!
  0%|          |new_york_city_pumpkin_prices.cs0.00/11.6k [00:00<?,        ?B/s]
100%|██████████|new_york_city_pumpkin_pric11.6k/11.6k [00:00<00:00,    93.4kB/s]
1 file pushed

If we take a peek at the spell-datasets-share S3 bucket, we can see the result—a DVC cache directory has been created in the bucket:

$ aws s3 ls s3://spell-datasets-share/dvc/
PRE 10/
$ aws s3 ls s3://spell-datasets-share/dvc/10/
2021-01-26 11:15:46      11875 ac52bb2b805fe1a9de704d2f5a5be1

At this point, anyone with access to this S3 bucket, not just the dataset creator, can get this data onto their local machine. ✨

This is done using dvc pull:

$ rm new_york_city_pumpkin_prices.csv
$ dvc pull --remote spell-datasets-share
A    new_york_city_pumpkin_prices.csv                                    
1 file added
$ head -n 2 new_york_city_pumpkin_prices.csv
Commodity Name,City Name,Type,Package,Variety,Sub Variety,Grade,Date,Low Price,High Price,Mostly Low,Mostly High,Origin,Origin District,Item Size,Color,Environment,Unit of Sale,Quality,Condition,Appearance,Storage,Crop,Repack,Trans Mode
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,150,170,150,170,MICHIGAN,,xlge,,,,,,,,,N,

In other words, in order to get a machine learning project on my machine to look just like the one on yours, all I need to do is:

$ git clone https://github.com/$USERNAME/$REPO
$ cd $REPO
$ dvc pull --remote spell-datasets-share

This feature makes DVC a capable "lightweight" feature store, as I alluded to in the introduction. By combining DVC's push/pull workflow with good remote management best practices, you can effectively track and version all of your projects' data without having to build any specialized tooling.
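
For example, because every version of the dataset is pinned by a .dvc pointer file committed to git, rolling back to an earlier version of the data is just a matter of checking out the older pointer and letting DVC sync the working copy (the commit hash below is a placeholder):

$ git checkout <older-commit> -- new_york_city_pumpkin_prices.csv.dvc
$ dvc pull --remote spell-datasets-share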

However, DVC is not a great fit when working with very large (and typically partitioned) files, e.g. "big data" stored in Dask or Spark. DVC transfers data over the network, which is a poor match at that scale: "big data" pretty much requires computing things at the source rather than moving the data around (this is why Spark cluster management is so important). DVC also ensures cache consistency by calculating a content hash on the source data, which requires a full file scan; that takes on the order of hours once the data reaches hundreds of gigabytes and above.

In truth, the vast majority of machine learning use cases involve data small enough to be version controlled with DVC without any problem. This is why DVC is so popular: it makes data versioning really easy!

Combining DVC with Spell

Spell is a natural fit for DVC. Both are highly opinionated tools that put developer experience front and center; Spell fans are often DVC fans and vice versa.

One great feature of using DVC on Spell is how well DVC works with Spell mounts.

Mounts on Spell are paths within a Spell run, workspace, or model server that host data from an arbitrary S3 or GCS bucket you have access to. For example, here is a mount on a Spell run:

$ spell run --mount s3://my-bucket/path/to/something:/mnt/something \
    cat /mnt/something/a-file.txt

This example mounts the contents of the /path/to/something/ prefix within the s3://my-bucket bucket to /mnt/something/ on the filesystem of the machine running the code. Any code executed by the run will be able to see and interact with this data.

Under the hood, Spell mounts use an internal fork of goofys: an optimized, high-performance Go library for mounting an S3-compatible bucket storage resource to a POSIX filesystem. Every time a machine resource is created on Spell, an asynchronous resource manager is spun up in the background. This resource manager receives the list of mount source-destination pairs as its input, and downloads the contents of those pairs to disk (to learn more, refer to our blog post "Making the most of Spell mounts").

Importantly, the data download occurs in parallel with user code execution.

How is this done? Whenever you try to open a file, one of two things is possible: either the file has already been downloaded, or it hasn't. If the file is already on disk, everything proceeds as normal. If it isn't, Spell uses goofys to transparently serve that file directly off of S3. User code is completely unaffected—you may see longer-than-expected disk reads from time to time, but this is completely transparent to you as an end user.

DVC, by contrast, is completely synchronous. A call to dvc pull blocks until all data is done downloading, effectively splitting your program into two parts: the part where you download your dataset, and the part where you train or execute your model. This is much slower and less efficient than using a Spell mount, particularly when the amount of data you need to download is large—e.g. an image dataset with thousands of images.

Luckily, you can have the best of both worlds: the performance of Spell mounts combined with the data versioning and ease of use of DVC!

Suppose that we are using the s3://spell-datasets-share/dvc/ S3 bucket path as our centralized data storage space. We create a new Spell workspace and mount this path to it as follows:

$ spell jupyter --lab \
    --machine-type cpu \
    --mount s3://spell-datasets-share/dvc/:/mnt/dvc \
    dvc-demo

Next, inside the workspace we run the following command, setting the /mnt/dvc/ directory (which Spell transfers the mounted data into) as our default remote:

$ dvc remote add --default s3-via-spell /mnt/dvc/

Now when we run dvc pull:

$ dvc pull

Instead of downloading the data from the source S3 bucket itself, DVC will read the data from this local path. Any time DVC tries to read a file that hasn't made it to disk yet, goofys will kick in, block the call, pull the data, and release the read once it's in.

This is the best of both worlds because it combines the speed that goofys provides (downloads are multithreaded for performance) with the remote data management features that DVC is so good at.

That's all for now. Happy training!
