An introduction to AutoML with Ludwig

Ludwig is an open-source AutoML toolkit originally developed internally at Uber. It was open sourced in February 2019 and is presently under incubation with the Linux Foundation.

As an open-source AutoML tool, Ludwig competes with closed-source offerings like Google AutoML. While there are other open-source AutoML solutions—AutoGluon, which is used by AWS SageMaker, springs to mind—Ludwig is arguably the best-known and most mature of the bunch.

This blog post is an introduction to Ludwig. We'll talk about the basics of AutoML, introduce Ludwig, and demo how it works. You can follow along in code by checking out the GitHub repository.

How it works: training

AutoML is often marketed as "code-free machine learning", which is a pretty good (albeit marketing-y) summary of what Ludwig aspires to be: a way to build machine learning models without writing code.

For demonstration purposes, we will use the Wine Reviews dataset on Kaggle and try to predict the (numerical) points value, a score out of 100, from the text of the review, i.e. the contents of the description field in the dataset.

Ludwig abstracts away all of the details (and decisions) involved in choosing the model architecture, encoders, decoders, and training/evaluation loop. Ludwig is also declarative: instead of writing code explaining how the model works, you write a YAML configuration file stating what the model should do. It's much like SQL, but for machine learning. As a concrete example, here's a YAML configuration (which we'll revisit later in this article):

input_features:
    -
        name: description
        type: text

output_features:
    -
        name: points
        type: numerical

This configuration file specifies a model which takes a dataset containing a description field of type text as input, and returns a points field of numerical type as output.

Notice that we're not providing any details whatsoever about the model itself. We're deferring any and all decisions about the model architecture to Ludwig. To train this model, we execute the following ludwig train CLI command:

$ ludwig train \
  --experiment_name "wine_reviews_experiment" \
  --model_name "wine_reviews_model" \
  --config_file "../datasets/wine_reviews/cfg.yaml" \
  --dataset "/mnt/wine-reviews/winemag-data_first150k.csv"

When it comes to data, any file format readable by the venerable pandas library can be used as input for tabular learning tasks. Ludwig can work with audio and image data too.

At runtime, Ludwig looks at the combination of inputs and outputs, and devises a deep neural network model architecture which is best suited for the task. It trains that model and saves the resultant model weights to disk (along with a whole heap of metadata and training statistics files—we'll revisit these later). Look ma, no code!

Here are some logs from executing this command:

╒══════════╕
│ TRAINING │
╘══════════╛

Epoch   1
Training:   0%|                                         | 0/826 [00:00<?, ?it/s]2020-12-22 19:30:27.982458: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2020-12-22 19:30:28.002259: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2499995000 Hz
Training: 100%|███████████████████████████████| 826/826 [09:21<00:00,  1.47it/s]
Evaluation train: 100%|███████████████████████| 826/826 [02:11<00:00,  6.28it/s]
Evaluation vali : 100%|███████████████████████| 118/118 [00:18<00:00,  6.25it/s]
Evaluation test : 100%|███████████████████████| 237/237 [00:38<00:00,  6.21it/s]
Took 12m 30.9954s
╒══════════╤════════╤═════════╤══════════════════════╤═══════════════════════╤════════╕
│ points   │   loss │   error │   mean_squared_error │   mean_absolute_error │     r2 │
╞══════════╪════════╪═════════╪══════════════════════╪═══════════════════════╪════════╡
│ train    │ 3.8799 │  0.9473 │               3.8799 │                1.5428 │ 0.6258 │
├──────────┼────────┼─────────┼──────────────────────┼───────────────────────┼────────┤
│ vali     │ 4.4000 │  0.9755 │               4.4000 │                1.6391 │ 0.5771 │
├──────────┼────────┼─────────┼──────────────────────┼───────────────────────┼────────┤
│ test     │ 4.4757 │  0.9776 │               4.4757 │                1.6382 │ 0.5691 │
╘══════════╧════════╧═════════╧══════════════════════╧═══════════════════════╧════════╛
╒════════════╤════════╕
│ combined   │   loss │
╞════════════╪════════╡
│ train      │ 3.8786 │
├────────────┼────────┤
│ vali       │ 4.6211 │
├────────────┼────────┤
│ test       │ 4.8027 │
╘════════════╧════════╛
Validation loss on combined improved, model saved

[...]

Epoch  14
Training: 100%|███████████████████████████████| 826/826 [09:09<00:00,  1.50it/s]
Evaluation train: 100%|███████████████████████| 826/826 [02:09<00:00,  6.40it/s]
Evaluation vali : 100%|███████████████████████| 118/118 [00:18<00:00,  6.37it/s]
Evaluation test : 100%|███████████████████████| 237/237 [00:36<00:00,  6.42it/s]
Took 12m 13.7393s
╒══════════╤════════╤═════════╤══════════════════════╤═══════════════════════╤════════╕
│ points   │   loss │   error │   mean_squared_error │   mean_absolute_error │     r2 │
╞══════════╪════════╪═════════╪══════════════════════╪═══════════════════════╪════════╡
│ train    │ 1.3459 │ -0.9042 │               1.3459 │                0.9710 │ 0.8709 │
├──────────┼────────┼─────────┼──────────────────────┼───────────────────────┼────────┤
│ vali     │ 2.7907 │ -0.8918 │               2.7907 │                1.2914 │ 0.7318 │
├──────────┼────────┼─────────┼──────────────────────┼───────────────────────┼────────┤
│ test     │ 2.8863 │ -0.8765 │               2.8863 │                1.2986 │ 0.7224 │
╘══════════╧════════╧═════════╧══════════════════════╧═══════════════════════╧════════╛
╒════════════╤════════╕
│ combined   │   loss │
╞════════════╪════════╡
│ train      │ 1.3477 │
├────────────┼────────┤
│ vali       │ 2.9253 │
├────────────┼────────┤
│ test       │ 2.9301 │
╘════════════╧════════╛
Last improvement of combined validation loss happened 5 epochs ago

EARLY STOPPING due to lack of validation improvement, it has been 5 epochs since last validation improvement

Best validation model epoch: 9
Best validation model loss on validation set combined: 2.547654151916504
Best validation model loss on test set combined: 2.704821825027466

Finished: wine_reviews_initial_0_experiment_wine_reviews_initial_0_model
Saved to: results/wine_reviews_initial_0_experiment_wine_reviews_initial_0_model_1

By default, Ludwig uses a random 70-10-20 train-validation-test split and trains for up to 100 epochs, with early stopping triggered after 5 epochs without improvement in the validation loss. The batch size defaults to 128, and the Adam optimizer is used with a learning rate of 0.001; no learning rate annealing or decay is applied. Training is interruptible and resumable via checkpointing.
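
For reference, these defaults map onto the training section of the config, so you can pin or change any of them explicitly. The following is a sketch using the Ludwig 0.3-era parameter names; check the docs for your version before copying it:

training:
    batch_size: 128
    epochs: 100
    early_stop: 5
    learning_rate: 0.001
    optimizer:
        type: adam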

Under the hood, a Ludwig model consists of three parts. First, a set of encoders translates the raw data, column by column, into a neural-network-friendly format. For example, the text encoder tokenizes the input text (optionally lemmatizing it, depending on the tokenizer chosen) and stores the result in matrix form. Then, a combiner (really just a concat layer, with optional fully connected layers) merges the outputs of the individual encoders into a single record. Finally, a decoder takes the combined record and produces the computed result.
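
Each of these pieces can also be chosen explicitly in the config rather than left to Ludwig's defaults. As a sketch (the encoder and combiner names below follow the Ludwig 0.3-era docs and are illustrative, not prescriptive), we could select a particular text encoder and add a fully connected layer to the combiner:

input_features:
    -
        name: description
        type: text
        encoder: parallel_cnn

combiner:
    type: concat
    num_fc_layers: 1

output_features:
    -
        name: points
        type: numerical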

The Ludwig docs include a helpful diagram visualizing the full set of supported input and output feature types.

Completing a training run writes the model weights to disk. The predict or evaluate CLI commands can then be pointed at this saved model to batch score it on new data. A serve command is also available; it spins up a basic HTTP POST endpoint encapsulating your model for you.
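
For example, batch scoring and serving the trained model might look something like the following (the new_reviews.csv path is hypothetical, and exact flags may vary between Ludwig versions; see ludwig predict --help and ludwig serve --help):

$ ludwig predict \
  --model_path results/wine_reviews_initial_0_experiment_wine_reviews_initial_0_model_1/model \
  --dataset "/mnt/wine-reviews/new_reviews.csv" \
  --output_directory predictions/

$ ludwig serve \
  --model_path results/wine_reviews_initial_0_experiment_wine_reviews_initial_0_model_1/model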

The model config file, cfg.yaml, allows you to parameterize most of the important attributes of the model, its layers, and the training loop. For example, you can change the number of fully connected layers with num_fc_layers and fc_size, or change the layers' weights initialization with weights_initializer. For the training loop, you can change the number of epochs, add learning rate plateaus, and so on. For example, the following configuration file uses character tokenization, sets a maximum sequence length of 1024, reduces maximum epochs to 10, enables a learning rate plateau, and changes the validation metric to MAE:

input_features:
    -
        name: description
        type: text
        preprocessing:
            char_tokenizer: characters
            char_sequence_length_limit: 1024

output_features:
    -
        name: points
        type: numerical

training:
    epochs: 10
    reduce_learning_rate_on_plateau: 1
    reduce_learning_rate_on_plateau_patience: 3
    validation_metric: mean_absolute_error
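
The layer-level knobs mentioned above (num_fc_layers, fc_size, weights_initializer) live on the combiner and on individual features. As an illustrative sketch, again assuming the Ludwig 0.3-era parameter names, a more heavily parameterized output feature might look like this:

output_features:
    -
        name: points
        type: numerical
        num_fc_layers: 2
        fc_size: 64
        weights_initializer: he_normal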

How it works: visualization

Once you've run ludwig train, the next command you're likely to run is ludwig visualize. Ludwig provides a large number of different model performance plots. The standard-bearers of the genre are all there: learning curves, confusion matrices, ROC curves, and the like. Ludwig writes some JSON files to disk; pass the relevant subset of these files to the command, and get a plot back out:

$ ludwig visualize --visualization learning_curves \
  --output_feature_name points \
  --training_statistics results/wine_reviews_initial_0_experiment_wine_reviews_initial_0_model_1/training_statistics.json \
  --output_directory results/visualizations/ \
  --file_format png \
  --model_names Model1

Ludwig also writes TensorBoard event files during training, so you can explore the same information by pointing TensorBoard at the model's results directory:

$ tensorboard --logdir path/to/ludwig_model/

Advanced features

Ludwig has built-in support for hyperparameter search, via Hyperopt, and distributed training, via Horovod. These two features are interesting enough to warrant their own article, so we won't cover them here. 😅

When should you use AutoML?

AutoML tools (like Ludwig) are pretty magical—they take machine learning to the highest possible level of abstraction.

This results in some very important practical tradeoffs. You can think of all deep learning tools as existing on a spectrum of complexity. If we were to sketch this spectrum using the PyTorch ecosystem, it would look something like the following:

Models written in pure PyTorch are the most complex. A model like ours would probably require 100 lines or so of PyTorch code, 100 lines that can only be written by someone familiar with both PyTorch and the task at hand. Using PyTorch Lightning would allow us to get rid of most of the training boilerplate, reducing the total code length to 30 or 40 lines. Ludwig does it in 6 lines of YAML.

The flip side of complexity in code is the complexity of the ideas which that code may express. While Ludwig comes with extensive parameterization options, you are ultimately stuck with whatever set of architectures Ludwig provides for the given task.

To give a concrete example, consider learning rate schedulers. Ludwig provides learning rate plateauing and learning rate warmup, but it doesn't offer cosine annealing or one-cycle learning rate schedulers. If you want to use one of the latter two learning rate scheduling techniques, and you're using Ludwig, you're simply out of luck!

PyTorch Lightning provides more room, and pure PyTorch the most room, to express these ideas.

This raises the question of model performance. Given a sufficiently large time budget, a hand-tuned model built in a lower-level framework by an experienced practitioner will ultimately beat out a code-generated model in performance (at a Kaggle Days San Francisco 2019 hackathon, a couple of teams of Kagglers were able to outperform a Google AutoML model after eight hours of work). In other words, Ludwig will give you a good benchmark model in O(minutes), but an experienced practitioner can probably beat it in O(days) of work.

Hence, Ludwig is best applied to tasks where (1) the problem is well-defined and not likely to change, and (2) engineering time is at a premium but raw performance is not. It's a great fit for rapid prototyping in particular. Ludwig makes throwing neural nets at a dataset as easy as it has long been to throw scikit-learn models at it, a trait that's extremely useful during the initial discovery or mock-up phase of a project.
