Learning to Colorize Hand-drawn Animation

Using Spell for MLOps at Cadmium

Cadmium is a new creative tool for animation that supercharges the clean-up pipeline for animators. We use ML to propagate changes across frames automatically, so animators only have to create a few keyframes instead of changing every frame by hand. 

Over the past 2 years, the Cadmium team has been a heavy user of Spell -- we’ve run thousands of experiments on the platform (totaling over 5 years of machine time!), iterated over 3 model architectures, and seamlessly on-boarded multiple new engineers. The ability to easily run experiments in the cloud and analyze model performance in a central, persistent place was the primary selling point for us; prior to using Spell, we relied on either local GPUs or a custom-built set of scripts to run our experiments in the cloud. Using Spell for our experiment infrastructure freed up precious engineering time and allowed us to analyze and collaborate on our experiments more effectively, ultimately enabling faster research progress.

Background

In 2018 we published a NeurIPS paper called Thinking Between the Lines: Guided 2D Animation with Generative Adversarial Networks, in which we trained a Pix2pix-style model to directly colorize animation frames from hand-drawn line images and a single color reference image. Although our early results showed promise, this model lacked an explicit module to match regions in the line image with their corresponding region in the reference image; it mostly relied on memorizing specific characters. As a result, the model didn’t generalize well to new characters and performed poorly on sequences with large transformations.

A diagram of the original Pix2pix-style model:

In this diagram, a single fully-convolutional generator network takes as input a target line image, a reference color image, and the generated color image from the previous frame. During training, we optimized an adversarial objective combining the following (sketched after the list below):

  • L2 loss with ground truth
  • A "smoothness" discriminator network that evaluates the temporal consistency of the produced images
  • A "style" discriminator network that evaluates how aligned the generated image is with the color style of the reference image

Learning Visual Correspondence

Inspired by various papers on video segmentation and colorization, we shifted our focus to directly learning visual correspondence between animation frames. At the core of this approach is a non-local attention module, which approximates a correspondence matrix over every pair of pixels in the reference and target frames.
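
In generic terms, a non-local block of this kind computes a pairwise similarity between every target and reference location and normalizes it into a soft correspondence matrix. The following is a minimal sketch of that idea, not our exact implementation (dot-product similarity and the scaling factor are assumptions):

    import tensorflow as tf

    def correspondence_matrix(target_feats, ref_feats):
        # target_feats: [B, Ht, Wt, C] features of the target line image
        # ref_feats:    [B, Hr, Wr, C] features of the reference image
        # Returns a [B, Ht*Wt, Hr*Wr] matrix of soft correspondences.
        b = tf.shape(target_feats)[0]
        c = target_feats.shape[-1]
        t = tf.reshape(target_feats, [b, -1, c])   # [B, Nt, C]
        r = tf.reshape(ref_feats, [b, -1, c])      # [B, Nr, C]

        # Dot-product similarity between every target and reference pixel.
        sim = tf.matmul(t, r, transpose_b=True) / tf.sqrt(tf.cast(c, t.dtype))

        # Softmax over reference locations: each target pixel gets a
        # distribution over possible matches in the reference frame.
        return tf.nn.softmax(sim, axis=-1)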

This approach has several advantages over the Pix2pix-style model:

  • Matching regions between the target and reference frames are modeled explicitly and are not limited to local receptive fields
  • Visual correspondence is a building block for much more than just coloring -- it’s also the backbone of things like optical flow, texture transfer, and in-betweening

A diagram of the pixel-level visual correspondence model is shown below:

In this architecture, the reference and target line images are passed through a pre-trained Illustration2Vec network. The resulting features are then used by the non-local block to compute a correspondence matrix across pixels in the reference/target images. We then warp the reference color image using the correspondence matrix to produce the final output color image.
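
The warping step can then be expressed as a correspondence-weighted average of reference colors. Here is a small sketch of that step, assuming the correspondence_matrix helper from the earlier snippet (shapes and names are illustrative):

    import tensorflow as tf

    def warp_reference_colors(corr, ref_colors, target_hw):
        # corr:       [B, Nt, Nr] soft correspondences (rows sum to 1)
        # ref_colors: [B, Hr, Wr, 3] reference color image (assumed float)
        # target_hw:  (Ht, Wt) spatial size of the target frame
        b = tf.shape(ref_colors)[0]
        ref_flat = tf.reshape(ref_colors, [b, -1, 3])   # [B, Nr, 3]

        # Each target pixel's color is the correspondence-weighted average
        # of the reference pixels it attends to.
        warped = tf.matmul(corr, ref_flat)              # [B, Nt, 3]
        return tf.reshape(warped, [b, target_hw[0], target_hw[1], 3])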

Per-pixel correspondences can be visualized via the magnitudes in the correspondence matrix, showing the areas where the model is most confident it has found a match:

How we built it

The pixel-level correspondence model uses the following tools:

  • TensorFlow 2.x with mixed-precision training on Nvidia T4/V100 GPUs
  • TFRecords with sequences of images stored as tf.train.SequenceExample and then read in using tf.data.Dataset (see the sketch after this list)
  • High-level training/evaluation metrics sent to the Spell Metrics API for easy access
  • Low-level model-debugging metrics sent to TensorBoard and visualized as histograms/distributions
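
As an illustration of how the SequenceExample records feed into tf.data, here is a hedged sketch of the input pipeline; the feature names, PNG encoding, and file path are assumptions rather than our real schema:

    import tensorflow as tf

    # Hypothetical feature names -- the real schema differs.
    SEQUENCE_FEATURES = {
        "line_frames": tf.io.FixedLenSequenceFeature([], tf.string),
        "color_frames": tf.io.FixedLenSequenceFeature([], tf.string),
    }

    def parse_sequence_example(serialized):
        # Each record holds one short animation sequence of encoded frames.
        _, sequence = tf.io.parse_single_sequence_example(
            serialized, sequence_features=SEQUENCE_FEATURES)
        decode = lambda frames: tf.map_fn(
            lambda f: tf.io.decode_png(f, channels=3),
            frames, fn_output_signature=tf.uint8)
        return decode(sequence["line_frames"]), decode(sequence["color_frames"])

    dataset = (
        tf.data.TFRecordDataset(["train.tfrecord"])  # placeholder path
        .map(parse_sequence_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        .prefetch(tf.data.experimental.AUTOTUNE)
    )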

At this point in the lifecycle of our company, we had just gone through the Betaworks Synthetic Camp accelerator (with fellow Spell customer Resemble.ai) and had free credits with both AWS and GCP. Since Spell is cloud-agnostic, we could easily switch between cloud providers and get the most out of those credits.

For dependency management, we use Conda locally and pass our Conda environment files to Spell via its conda_file option for managed dependencies. We write assets (images and model checkpoints) to disk, which Spell automatically syncs to S3 so they can be accessed centrally via the Spell UI.

We make heavy use of hyperparameter grid searches to quickly run ablations across multiple experiment configurations, and we store our most commonly run hyperparameter search commands as bash scripts in a scripts folder in our git repo. Team members fork off of these scripts at will when running new experiments, and over time we update the common configurations as we reach a consensus on which model improvements to keep. We then store links to and quick descriptions of these experiments in Notion so they can be shared with the team.

Note: Spell now supports a rich feature set around using JSON/YAML command files instead of having everything defined in the run/hyperparameter search command. This is a great way to manage more complex configurations!

Transformers are eating Computer Vision

Our team released the pixel-level visual correspondence model in our desktop application in early 2020. While animators were enthusiastic about our tool, we found that the model still had several shortcomings stemming from the fact that predictions are made at the pixel level and are thus heavily constrained by the forward-pass memory limits of the GPU. Combined with the fact that animators usually work at HD or double-HD resolution (and sometimes even higher), this meant that downsampling images to meet our memory constraints removed important details, which hurt performance. We realized that the memory cost of comparing every single pixel location would not scale, and so we set off once again to find a better architecture.
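
Some back-of-the-envelope arithmetic makes the scaling problem concrete; the 8x downsampling factor and float32 storage below are illustrative assumptions, not our actual settings:

    # Back-of-the-envelope memory cost of a dense pixel correspondence matrix.
    def corr_matrix_gib(height, width, downsample=8, bytes_per_entry=4):
        n = (height // downsample) * (width // downsample)  # pixel locations per frame
        return n * n * bytes_per_entry / 2**30              # one N x N float32 matrix

    print(corr_matrix_gib(1080, 1920))  # HD, 8x downsampled       -> ~3.9 GiB
    print(corr_matrix_gib(2160, 3840))  # double-HD (4K), 8x down  -> ~63 GiB

Even with aggressive downsampling, a single correspondence matrix takes gigabytes per example, and at 4K it exceeds the memory of a T4 or V100 outright, before counting activations and gradients.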

At the same time, we were inspired by advances in applying graph neural networks to computer vision, and felt we could better utilize the pre-existing structure of enclosed regions (segments) within the line image. We designed an architecture that operates directly on segments using a graph neural network, learning the global structure across segments in the image. This is akin to how humans perform colorization -- they look back and forth between the two images, sift through tentative matching segments, examine each one, and look for contextual cues that help disambiguate the true match from other self-similarities.

In the above diagram, we use a CNN backbone similar to the pixel-level model's, but we pool the convolutional features within each segment before passing them to a graph neural network. The graph neural network builds a segment-level correspondence matrix, which we use to produce the final colorized image.
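
To make the idea concrete, here is a minimal sketch of segment pooling followed by a similarity-based segment correspondence; the mean pooling, the single-image shapes, and jumping straight from pooled features to a softmax (rather than through the full graph neural network) are all simplifications for illustration:

    import tensorflow as tf

    def segment_correspondence(feats, seg_ids, ref_feats, ref_seg_ids,
                               num_segs, num_ref_segs):
        # feats / ref_feats: [H, W, C] CNN features; seg_ids: [H, W] integer segment labels.
        c = feats.shape[-1]

        # Pool convolutional features within each enclosed region (segment).
        tgt = tf.math.unsorted_segment_mean(
            tf.reshape(feats, [-1, c]), tf.reshape(seg_ids, [-1]), num_segs)              # [Nt, C]
        ref = tf.math.unsorted_segment_mean(
            tf.reshape(ref_feats, [-1, c]), tf.reshape(ref_seg_ids, [-1]), num_ref_segs)  # [Nr, C]

        # In the real model, a graph neural network refines these segment features
        # using the drawing's global structure; here we go straight to a
        # similarity-based correspondence over segments.
        sim = tf.matmul(tgt, ref, transpose_b=True) / tf.sqrt(tf.cast(c, tf.float32))
        return tf.nn.softmax(sim, axis=-1)  # [Nt, Nr] segment-level correspondences

Because an image typically contains hundreds of segments rather than millions of pixels, the segment-level correspondence matrix is tiny by comparison, which is what lets the model operate at full HD.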

As mentioned previously, we found it useful to visualize distributions of certain features across time to debug stability issues in our attention layers. Below, the distributions of one such feature are shown for each of the runs in a hyperparameter search:
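
In TF 2.x, logging these per-step distributions comes down to tf.summary.histogram; the writer path and metric names below are placeholders:

    import tensorflow as tf

    writer = tf.summary.create_file_writer("logs/attention_debug")  # placeholder path

    def log_attention_stats(attention_weights, step):
        # Record the full distribution of attention weights at this training step,
        # so TensorBoard can render it as histograms/distributions over time.
        with writer.as_default():
            tf.summary.histogram("attention/weights", attention_weights, step=step)
            tf.summary.scalar("attention/max", tf.reduce_max(attention_weights), step=step)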

Results

After stabilizing model training and optimizing our dataset pipeline to handle large batches (which are critical for training transformer models), we finally saw some massive performance gains (more details will be available in our upcoming paper), driven both by the relational intelligence of the graph neural network and by the fact that it can process images at HD resolution.

To get an idea of what the model learns, we can visualize correspondences for individual segments using lines between matching segments:

We can dig even deeper here by looking at the full segment correspondence matrix. The bottom image shows the confidence of each possible match between the reference image (x-axis) and the target image (y-axis). The top two images are the legend that maps each segment to its row/column in the bottom image.

Conclusion

Deep learning is an empirical field, and making research breakthroughs requires a mind-boggling amount of trial and error (in addition to access to large amounts of compute). For these reasons, tooling is incredibly important, and Spell has enabled Cadmium to accelerate our research across the board. Surprisingly, we also used Spell for a variety of use cases we didn’t even realize it could handle, such as running offline data generation as Spell runs and looking back at saved run metrics to catch code regressions. In the future, we’re excited about deploying our models using Model Serving, chaining together multiple commands with Workflows, and improving our testing workflow with GitHub CI/CD actions.

Ready to Get Started?

Create an account in minutes or connect with our team to learn how Spell can accelerate your business.