Improving Upon Stereo Image Datasets: Holopix50k

Leia Inc. is the leading provider of Lightfield hardware and content services. The team has been using Spell to support its research in Lightfield computer vision. One of its recent publications is Holopix50k: A Large-Scale In-the-wild Stereo Image Dataset. This post gives an overview of that work; the full research paper is available at https://arxiv.org/abs/2003.11172

Holopix50k: A Large-Scale In-the-wild Stereo Image Dataset

The advent of dual-camera mobile phones has made stereo images increasingly important for computer vision. Dual-camera phones capture stereo images that can be processed to create 3-D vision and other enhanced visual effects. The algorithms that process these images are machine learning based and therefore depend on large, high-quality training sets. Until now, however, existing stereo image datasets have been limited in size and quality.

Holopix50k is a new in-the-wild stereo image dataset containing close to 50,000 stereo image pairs contributed by users of the Holopix mobile social platform. Models trained on it have shown significantly improved results on computer vision tasks compared with models trained on existing datasets.

Background

Mobile phones with two or more cameras have enabled new consumer applications such as artificial depth of field, 3-D photography, and AI-based photo optimization. Dual cameras facilitate capturing more information about a scene than a single photo could capture. Several applications can take advantage of this stereo imagery, including Holopix™, and platforms from Facebook and Snap Inc.

Most state-of-the-art methods for handling stereo imagery are deep learning based, so the quality of the training dataset can have a big impact on their performance. Existing datasets fall short: they either cover only a subset of real-world scenarios or were generated in an artificial lab setting. The diversity of the images and the overall size of the dataset are also key to representing in-the-wild mobile photography well.

Holopix50k Dataset

The new dataset, Holopix50k, is the largest in-the-wild stereo image dataset to date, containing 49,368 high-quality stereo image pairs. The dataset was crowd-sourced from the Holopix platform, which is the only major social platform for sharing 3-D photography. When coupled with Lightfield displays, the images can be viewed with added depth and a multi-view parallax effect. If the user doesn’t have a Lightfield display, the multi-view image can be viewed using motion-based animation on regular devices.

A large majority of the images on Holopix™ are captured using the RED Hydrogen One mobile phone, one of the first consumer-grade Lightfield devices. To generate the Holopix50k dataset, 70,000 image pairs were collected and then filtered to eliminate pairs with poor stereo characteristics. One significant issue is vertical disparity: because the cameras are horizontally aligned, a pair should show only horizontal disparity, and any vertical disparity can cause stereo algorithms to fail. The dataset was therefore filtered to remove pairs with vertical disparity.
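
To make the vertical disparity check concrete, here is a minimal sketch in Python. It is not the filtering method used by the Holopix50k authors, which this post does not describe. It assumes that keypoints have already been matched between the left and right images by some feature matcher, and the function names and the 2-pixel threshold are hypothetical choices for illustration.

```python
from statistics import median


def median_vertical_disparity(matches):
    """Median absolute vertical offset, in pixels, between matched keypoints.

    `matches` is a list of ((x_left, y_left), (x_right, y_right)) tuples.
    How the matches are obtained is outside the scope of this sketch.
    """
    if not matches:
        raise ValueError("need at least one matched keypoint pair")
    return median(abs(y_left - y_right) for (_, y_left), (_, y_right) in matches)


def keep_pair(matches, max_vertical_px=2.0):
    """Keep a stereo pair only if its vertical disparity is small.

    The 2-pixel default is an illustrative assumption, not a value
    taken from the Holopix50k paper.
    """
    return median_vertical_disparity(matches) <= max_vertical_px


# Well-aligned pair: keypoints shift horizontally, almost not at all vertically.
aligned = [((100, 50), (90, 50.5)), ((200, 80), (185, 79.5)), ((320, 120), (310, 120))]
# Misaligned pair: every keypoint is also shifted by several pixels vertically.
misaligned = [((100, 50), (90, 58)), ((200, 80), (185, 89)), ((320, 120), (310, 127))]

print(keep_pair(aligned))     # True
print(keep_pair(misaligned))  # False
```

Using the median rather than the mean keeps a few bad keypoint matches from rejecting an otherwise well-aligned pair.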

Dataset Diversity

To assess the diversity of the dataset, it was run through an object detector to determine what types of objects are in the photographs. It found a good mixture of common objects including people, animals, plants, vehicles, furniture, electronics, and food. Although the majority of images were taken in landscape mode, there are also about 4% in portrait orientation, which adds diversity to the dataset in terms of content like stereoscopic selfies.

Holopix50k is at least five times larger than existing popular stereo datasets and performs well on perceptual metrics. It has the highest score on the SR-metric, which indicates that it contains higher-quality images with respect to human visual perception. It scores second highest in entropy, which shows that, on average, the diversity and density of information in the images is high.
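
As an illustration of what an entropy score measures, the sketch below computes the Shannon entropy of a list of grayscale pixel intensities. This is the standard textbook definition applied to an intensity histogram. Whether the paper computes entropy in exactly this way is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy, in bits, of a flat sequence of pixel intensities."""
    if not pixels:
        raise ValueError("need at least one pixel")
    total = len(pixels)
    counts = Counter(pixels)
    # Sum p * log2(1/p) over every intensity value that actually occurs.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


flat_image = [128] * 16          # a single gray level: no information
varied_image = list(range(16))   # 16 distinct, equally likely levels

print(shannon_entropy(flat_image))    # 0.0
print(shannon_entropy(varied_image))  # 4.0
```

An image with more varied intensities has a flatter histogram and therefore a higher entropy, which is why the metric is read as a rough proxy for information density.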

Experimenting with Stereo Super Resolution Algorithms

Stereo super resolution tasks extend the method of super resolution (creating a high-resolution image from low-resolution versions) to the stereo two-image domain. A second low-resolution image is introduced to increase the quality of the super-resolved image. Recently, machine learning based algorithms have been applied to this task.

The current state-of-the-art method for stereo super resolution is PASSRNet, a model originally trained on the Flickr1024 dataset. When it was re-trained on the Holopix50k dataset, the results reproduced fine details and textures much better than the original model did. The large size of the dataset matters: with only 1,000 Holopix50k image pairs, the results did not outperform the Flickr1024 model. As the training sample size increases, however, the results improve because the model generalizes better and avoids over-fitting.

Self-Supervised Monocular Depth Estimation

Monocular depth estimation is a technique that attempts to estimate the depth of a scene from a single image. This is a difficult problem, as image textures don’t directly correspond to depth. Some monocular techniques are self-supervised with a stereo input. Existing datasets, however, have been limited in domain or size and the results have failed to generalize to tasks found in-the-wild. 

When Holopix50k was used to train the Monodepth2 depth estimation model, the resulting model performed better than the one trained on the original dataset (KITTI) on all metrics from two test sets. This demonstrates that the Holopix50k dataset helps a model originally trained on road scenes (KITTI) generalize to entirely new scenarios.

Disparity Models

Disparity estimation algorithms have recently begun to use self-adaptation approaches. A diverse dataset can aid the generalization of these unsupervised techniques. Holopix50k was used to train several disparity models to explore possible practical use cases.

On a stereo disparity estimation network for mobile inference, training on the dataset produced results with sharp edge detail and stereo consistency. On a real-time disparity estimation network optimized for speed, the results had less edge detail but demonstrated feasibility for real-time applications such as camera previews or video calls. On a monocular depth estimation network, training on Holopix50k yielded results that hold up at all degrees of relative depth, from close-up to far away.
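
For readers new to the topic, the sketch below shows how disparity relates to depth for a rectified stereo pair under the usual pinhole camera model: depth = focal length × baseline / disparity. It is a general geometric relation, not code from the Holopix50k work, and the focal length and baseline values are made-up examples rather than the specifications of any real device.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters of a point seen in a rectified stereo pair.

    disparity_px: horizontal shift of the point between the two views, in pixels
    focal_px:     focal length expressed in pixels
    baseline_m:   distance between the two camera centers, in meters
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera: 1000 px focal length, 12 mm baseline.
FOCAL_PX = 1000.0
BASELINE_M = 0.012

for disparity in (24.0, 6.0, 1.5):
    depth = depth_from_disparity(disparity, FOCAL_PX, BASELINE_M)
    print(f"disparity {disparity:5.1f} px -> depth {depth:.2f} m")
```

Because depth is inversely proportional to disparity, nearby objects produce large disparities and distant objects produce very small ones, which is part of why far-away regions are harder to estimate accurately.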

Conclusion

The Holopix50k dataset will continue to be developed in future iterations as the social sharing platform grows. It is currently the largest stereo image dataset collected from social media and contains a wide variety of scenes. Across several stereo image algorithms, models trained on it outperformed those trained on existing datasets. The dataset has many possible uses, and its creators look forward to novel applications of it in stereo learning algorithms. It can be downloaded at https://github.com/LeiaInc/holopix50k
