In this short tutorial we are going to take a look at the amazing PIFuHD single-image 3D human shape reconstruction model. We will use a Jupyter Lab pipeline in a Spell workspace to turn some custom images into 3D models using the pre-trained model from the PIFuHD GitHub repo (from the paper "Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization").
Sign up for a free Spell account and get $10 worth of GPU credits you can put towards this tutorial.
We will create and use a Spell workspace for this project. Go to the workspaces page, create a new workspace with a suitable name, and set the GitHub URL to point to the PIFuHD repo:
On the next page we will fill out our other environment configuration details: the machine type (let's use a K80 GPU), framework (let's use default), pip requirements (we will need the scikit-image, tqdm, and pycocotools packages), and apt requirements (we will need ffmpeg).
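As an aside, if you ever want to reproduce this environment by hand outside of Spell's configuration screen (assuming a Debian-based machine with pip available), the equivalent manual installs would look roughly like this:

```shell
# Python packages used by the PIFuHD demo pipeline
pip install scikit-image tqdm pycocotools

# ffmpeg is needed later to transcode the turntable video for display
apt-get update && apt-get install -y ffmpeg
```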
Click through to finish set up and drop into the Jupyter Lab demo notebook!
Now, before we go any further, we need to create a sample_images folder in the root directory. We will use this folder to upload some test images for the model.
To test the model, you will need high-resolution images of humans. The authors offer a few tips for getting better results:
- Use a high-resolution image. The model was trained on 1024x1024 images; use at least 512x512 with fine details. Low-resolution images and JPEG artifacts may produce unsatisfactory results.
- Use an image with a single person. If the image contains multiple people, reconstruction quality is likely to degrade.
- A front-facing, standing pose works best (fashion poses also work).
- Make sure the entire body is covered within the image (missing legs are partially supported).
- Make sure the input image is well lit. Extremely dark or bright images and strong shadows often create artifacts.
- A camera angle nearly parallel to the ground is recommended. A high camera position may result in distorted legs or high heels.
- If the background is cluttered, use a less complex background or try removing it with https://www.remove.bg/ before processing.
- The model is trained on humans only, so anime characters may not work well.
- Search Twitter for the #pifuhd tag to get a better sense of what succeeds and what fails.
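Before uploading anything, it can help to verify that an image meets the resolution guidance above. Here is a minimal, dependency-free sketch that reads a PNG's dimensions straight from its file header; the helper names (`png_size`, `meets_resolution_guidance`) are our own, not part of the PIFuHD repo:

```python
import struct

def png_size(path):
    """Read a PNG's pixel dimensions straight from its header.

    A PNG file starts with an 8-byte signature, followed by the IHDR chunk:
    a 4-byte length, the ASCII tag "IHDR", and then the width and height as
    big-endian unsigned 32-bit integers (bytes 16-24 of the file).
    """
    with open(path, 'rb') as f:
        header = f.read(24)
    if header[:8] != b'\x89PNG\r\n\x1a\n':
        raise ValueError('%s is not a PNG file' % path)
    return struct.unpack('>II', header[16:24])

def meets_resolution_guidance(path, min_side=512):
    # The authors suggest at least 512x512 (the model was trained at 1024x1024).
    width, height = png_size(path)
    return min(width, height) >= min_side
```

For JPEG inputs you would need a proper image library (e.g. Pillow's `Image.open(...).size`), but for the PNG case this avoids any extra dependencies.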
For the purposes of this article, we will use the following high-resolution human pose test image:
But you can use any image you'd like.
Trying it out
First, we need to do some path munging to set up our input and output file paths.
```python
import os

filename = 'img1.png'
image_path = '/spell/sample_images/%s' % filename
image_dir = os.path.dirname(image_path)
# splitext returns a (root, ext) tuple, so take the first element
file_name = os.path.splitext(os.path.basename(image_path))[0]

# output paths
obj_path = '/spell/pifuhd/results/pifuhd_final/recon/result_%s_256.obj' % file_name
out_img_path = '/spell/pifuhd/results/pifuhd_final/recon/result_%s_256.png' % file_name
video_path = '/spell/pifuhd/results/pifuhd_final/recon/result_%s_256.mp4' % file_name
video_display_path = '/spell/pifuhd/results/pifuhd_final/result_%s_256_display.mp4' % file_name
```
Once that is done, our next task is to pre-process the image to get it ready for the model. We are going to download some scripts and a pre-trained pose estimation model to preprocess our image data (note: you can run bash commands in Jupyter Lab using the ! prefix, and we are using that feature here):
```python
!git clone https://github.com/Daniil-Osokin/lightweight-human-pose-estimation.pytorch.git
%cd lightweight-human-pose-estimation.pytorch/
!wget https://download.01.org/opencv/openvino_training_extensions/models/human_pose_estimation/checkpoint_iter_370000.pth
```

Note that we use `%cd` rather than `!cd` here: a `!` command runs in a throwaway subshell, so it would not actually change the notebook's working directory.
The next thing we need to do is crop our example image to fit the expected (square) shape. Here's the function we will use:
```python
import torch
import cv2
import numpy as np

from models.with_mobilenet import PoseEstimationWithMobileNet
from modules.keypoints import extract_keypoints, group_keypoints
from modules.load_state import load_state
from modules.pose import Pose
import demo

def get_rect(net, images, height_size):
    net = net.eval()

    stride = 8
    upsample_ratio = 4
    num_keypoints = Pose.num_kpts
    for image in images:
        # the crop rectangle is saved next to the image as <name>_rect.txt
        rect_path = image.replace('.%s' % (image.split('.')[-1]), '_rect.txt')
        img = cv2.imread(image, cv2.IMREAD_COLOR)
        heatmaps, pafs, scale, pad = demo.infer_fast(
            net, img, height_size, stride, upsample_ratio, cpu=False)

        total_keypoints_num = 0
        all_keypoints_by_type = []
        for kpt_idx in range(num_keypoints):  # 19th for bg
            total_keypoints_num += extract_keypoints(
                heatmaps[:, :, kpt_idx], all_keypoints_by_type, total_keypoints_num)

        pose_entries, all_keypoints = group_keypoints(
            all_keypoints_by_type, pafs, demo=True)
        # map keypoints back into the original image's coordinate system
        for kpt_id in range(all_keypoints.shape[0]):
            all_keypoints[kpt_id, 0] = (all_keypoints[kpt_id, 0] * stride / upsample_ratio - pad[1]) / scale
            all_keypoints[kpt_id, 1] = (all_keypoints[kpt_id, 1] * stride / upsample_ratio - pad[0]) / scale

        rects = []
        for n in range(len(pose_entries)):
            if len(pose_entries[n]) == 0:
                continue
            pose_keypoints = np.ones((num_keypoints, 2), dtype=np.int32) * -1
            valid_keypoints = []
            for kpt_id in range(num_keypoints):
                if pose_entries[n][kpt_id] != -1.0:  # keypoint was found
                    pose_keypoints[kpt_id, 0] = int(all_keypoints[int(pose_entries[n][kpt_id]), 0])
                    pose_keypoints[kpt_id, 1] = int(all_keypoints[int(pose_entries[n][kpt_id]), 1])
                    valid_keypoints.append([pose_keypoints[kpt_id, 0], pose_keypoints[kpt_id, 1]])
            valid_keypoints = np.array(valid_keypoints)

            if pose_entries[n][10] != -1.0 or pose_entries[n][13] != -1.0:
                # at least one ankle found: crop around the full skeleton
                pmin = valid_keypoints.min(0)
                pmax = valid_keypoints.max(0)
                center = (0.5 * (pmax[:2] + pmin[:2])).astype(int)
                radius = int(0.65 * max(pmax[0] - pmin[0], pmax[1] - pmin[1]))
            elif pose_entries[n][10] == -1.0 and pose_entries[n][13] == -1.0 \
                    and pose_entries[n][8] != -1.0 and pose_entries[n][11] != -1.0:
                # if legs are missing, use the pelvis to get the cropping
                center = (0.5 * (pose_keypoints[8] + pose_keypoints[11])).astype(int)
                radius = int(1.45 * np.sqrt(((center[None, :] - valid_keypoints) ** 2).sum(1)).max(0))
                center[1] += int(0.05 * radius)
            else:
                # fall back to a square crop around the image center
                center = np.array([img.shape[1] // 2, img.shape[0] // 2])
                radius = max(img.shape[1] // 2, img.shape[0] // 2)

            x1 = center[0] - radius
            y1 = center[1] - radius
            rects.append([x1, y1, 2 * radius, 2 * radius])

        np.savetxt(rect_path, np.array(rects), fmt='%d')
```
Now run the preprocessing job:
```python
net = PoseEstimationWithMobileNet()
checkpoint = torch.load('checkpoint_iter_370000.pth', map_location='cpu')
load_state(net, checkpoint)
get_rect(net.cuda(), [image_path], 512)
```
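When this finishes, `get_rect` saves its result next to the input image as `img1_rect.txt`, with one `x y w h` crop rectangle per detected person (that is what the `np.savetxt(..., fmt='%d')` call at the end of the function writes out). If you want to inspect that file, a small sketch like the following will do; `read_rects` is our own illustrative helper, not part of either repo:

```python
def read_rects(rect_path):
    """Parse a PIFuHD-style _rect.txt file into (x, y, w, h) tuples."""
    rects = []
    with open(rect_path) as f:
        for line in f:
            # each row is four space-separated integers: x y width height
            x, y, w, h = (int(v) for v in line.split())
            rects.append((x, y, w, h))
    return rects
```

Since the crop is always square, you should see `w == h` in every row; if the file is empty, no person was detected and the model will have nothing sensible to reconstruct.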
Now we have to download the pre-trained PIFuHD model. We are going to use the download_trained_model bash script, which is already included in the pifuhd directory. Let's change our directory back to pifuhd and run it:
```python
%cd /spell/pifuhd
!sh ./scripts/download_trained_model.sh
```
We are finally ready to run the model! Here's how:
```python
!python -m apps.simple_test -r 256 --use_rect -i $image_dir
```
This will load the model and run it on our image, saving the output files to disk. To render the resulting 3D model file, we will use PyTorch3D:
```python
!pip install 'git+https://github.com/facebookresearch/pytorch3d.git@stable'
```
```python
from lib.colab_util import generate_video_from_obj, set_renderer, video

renderer = set_renderer()
generate_video_from_obj(obj_path, out_img_path, video_path, renderer)

# transcode to H.264 so the video displays inline in the notebook
!ffmpeg -i $video_path -vcodec libx264 $video_display_path -y -loglevel quiet
video(video_display_path)
```
This will generate a video for us:
Here is the result!
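As a final sanity check, the .obj file PIFuHD writes is plain-text Wavefront OBJ, so you can confirm the reconstruction produced a real mesh even without a renderer. `obj_stats` below is our own small helper, not part of the repo:

```python
def obj_stats(path):
    """Count vertices and faces in a Wavefront OBJ file."""
    n_vertices = n_faces = 0
    with open(path) as f:
        for line in f:
            if line.startswith('v '):    # vertex position line: "v x y z"
                n_vertices += 1
            elif line.startswith('f '):  # face line: "f i j k"
                n_faces += 1
    return n_vertices, n_faces
```

Calling `obj_stats(obj_path)` on the result should report a substantial number of vertices and faces; counts near zero suggest the crop or the input image went wrong.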