Single Image Human Shape Reconstruction

In this short tutorial we are going to take a look at PIFuHD, a model that reconstructs a 3D human shape from a single image. We will use a Jupyter Lab pipeline in a Spell workspace to turn some custom images into 3D models using the pre-trained model from the PIFuHD GitHub repo (from the paper "Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization").

Sign up for a free Spell account and get $10 worth of GPU credits you can use towards this tutorial.


We will create and use a Spell workspace for this project. Go to the workspaces page, create a new workspace with a suitable name, and set the GitHub URL to point to the PIFuHD repo.

On the next page we will fill out our other environment configuration details: the machine type (let's use a K80 GPU), framework (let's use default), pip requirements (we will need the scikit-image, tqdm, and pycocotools packages), and apt requirements (we will need ffmpeg).

Click through to finish setup and drop into the Jupyter Lab demo notebook!

Before we go any further, we need to create a sample_images folder in the root directory. We will upload our test images for the model into this folder.

To test the model, you will need a high-resolution image of a person. The authors offer a few tips for getting better results:

  • Use a high-res image. The model is trained with 1024x1024 images. Use at least 512x512 with fine details. Low-res images and JPEG artifacts may give unsatisfactory results.
  • Use an image with a single person. If the image contains multiple people, reconstruction quality is likely degraded.
  • A front-facing standing pose works best (a fashion pose also works).
  • The entire body should be within the image (missing legs are partially supported).
  • Make sure the input image is well lit. Extremely dark or bright images and strong shadows often create artifacts.
  • A camera angle nearly parallel to the ground is recommended. A high camera position may result in distorted legs or high heels.
  • If the background is cluttered, use a less complex background or try removing it before processing.
  • It's trained with humans only. Anime characters may not work well.
  • Search Twitter for the #pifuhd tag to get a better sense of what succeeds and what fails.
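
The resolution tip is easy to check before uploading anything. Here is a minimal sketch: the function name and the three labels are made up for illustration, while the 512 and 1024 thresholds come from the tips above.

```python
def check_resolution(width, height, min_side=512, trained_side=1024):
    """Classify an image size against the resolution guidance in the tips.

    Hypothetical helper: the labels are illustrative, not part of PIFuHD.
    """
    shortest = min(width, height)
    if shortest < min_side:
        return "too small"   # below the suggested 512x512 minimum
    if shortest < trained_side:
        return "usable"      # meets the minimum, below the training resolution
    return "good"            # at or above the 1024x1024 training resolution
```

If you load the image with OpenCV, img.shape[:2] gives (height, width), so the call would be check_resolution(img.shape[1], img.shape[0]).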

For the purposes of this article, we will use the following high-resolution human pose test image:


But you can use any image you'd like.

Trying it out

First, we need to do some path munging to get our desired input and output file paths.

import os
filename = 'img1.png'
image_path = '/spell/sample_images/%s' % filename
image_dir = os.path.dirname(image_path)
file_name = os.path.splitext(os.path.basename(image_path))[0]
# output paths
obj_path = '/spell/pifuhd/results/pifuhd_final/recon/result_%s_256.obj' % file_name
out_img_path = '/spell/pifuhd/results/pifuhd_final/recon/result_%s_256.png' % file_name
video_path = '/spell/pifuhd/results/pifuhd_final/recon/result_%s_256.mp4' % file_name
video_display_path = '/spell/pifuhd/results/pifuhd_final/result_%s_256_display.mp4' % file_name
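
If you plan to process several images, the same naming pattern can be wrapped in a small helper. This is only a sketch: build_paths and its dictionary keys are hypothetical names, and the result_<name>_<resolution> pattern is simply copied from the paths above.

```python
import os

def build_paths(image_path, recon_dir, resolution=256):
    """Derive the input directory and output file paths for one image.

    Hypothetical helper mirroring the naming pattern used in this tutorial.
    """
    stem = os.path.splitext(os.path.basename(image_path))[0]
    prefix = os.path.join(recon_dir, 'result_%s_%d' % (stem, resolution))
    return {
        'image_dir': os.path.dirname(image_path),
        'obj': prefix + '.obj',  # reconstructed mesh
        'png': prefix + '.png',  # rendered still image
        'mp4': prefix + '.mp4',  # rendered video
    }
```

For example, build_paths('/spell/sample_images/img1.png', '/spell/pifuhd/results/pifuhd_final/recon')['obj'] gives the same value as obj_path above.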

With the paths set up, our next task is to pre-process the image so that it is ready to feed into the model. We are going to download some scripts and a pre-trained pose estimation model to preprocess our image data (note: you can run bash commands in Jupyter Lab using the ! prefix, and we are using that feature here):

!git clone
cd lightweight-human-pose-estimation.pytorch/

The next thing we need to do is crop our example image to fit the expected (square) shape. Here's the function we will use:

import torch
import cv2
import numpy as np
from models.with_mobilenet import PoseEstimationWithMobileNet
from modules.keypoints import extract_keypoints, group_keypoints
from modules.load_state import load_state
from modules.pose import Pose, track_poses
import demo

def get_rect(net, images, height_size):
    """Detect people in each image and save square crop boxes to <image>_rect.txt."""
    net = net.eval()
    stride = 8
    upsample_ratio = 4
    num_keypoints = Pose.num_kpts
    for image in images:
        rect_path = image.replace('.%s' % (image.split('.')[-1]), '_rect.txt')
        img = cv2.imread(image, cv2.IMREAD_COLOR)
        heatmaps, pafs, scale, pad = demo.infer_fast(net, img, height_size, stride, upsample_ratio, cpu=False)

        total_keypoints_num = 0
        all_keypoints_by_type = []
        for kpt_idx in range(num_keypoints):  # 19th for bg
            total_keypoints_num += extract_keypoints(heatmaps[:, :, kpt_idx], all_keypoints_by_type, total_keypoints_num)

        pose_entries, all_keypoints = group_keypoints(all_keypoints_by_type, pafs, demo=True)
        # map keypoints from network-input coordinates back to the original image
        for kpt_id in range(all_keypoints.shape[0]):
            all_keypoints[kpt_id, 0] = (all_keypoints[kpt_id, 0] * stride / upsample_ratio - pad[1]) / scale
            all_keypoints[kpt_id, 1] = (all_keypoints[kpt_id, 1] * stride / upsample_ratio - pad[0]) / scale

        rects = []
        for n in range(len(pose_entries)):
            if len(pose_entries[n]) == 0:
                continue
            pose_keypoints = np.ones((num_keypoints, 2), dtype=np.int32) * -1
            valid_keypoints = []
            for kpt_id in range(num_keypoints):
                if pose_entries[n][kpt_id] != -1.0:  # keypoint was found
                    pose_keypoints[kpt_id, 0] = int(all_keypoints[int(pose_entries[n][kpt_id]), 0])
                    pose_keypoints[kpt_id, 1] = int(all_keypoints[int(pose_entries[n][kpt_id]), 1])
                    valid_keypoints.append([pose_keypoints[kpt_id, 0], pose_keypoints[kpt_id, 1]])
            valid_keypoints = np.array(valid_keypoints)

            if pose_entries[n][10] != -1.0 or pose_entries[n][13] != -1.0:
                # at least one of keypoints 10 and 13 was found: box around all keypoints
                pmin = valid_keypoints.min(0)
                pmax = valid_keypoints.max(0)
                center = (0.5 * (pmax[:2] + pmin[:2])).astype(int)
                radius = int(0.65 * max(pmax[0] - pmin[0], pmax[1] - pmin[1]))
            elif pose_entries[n][10] == -1.0 and pose_entries[n][13] == -1.0 and pose_entries[n][8] != -1.0 and pose_entries[n][11] != -1.0:
                # if leg is missing, use pelvis to get cropping
                center = (0.5 * (pose_keypoints[8] + pose_keypoints[11])).astype(int)
                radius = int(1.45 * np.sqrt(((center[None, :] - valid_keypoints) ** 2).sum(1)).max(0))
                center[1] += int(0.05 * radius)
            else:
                # fall back to a square centered on the whole image
                center = np.array([img.shape[1] // 2, img.shape[0] // 2])
                radius = max(img.shape[1] // 2, img.shape[0] // 2)

            x1 = center[0] - radius
            y1 = center[1] - radius
            rects.append([x1, y1, 2 * radius, 2 * radius])

        # one "x y width height" row per detected person
        np.savetxt(rect_path, np.array(rects), fmt='%d')
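
To make the cropping arithmetic easier to follow, here is the full-body case on its own, written in plain Python without the pose network. It is a simplified sketch, not the function the pipeline calls: square_crop is a hypothetical name, and the 0.65 margin is the factor used in get_rect.

```python
def square_crop(points, margin=0.65):
    """Return [x, y, width, height] of a square box around 2D keypoints.

    Hypothetical standalone version of the full-body branch of get_rect.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # center of the keypoints' bounding box, truncated to integers
    cx = int(0.5 * (max(xs) + min(xs)))
    cy = int(0.5 * (max(ys) + min(ys)))
    # half the side length: 65% of the larger bounding-box extent
    radius = int(margin * max(max(xs) - min(xs), max(ys) - min(ys)))
    return [cx - radius, cy - radius, 2 * radius, 2 * radius]
```

Because the radius is 65% of the full extent rather than 50%, the square is larger than the keypoints' bounding box, and its top-left corner can end up at negative coordinates when the person fills most of the frame.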

Now run the preprocessing job:

net = PoseEstimationWithMobileNet()
checkpoint = torch.load('checkpoint_iter_370000.pth', map_location='cpu')
load_state(net, checkpoint)
get_rect(net.cuda(), [image_path], 512)
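
After the job finishes, it can be worth confirming that a crop box was actually written next to the image. The sketch below assumes the file layout produced by get_rect (one "x y width height" row per detected person); rect_path_for and read_rects are hypothetical helper names.

```python
import os

def rect_path_for(image_path):
    """Path of the crop file get_rect writes for an image (<name>_rect.txt)."""
    return os.path.splitext(image_path)[0] + '_rect.txt'

def read_rects(path):
    """Read crop boxes as lists of four ints; an empty list means no person was found."""
    rects = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 4:
                rects.append([int(v) for v in parts])
    return rects
```

Calling read_rects(rect_path_for(image_path)) should return one box for a single-person photo. More than one box suggests the image contains several people, which the tips above advise against.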

Now we have to download the pre-trained PIFuHD model. We will use the download_trained_model bash script, which is already available in the pifuhd directory. First, let's change our directory back to pifuhd.

%cd /spell/pifuhd
!sh ./scripts/

We are finally ready to run the model! Here's how:

!python -m apps.simple_test -r 256 --use_rect -i $image_dir
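
If you would rather launch the same step from Python, for example inside a loop over several image folders, the command can be built as an argument list. This is a sketch under assumptions: build_pifuhd_command is a hypothetical helper, and the only flags it uses are the ones shown in the command above.

```python
import sys

def build_pifuhd_command(image_dir, resolution=256, use_rect=True):
    """Build the argument list for the apps.simple_test run shown above."""
    cmd = [sys.executable, '-m', 'apps.simple_test', '-r', str(resolution)]
    if use_rect:
        cmd.append('--use_rect')  # use the *_rect.txt crop boxes from preprocessing
    cmd += ['-i', image_dir]
    return cmd
```

Passing the list to subprocess.run(cmd, cwd='/spell/pifuhd', check=True) runs it from the repo directory and raises an error if the script exits with a non-zero status.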

Viewing Output

Running apps.simple_test loads the model and runs it on our image, saving the output files to disk. To render the 3D model file, we will use PyTorch3D:

!pip install 'git+'
from lib.colab_util import generate_video_from_obj, set_renderer, video
renderer = set_renderer()
generate_video_from_obj(obj_path, out_img_path, video_path, renderer)
!ffmpeg -i $video_path -vcodec libx264 $video_display_path -y -loglevel quiet

This will generate a video for us:

Here is the result!
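
If the video looks wrong or fails to render, a quick look at the mesh file itself can help narrow down the problem. The sketch below counts vertices and faces in a Wavefront OBJ file by reading it as text. obj_stats is a hypothetical helper and relies only on the standard OBJ convention that vertex lines start with "v " and face lines start with "f ".

```python
def obj_stats(path):
    """Count vertex ('v ') and face ('f ') records in a Wavefront OBJ file."""
    vertices = 0
    faces = 0
    with open(path) as f:
        for line in f:
            if line.startswith('v '):
                vertices += 1
            elif line.startswith('f '):
                faces += 1
    return vertices, faces
```

Calling obj_stats(obj_path) on the reconstruction should return two non-zero counts. A result of (0, 0) means the file does not contain a mesh.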
