

We present an analysis-by-synthesis approach for monocular motion capture that learns a volumetric body model and refines the 3D pose estimation of the user in a self-supervised manner.
(Left) Articulated human representation learned from a synthetic dataset and rendered in unseen poses and novel views.
(Right) Pose refinement on the Human3.6M dataset. Faces are blurred for anonymity.


While deep learning has reshaped the classical motion capture pipeline, generative, analysis-by-synthesis elements are still in use to recover fine details if a high-quality 3D model of the user is available. Unfortunately, obtaining such a model for every user a priori is challenging, time-consuming, and limits the application scenarios. We propose a novel test-time optimization approach for monocular motion capture that learns a volumetric body model of the user in a self-supervised manner. To this end, our approach combines the advantages of neural radiance fields with an articulated skeleton representation. Our proposed skeleton embedding serves as a common reference that links constraints across time, thereby reducing the number of required camera views from the traditional dozens of calibrated cameras down to a single uncalibrated one. As a starting point, we employ the output of an off-the-shelf model that predicts the 3D skeleton pose. The volumetric body shape and appearance are then learned from scratch, while jointly refining the initial pose estimate. Our approach is self-supervised and does not require any additional ground-truth labels for appearance, pose, or 3D shape. We demonstrate that our novel combination of a discriminative pose estimation technique with surface-free analysis-by-synthesis outperforms purely discriminative monocular pose estimation approaches and generalizes well to multiple views.
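The test-time optimization described above can be sketched in miniature: starting from a noisy pose initialization, both the pose and the appearance parameters are refined jointly by gradient descent on a photometric reconstruction loss. In this minimal NumPy sketch the renderer is deliberately a toy linear map (so the gradient has a closed form) rather than a neural radiance field; the names `A`, `B`, and `render` are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the volumetric renderer: a linear map from pose and
# appearance parameters to a flat 64-pixel "image" (assumption, for
# illustration only -- the real renderer is a NeRF).
A = rng.normal(size=(64, 5))   # pose parameters -> pixels
B = rng.normal(size=(64, 8))   # appearance parameters -> pixels

def render(pose, appearance):
    return A @ pose + B @ appearance

true_pose = rng.normal(size=5)
true_appearance = rng.normal(size=8)
observed = render(true_pose, true_appearance)  # the input video frame

# Initialization: a noisy pose from an "off-the-shelf estimator",
# appearance learned entirely from scratch.
pose = true_pose + 0.5 * rng.normal(size=5)
appearance = np.zeros(8)

lr = 5e-3
for _ in range(2000):
    residual = render(pose, appearance) - observed  # photometric error
    pose -= lr * A.T @ residual                      # refine the pose ...
    appearance -= lr * B.T @ residual                # ... and learn appearance jointly
```

After the loop, the synthesized image matches the observation and the pose has moved toward the ground truth, mirroring how the photometric loss alone supervises both quantities.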


Our human body model (bottom left) is learned using a photometric reconstruction loss. First, the skeleton pose is initialized with an off-the-shelf pose estimator (gray arrows). Second, this pose is refined via analysis-by-synthesis using volumetric rendering (the step after NeRF) of a neural radiance field (green). The key is a skeleton-relative embedding that links the neural encoding with the skeleton pose and enables their joint learning (blue).
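The skeleton-relative embedding can be sketched as follows. This is a hedged illustration of the idea, not the paper's exact parameterization: each 3D query point sampled along a camera ray is re-expressed in the local coordinate frame of every bone, so the radiance-field MLP sees pose-invariant coordinates and the same network weights apply across all frames and poses. The function name and argument layout are assumptions.

```python
import numpy as np

def skeleton_relative_embedding(x, joint_positions, joint_rotations):
    """Map a world-space query point x into per-bone local coordinates.

    x               : (3,) world-space sample point
    joint_positions : list of (3,) joint locations in world space
    joint_rotations : list of (3, 3) world-from-bone rotation matrices
    Returns a (3 * num_joints,) feature vector fed to the NeRF MLP.
    """
    feats = []
    for p, R in zip(joint_positions, joint_rotations):
        # Express the point relative to this bone: undo the bone's
        # world rotation after translating to its origin.
        local = R.T @ (x - p)
        feats.append(local)
    return np.concatenate(feats)
```

Because every pose-dependent quantity is folded into this embedding, gradients of the photometric loss flow back through it to the joint positions and rotations, which is what makes the joint pose refinement possible.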


Shih-Yang Su, Frank Yu, Michael Zollhoefer, and Helge Rhodin. "A-NeRF: Surface-free Human 3D Pose Refinement via Neural Rendering", arXiv preprint arXiv:2102.06199, 2021.


@article{su2021anerf,
        title={A-NeRF: Surface-free Human 3D Pose Refinement via Neural Rendering},
        author={Su, Shih-Yang and Yu, Frank and Zollhoefer, Michael and Rhodin, Helge},
        journal={arXiv preprint arXiv:2102.06199},
        year={2021}
}