NPC: Neural Point Characters from Video
ICCV 2023



High-fidelity human 3D models can now be learned directly from videos, typically by combining a template-based surface model with neural representations. However, obtaining a template surface requires expensive multi-view capture systems, laser scans, or strictly controlled conditions. Previous methods avoid using a template but rely on a costly or ill-posed mapping from observation to canonical space. We propose a hybrid point-based representation for reconstructing animatable characters that does not require an explicit surface model, while generalizing to novel poses. For a given video, our method automatically produces an explicit set of 3D points representing approximate canonical geometry, and learns an articulated deformation model that produces pose-dependent point transformations. The points serve both as a scaffold for high-frequency neural features and as anchors for efficiently mapping between observation and canonical space. We demonstrate on established benchmarks that our representation overcomes limitations of prior work operating in either canonical or observation space. Moreover, our automatic point extraction approach enables learning models of human and animal characters alike, matching the performance of methods that use rigged surface templates despite being more general.

We blur all faces for anonymity.




NPC produces a volume rendering of a character with a NeRF Fψ locally conditioned on features aggregated from a dynamically deformed point cloud. Given a raw video, we first estimate a canonical point cloud p with an implicit body model. A GNN then deforms the canonical points p conditioned on the skeleton pose θ, and produces a set of pose-dependent per-point features. Every 3D query point qo in the observation space aggregates features from its k nearest neighbors in the posed point cloud. The aggregated feature is passed into Fψ for volume rendering. Our model is supervised directly with the input videos.
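The per-query aggregation step above can be sketched as follows. This is an illustrative minimal version: the function name, the inverse-distance weighting, and the flat feature layout are our assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def aggregate_point_features(q_o, posed_points, point_features, k=8, eps=1e-8):
    """Blend features for a 3D query point q_o from its k nearest posed points.

    posed_points:   (N, 3) deformed point cloud positions
    point_features: (N, F) per-point feature vectors
    Returns an (F,) feature to condition the NeRF on.
    Inverse-distance weighting is an illustrative choice, not the paper's.
    """
    d = np.linalg.norm(posed_points - q_o, axis=1)  # (N,) distances to query
    idx = np.argsort(d)[:k]                         # indices of k nearest neighbors
    w = 1.0 / (d[idx] + eps)                        # closer points weigh more
    w = w / w.sum()                                 # normalize weights to sum to 1
    return w @ point_features[idx]                  # (F,) blended feature
```

In a full pipeline this feature (together with view direction) would be fed to the NeRF Fψ to predict density and color at q_o.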

Point Feature Encoding


Our core idea is to employ a point cloud p as an anchor that carries features from the canonical to the observation space, forming an efficient mapping between the two. (1) Each point p carries a learnable feature fp, and its position queries features fs from a canonical field. (2) The GNN adds pose-dependent features fθ and a deformation Δp. (3) The view direction and distance are encoded in bone-relative space. (4) The k nearest neighbors of qo establish forward and backward mappings from a query point to both the posed and canonical points.
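Step (3) can be illustrated with a small helper that expresses the view direction and the query's offset in a bone's local frame. The function name, signature, and the choice of a single rigid rotation per bone are our assumptions for this sketch.

```python
import numpy as np

def bone_relative_view(q_o, view_dir, bone_rot, bone_origin):
    """Express the view direction and query distance in a bone's local frame.

    q_o:         (3,) query point in observation space
    view_dir:    (3,) unit view direction in world coordinates
    bone_rot:    (3, 3) bone rotation (world-from-bone)
    bone_origin: (3,) bone origin in world coordinates
    Returns the view direction in the bone frame and the query's
    distance to the bone origin. Illustrative only.
    """
    d_local = bone_rot.T @ view_dir            # rotate direction into bone frame
    offset = bone_rot.T @ (q_o - bone_origin)  # query offset in bone frame
    dist = np.linalg.norm(offset)              # scalar distance to bone origin
    return d_local, dist
```

Encoding directions relative to each bone makes the learned appearance invariant to global body pose, which is why such bone-local parameterizations are common in articulated NeRFs.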



All data sourcing, modeling code, and experiments were developed at the University of British Columbia. Meta did not obtain the data or code, nor conduct any experiments for this work.






We thank Shaofei Wang and Ruilong Li for helpful discussions related to ARAH and TAVA. We thank Luis A. Bolaños for his help and discussions, and Frank Yu, Chunjin Song, Xingzhe He and Eric Hedlin for their insightful feedback. We also thank Advanced Research Computing at the University of British Columbia and Compute Canada for providing computational resources.
The website template was borrowed from Michaël Gharbi.