BANMo: Building Animatable 3D Neural Models from Many Casual Videos

Gengshan Yang² Minh Vo³ Natalia Neverova¹
Deva Ramanan² Andrea Vedaldi¹ Hanbyul Joo¹
¹Meta AI ²Carnegie Mellon University ³Meta Reality Labs


Given multiple casual videos capturing a deformable object, BANMo reconstructs an animatable 3D model, including an implicit canonical 3D shape, appearance, skinning weights, and time-varying articulations, without pre-defined shape templates or registered cameras. Left: Input videos; Middle: 3D shape, bones, and skinning weights (visualized as surface colors) in the canonical space; Right: Posed reconstruction at each time instance with color and canonical embeddings (correspondences are shown in the same colors).


Abstract

Prior work for articulated 3D shape reconstruction often relies on specialized sensors (e.g., synchronized multi-camera systems) or pre-built 3D deformable models (e.g., SMAL or SMPL). Such methods do not scale to diverse sets of objects in the wild. We present BANMo, a method that requires neither a specialized sensor nor a pre-defined template shape. BANMo builds high-fidelity, articulated 3D models (including shape and animatable skinning weights) from many monocular casual videos in a differentiable rendering framework. While the use of many videos provides more coverage of camera views and object articulations, it introduces significant challenges in establishing correspondence across scenes with different backgrounds, illumination conditions, etc. Our key insight is to merge three schools of thought: (1) classic deformable shape models that make use of articulated bones and blend skinning, (2) volumetric neural radiance fields (NeRFs) that are amenable to gradient-based optimization, and (3) canonical embeddings that generate correspondences between pixels and an articulated model. We introduce neural blend skinning models that allow for differentiable and invertible articulated deformations. When combined with canonical embeddings, such models allow us to establish dense correspondences across videos that can be self-supervised with cycle consistency. On real and synthetic datasets, BANMo shows higher-fidelity 3D reconstructions than prior works for humans and animals, with the ability to render realistic images from novel viewpoints and poses.
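The blend skinning mentioned above can be illustrated with a short sketch. The PyTorch snippet below assumes Gaussian bones with learnable centers and precisions, and per-bone rigid transforms predicted for each time step; the function names and the softmax-over-Mahalanobis weighting are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of blend skinning with Gaussian-derived skinning weights,
# in the spirit of BANMo's neural blend skinning. Names and details are
# illustrative assumptions, not the authors' implementation.
import torch

def gaussian_skinning_weights(points, centers, precisions):
    """Soft assignment of N query points to B bones.

    points:     (N, 3) query points
    centers:    (B, 3) bone centers
    precisions: (B, 3) diagonal precision of each bone's Gaussian
    returns:    (N, B) skinning weights that sum to 1 over bones
    """
    diff = points[:, None, :] - centers[None, :, :]        # (N, B, 3)
    mahalanobis = (diff ** 2 * precisions[None]).sum(-1)   # (N, B)
    return torch.softmax(-mahalanobis, dim=-1)

def blend_skinning(points, weights, rotations, translations):
    """Warp points with per-bone rigid transforms blended by `weights`.

    rotations:    (B, 3, 3) per-bone rotations for one time step
    translations: (B, 3)    per-bone translations for the same step
    """
    # Apply every bone transform to every point, then blend per point.
    warped = torch.einsum('bij,nj->nbi', rotations, points) + translations[None]
    return (weights[..., None] * warped).sum(dim=1)         # (N, 3)
```

As described in the abstract, BANMo makes the articulated deformation approximately invertible by defining separate skinning weights for the forward (canonical-to-deformed) and backward (deformed-to-canonical) warps, and ties the resulting 3D correspondences to 2D pixels through canonical embeddings supervised with cycle consistency; the sketch above only shows a single warping direction.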

[arXiv] [Code (coming soon)]

Video

Bibtex

@Article{yang2021banmo,
  title   = {BANMo: Building Animatable 3D Neural Models from Many Casual Videos},
  author  = {Yang, Gengshan and Vo, Minh and Neverova, Natalia and Ramanan, Deva and Vedaldi, Andrea and Joo, Hanbyul},
  journal = {arXiv preprint arXiv:2112.12761},
  year    = {2021}
}

Comparison

Columns (left to right): Reference, BANMo, Nerfies, ViSER.
Comparison on Casual-cat-4. Top: reconstructed 3D shape over time. Bottom: reconstructed 3D shape at the 1st frame.

Results

Casual-cat (10 videos)
Casual-cat-0. Top left: reference image overlaid with input DensePose features. Top middle: reconstructed shape at the 1st frame. Top right: recovered articulations in the canonical space. Bottom row: reconstruction from front/side/top viewpoints. Correspondences are shown in the same colors.
Canonical rest shape:


Casual-human (10 videos)
Casual-human-5. Top left: reference image overlaid with input DensePose features. Top middle: reconstructed shape at the 1st frame. Top right: recovered articulations in the canonical space. Bottom row: reconstruction from front/side/top viewpoints. Correspondences are shown in the same colors.
Canonical rest shape:



AMA (2 unique actions out of 16 videos)
AMA-swing. Top left: reference image overlaid with input DensePose features. Top middle: reconstructed shape at the 1st frame. Top right: recovered articulations in the canonical space. Bottom row: reconstruction from front/side/top viewpoints. Correspondences are shown in the same colors.


Synthetic eagle (5 videos)
Eagle-0. Top left: reference image overlaid with input DensePose features. Top middle: reconstructed shape at the 1st frame. Top right: recovered articulations in the canonical space. Bottom row: reconstruction from front/side/top viewpoints. Correspondences are shown in the same colors.


Synthetic hands (5 videos)
Hands-0. Top left: reference image overlaid with input DensePose features. Top middle: reconstructed shape at the 1st frame. Top right: recovered articulations in the canonical space. Bottom row: reconstruction from front/side/top viewpoints. Correspondences are shown in the same colors.


Related projects

Video-based, template-free deformable shape reconstruction:
ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction. NeurIPS 2021.
LASR: Learning Articulated Shape Reconstruction from a Monocular Video. CVPR 2021.
DOVE: Learning Deformable 3D Objects by Watching Videos. arXiv preprint.
Image-based deformable shape reconstruction:
To The Point: Correspondence-driven monocular 3D category reconstruction. NeurIPS 2021.
Self-supervised Single-view 3D Reconstruction via Semantic Consistency. ECCV 2020.
Shape and Viewpoints without Keypoints. ECCV 2020.
Articulation Aware Canonical Surface Mapping. CVPR 2020.
Learning Category-Specific Mesh Reconstruction from Image Collections. ECCV 2018.

Acknowledgments

This work was done while Gengshan Yang was interning at Meta AI. Gengshan Yang is supported by the Qualcomm Innovation Fellowship. Thanks to Shubham Tulsiani, Jason Zhang, and Ignacio Rocco for helpful feedback and discussions, and to Vasil Khalidov for help setting up the DensePose-CSE color visualization.

Webpage design borrowed from Peiyun Hu.