SHERF: Generalizable Human NeRF from a Single Image
About
Existing Human NeRF methods for reconstructing 3D humans typically rely on multiple 2D images from multi-view cameras or monocular videos captured from fixed camera views. However, in real-world scenarios, human images are often captured from random camera angles, presenting challenges for high-quality 3D human reconstruction. In this paper, we propose SHERF, the first generalizable Human NeRF model for recovering animatable 3D humans from a single input image. SHERF extracts and encodes 3D human representations in canonical space, enabling rendering and animation from free views and poses. To achieve high-fidelity novel view and pose synthesis, the encoded 3D human representations should capture both global appearance and local fine-grained textures. To this end, we propose a bank of 3D-aware hierarchical features, including global, point-level, and pixel-aligned features, to facilitate informative encoding. Global features enhance the information extracted from the single input image and complement the information missing from the partial 2D observation. Point-level features provide strong clues of 3D human structure, while pixel-aligned features preserve more fine-grained details. To effectively integrate the 3D-aware hierarchical feature bank, we design a feature fusion transformer. Extensive experiments on THuman, RenderPeople, ZJU_MoCap, and HuMMan datasets demonstrate that SHERF achieves state-of-the-art performance, with better generalizability for novel view and pose synthesis.
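The abstract describes combining global, point-level, and pixel-aligned features with a feature fusion transformer. As a rough illustration only, the sketch below fuses three per-point feature vectors with one single-head self-attention step in NumPy. This is not the authors' architecture: the function name `fuse_features`, the feature dimension, the random projection matrices, and the mean-pooling step are all assumptions chosen to keep the example small.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_features(global_feat, point_feat, pixel_feat, w_q, w_k, w_v):
    """Fuse three feature types per 3D point with single-head self-attention.

    Hypothetical helper, not SHERF's actual module.
    Each feature input has shape (num_points, dim); each weight has shape (dim, dim).
    Returns an array of shape (num_points, dim).
    """
    # Treat the three feature types as a 3-token sequence for every point.
    tokens = np.stack([global_feat, point_feat, pixel_feat], axis=1)  # (N, 3, D)

    q = tokens @ w_q  # (N, 3, D)
    k = tokens @ w_k  # (N, 3, D)
    v = tokens @ w_v  # (N, 3, D)

    # Scaled dot-product attention among the three tokens of each point.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (N, 3, 3)
    attn = softmax(scores, axis=-1)
    attended = attn @ v  # (N, 3, D)

    # Assumption: average the attended tokens into one fused vector per point.
    return attended.mean(axis=1)  # (N, D)


rng = np.random.default_rng(0)
num_points, dim = 4, 8  # illustrative sizes only

global_feat = rng.normal(size=(num_points, dim))
point_feat = rng.normal(size=(num_points, dim))
pixel_feat = rng.normal(size=(num_points, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

fused = fuse_features(global_feat, point_feat, pixel_feat, w_q, w_k, w_v)
print(fused.shape)  # (4, 8)
```

In a NeRF-style pipeline, a fused vector like this would typically be passed to a network that predicts density and color for each sampled point; that stage is omitted here.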
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Novel View Synthesis | THuman 2.0 (test) | LPIPS | 0.11 | 39 |
| Human Novel View Synthesis | HuMMan | PSNR | 20.83 | 6 |
| Human Image Synthesis | HuGe100K in-the-wild (test) | User Preference Score | 28.75 | 5 |
| Human Motion and View Synthesis | HuGe100K (user study) | Identity & Appearance Preservation | 13.55 | 5 |
| Novel View Synthesis | THuman in-domain 2.0 (test) | PSNR | 19.25 | 5 |
| Novel View Synthesis | RenderPeople in-domain 1.0 (test) | PSNR | 23.38 | 5 |
| Novel View Synthesis | ZJU-MoCap in-domain 1.0 (test) | PSNR | 27.81 | 5 |
| Novel View Synthesis | SynBody (test) | PSNR | 15.189 | 4 |
| 3D Human Disentanglement | SynBody (test) | CLIP Score (Overall) | 0.766 | 4 |
| 3D Human Disentanglement | CloSe (test) | CLIP Score (All) | 0.777 | 4 |
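
Several of the results above are reported as PSNR (higher is better). For reference, the sketch below computes PSNR with its standard definition, 10 * log10(MAX^2 / MSE), on a toy image pair. The evaluation protocol behind the listed numbers (masking, cropping, value range) is not specified here, so the `max_value=1.0` default and the tiny arrays are assumptions for illustration.

```python
import numpy as np


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    max_value is the largest possible pixel value (assumed 1.0 for float images).
    """
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


# Toy 2x2 RGB images: every pixel differs by 0.1, so MSE = 0.01.
reference = np.zeros((2, 2, 3))
rendered = np.full((2, 2, 3), 0.1)

value = psnr(rendered, reference)
print(f"{value:.2f} dB")  # about 20 dB
```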