DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features
About
We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in outdoor autonomous driving scenes. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs with limited view overlap, and is trained in a self-supervised manner with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets from them, which helps our model learn enhanced 3D geometry from sparse, non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models, such as CLIP or DINOv2, thereby enabling various downstream tasks without the need for costly 3D human annotations. To leverage these two insights, we introduce a novel model architecture with a two-stage lift-splat-shoot encoder and a parameterized sparse hierarchical voxel representation. Experimental results on the nuScenes and Waymo NOTR datasets demonstrate that DistillNeRF significantly outperforms existing comparable state-of-the-art self-supervised methods for scene reconstruction, novel view synthesis, and depth estimation. It also allows for competitive zero-shot 3D semantic occupancy prediction, as well as open-world scene understanding through distilled foundation model features. Demos and code will be available at https://distillnerf.github.io/.
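To make the phrase "differentiable rendering to reconstruct RGB, depth, or feature images" concrete, here is a minimal sketch of the standard NeRF-style compositing rule along a single ray. It is an illustration under assumptions, not DistillNeRF's actual renderer: the function name `render_ray` and its arguments are hypothetical, and the paper's sparse hierarchical voxel sampling is not modeled. The same weights composite any per-sample vector, so `values` could hold RGB colors or distilled foundation-model features.

```python
import math


def render_ray(densities, deltas, values, depths):
    """Composite per-sample values along one ray (standard NeRF quadrature).

    densities: non-negative density at each sample along the ray
    deltas:    length of the ray segment each sample covers
    values:    per-sample vectors (e.g. RGB or a feature vector)
    depths:    distance of each sample from the camera
    Hypothetical helper for illustration only.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    rendered = [0.0] * len(values[0])
    rendered_depth = 0.0
    for sigma, delta, value, t in zip(densities, deltas, values, depths):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution of this sample
        rendered = [r + weight * v for r, v in zip(rendered, value)]
        rendered_depth += weight * t            # expected ray termination depth
        transmittance *= 1.0 - alpha
    opacity = 1.0 - transmittance               # accumulated opacity of the ray
    return rendered, rendered_depth, opacity
```

Every step is built from differentiable operations, so in an autodiff framework a loss on the rendered color, depth, or feature would propagate gradients back to the densities and per-sample values.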
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| RGB Reconstruction | nuScenes (val) | PSNR | 30.11 | 10 |
| RGB Novel-View Synthesis | nuScenes (val) | PSNR | 20.78 | 7 |
| Depth Estimation | nuScenes Sparse LiDAR GT official (val) | Abs Rel Error | 0.223 | 7 |
| Depth Estimation | nuScenes Dense Depth GT (val) | Abs Rel | 0.228 | 6 |
| Camera Reconstruction | nuScenes (train) | PSNR | 28.01 | 5 |
| RGB Reconstruction | Waymo NOTR (full) | PSNR | 29.84 | 4 |
| Foundation Feature Reconstruction | nuScenes (val) | CLIP PSNR | 18.69 | 2 |
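For readers unfamiliar with the two metrics in the table, the following sketch shows their standard definitions. It makes assumptions: the helper names `psnr` and `abs_rel` are hypothetical, inputs are flat lists of values, and the masking rule (skip pixels with no ground-truth depth) is a common convention rather than the confirmed protocol behind these benchmark numbers. Higher PSNR is better; lower Abs Rel is better.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def abs_rel(pred_depth, gt_depth):
    """Mean absolute relative depth error over pixels with valid ground truth.

    Assumption: a ground-truth depth <= 0 marks a pixel without a measurement
    (typical for sparse LiDAR targets), so such pixels are skipped.
    """
    valid = [(p, g) for p, g in zip(pred_depth, gt_depth) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)
```

For example, predictions of 11 m and 18 m against ground truth of 10 m and 20 m are each off by 10% of the true depth, giving an Abs Rel of 0.1.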