DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features

About

We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in outdoor autonomous driving scenes. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs with limited view overlap, and is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets from them, which helps our model to learn enhanced 3D geometry from sparse non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models, such as CLIP or DINOv2, thereby enabling various downstream tasks without the need for costly 3D human annotations. To leverage these two insights, we introduce a novel model architecture with a two-stage lift-splat-shoot encoder and a parameterized sparse hierarchical voxel representation. Experimental results on the NuScenes and Waymo NOTR datasets demonstrate that DistillNeRF significantly outperforms existing comparable state-of-the-art self-supervised methods for scene reconstruction, novel view synthesis, and depth estimation; and it allows for competitive zero-shot 3D semantic occupancy prediction, as well as open-world scene understanding through distilled foundation model features. Demos and code will be available at https://distillnerf.github.io/.
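The abstract says the model is trained with differentiable rendering to reconstruct RGB and depth images. As a rough illustration of what that involves, the sketch below alpha-composites samples along a single camera ray using the standard NeRF-style quadrature. It is a generic textbook formulation, not the authors' implementation: the function name `composite_ray`, its arguments, and the choice of interval midpoints for depth are assumptions made for illustration, and it ignores the sparse hierarchical voxel representation entirely.

```python
import math


def composite_ray(densities, colors, t_vals):
    """Alpha-composite samples along one ray (generic NeRF-style quadrature).

    densities: non-negative density for each sample interval
    colors:    one (r, g, b) tuple per sample interval
    t_vals:    interval boundaries along the ray, len(densities) + 1 values
    Returns (rgb, expected_depth, accumulated_opacity).
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    rgb = [0.0, 0.0, 0.0]
    depth = 0.0
    opacity = 0.0
    for i, (sigma, color) in enumerate(zip(densities, colors)):
        delta = t_vals[i + 1] - t_vals[i]
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this interval
        weight = transmittance * alpha          # contribution to the pixel
        for c in range(3):
            rgb[c] += weight * color[c]
        # Assumption: use the interval midpoint as the sample depth.
        depth += weight * 0.5 * (t_vals[i] + t_vals[i + 1])
        opacity += weight
        transmittance *= 1.0 - alpha
    return tuple(rgb), depth, opacity


# Empty space followed by an effectively opaque green surface.
rgb, depth, opacity = composite_ray(
    densities=[0.0, 1e9],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    t_vals=[0.0, 1.0, 2.0],
)
```

Because every step is a smooth function of the densities and colors, the same computation written in an autodiff framework lets a photometric or depth loss on the rendered pixel flow back to the scene representation. Rendering a feature image instead of RGB would, under the same assumptions, only change the per-sample vectors being composited.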

Letian Wang, Seung Wook Kim, Jiawei Yang, Cunjun Yu, Boris Ivanovic, Steven L. Waslander, Yue Wang, Sanja Fidler, Marco Pavone, Peter Karkus • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Semantic Occupancy Prediction | Occ3D-nuScenes v1.0 (val) | mIoU | 29.1 | 26 |
| RGB Reconstruction | nuScenes (val) | PSNR | 30.11 | 21 |
| Semantic Occupancy Estimation | Occ3D-nuScenes | mIoU | 10.1 | 9 |
| RGB Novel-View Synthesis | nuScenes (val) | PSNR | 20.78 | 7 |
| Depth Estimation | nuScenes Sparse LiDAR GT official (val) | Abs Rel Error | 0.223 | 7 |
| Depth Estimation | nuScenes Dense Depth GT (val) | Abs Rel | 0.228 | 6 |
| Camera Reconstruction | nuScenes (train) | PSNR | 28.01 | 5 |
| 3D Occupancy Prediction | Occ3D-NuScenes (in-domain) | mIoU | 29.11 | 5 |
| RGB Reconstruction | Waymo NOTR (full) | PSNR | 29.84 | 4 |
| Foundation Feature Reconstruction | nuScenes (val) | CLIP PSNR | 18.69 | 2 |
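For readers unfamiliar with the metrics in the benchmark list, the sketch below gives their common textbook definitions: PSNR for image reconstruction, absolute relative error (Abs Rel) for depth, and mean intersection-over-union (mIoU) for occupancy labels. The function names are hypothetical, and benchmark-specific details such as depth range caps, ignored classes, the scale on which mIoU is reported, or per-image versus per-dataset averaging are not stated in this listing and are not modeled here.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB over flat lists of pixel values."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def abs_rel(pred_depth, gt_depth):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred_depth, gt_depth) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def mean_iou(pred_labels, gt_labels, num_classes):
    """Average per-class IoU, skipping classes absent from both inputs."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred_labels, gt_labels) if p == c and g == c)
        union = sum(1 for p, g in zip(pred_labels, gt_labels) if p == c or g == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy inputs, unrelated to the reported results.
print(psnr([0.5], [0.4]))                       # about 20 dB
print(abs_rel([2.0, 3.0, 5.0], [2.0, 4.0, 0.0]))
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Higher is better for PSNR and mIoU, while lower is better for Abs Rel, which is worth keeping in mind when comparing rows across tasks.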
