EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision
About
We present EmerNeRF, a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes. Grounded in neural fields, EmerNeRF simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping. EmerNeRF hinges upon two core components: First, it stratifies scenes into static and dynamic fields. This decomposition emerges purely from self-supervision, enabling our model to learn from general, in-the-wild data sources. Second, EmerNeRF parameterizes an induced flow field from the dynamic field and uses this flow field to further aggregate multi-frame features, amplifying the rendering precision of dynamic objects. Coupling these three fields (static, dynamic, and flow) enables EmerNeRF to represent highly-dynamic scenes self-sufficiently, without relying on ground truth object annotations or pre-trained models for dynamic object segmentation or optical flow estimation.

Our method achieves state-of-the-art performance in sensor simulation, significantly outperforming previous methods when reconstructing static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes. In addition, to bolster EmerNeRF's semantic generalization, we lift 2D visual foundation model features into 4D space-time and address a general positional bias in modern Transformers, significantly boosting 3D perception performance (e.g., 37.50% relative improvement in occupancy prediction accuracy on average). Finally, we construct a diverse and challenging 120-sequence dataset to benchmark neural fields under extreme and highly-dynamic settings.
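The static-dynamic decomposition above can be illustrated with a minimal sketch: both fields are queried at the same space-time point, and their outputs are blended with density-based weights so that whichever field explains a region better dominates the rendered color. The function below is illustrative only and assumes hypothetical per-sample field outputs; it is not EmerNeRF's actual API.

```python
import numpy as np

def composite(sigma_static, rgb_static, sigma_dynamic, rgb_dynamic):
    """Density-weighted blend of static and dynamic field outputs at one sample.

    sigma_*: non-negative scalar densities from each field.
    rgb_*:   (3,) color predictions from each field.
    Returns the combined density and blended color for volume rendering.
    """
    sigma = sigma_static + sigma_dynamic            # total density at the sample
    w_static = sigma_static / np.maximum(sigma, 1e-8)  # static field's share
    rgb = w_static * rgb_static + (1.0 - w_static) * rgb_dynamic
    return sigma, rgb
```

With this weighting, a region where the dynamic field's density collapses to zero is rendered purely from the static field, which is how the decomposition can emerge from reconstruction losses alone, without segmentation labels.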
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Scene Reconstruction | nuScenes | PSNR 26.75 | 17 |
| Novel View Synthesis | nuScenes Shift ± 2 v1.0-trainval (test) | FID 52.03 | 14 |
| Surrounding View Synthesis | NuScenes v1.0 (test) | PSNR 26.75 | 11 |
| RGB Reconstruction | nuScenes (val) | PSNR 30.88 | 10 |
| Out-of-path View Synthesis | CARLA (out-of-path) | PSNR 21.18 | 8 |
| Depth Estimation | nuScenes Sparse LiDAR GT official (val) | Abs Rel Error 0.073 | 7 |
| RGB Novel-View Synthesis | nuScenes (val) | PSNR 20.91 | 7 |
| Novel View Synthesis | Waymo Open Dataset 12 scenes | PSNR 26.12 | 7 |
| Scene Reconstruction | Waymo Open Dataset 12 scenes | PSNR 27.15 | 7 |
| View Synthesis | Waymo Static scenes (test) | PSNR 30.15 | 7 |