LERF: Language Embedded Radiance Fields
About
Humans describe the physical world using natural language to refer to specific 3D locations based on a vast range of properties: visual appearance, semantics, abstract associations, or actionable affordances. In this work we propose Language Embedded Radiance Fields (LERF), a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF, which enables these kinds of open-ended language queries in 3D. LERF learns a dense, multi-scale language field inside NeRF by volume rendering CLIP embeddings along training rays, supervising these embeddings across training views to provide multi-view consistency and smooth the underlying language field. After optimization, LERF can extract 3D relevancy maps for a broad range of language prompts interactively in real time, which has potential use cases in robotics, understanding vision-language models, and interacting with 3D scenes. LERF enables pixel-aligned, zero-shot queries on the distilled 3D CLIP embeddings without relying on region proposals or masks, supporting long-tail open-vocabulary queries hierarchically across the volume. The project website can be found at https://lerf.io.
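To make the phrase "volume rendering CLIP embeddings along training rays" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the standard NeRF volume-rendering weights, uses 2-D toy vectors as stand-ins for CLIP embeddings, and scores a query with plain cosine similarity as a simplified stand-in for a relevancy score. All function names (`render_weights`, `render_embedding`, `cosine_similarity`) and the toy values are hypothetical.

```python
import math

def render_weights(densities, deltas):
    """Standard NeRF volume-rendering weights for samples along one ray."""
    weights = []
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights

def render_embedding(densities, deltas, embeddings):
    """Weighted sum of per-sample embeddings, normalized to unit length."""
    weights = render_weights(densities, deltas)
    dim = len(embeddings[0])
    rendered = [
        sum(w * e[i] for w, e in zip(weights, embeddings)) for i in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in rendered))
    # Leave the vector unnormalized if the ray hit nothing (all weights zero).
    return [x / norm for x in rendered] if norm > 0 else rendered

def cosine_similarity(a, b):
    """Cosine similarity between two vectors (assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy ray with three samples: empty space, a dense surface, thin haze behind it.
densities = [0.0, 5.0, 0.1]
deltas = [0.5, 0.5, 0.5]
# Hypothetical 2-D stand-ins for per-sample language embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

pixel_embedding = render_embedding(densities, deltas, embeddings)
query = [0.0, 1.0]  # hypothetical text-query embedding
score = cosine_similarity(pixel_embedding, query)
print(pixel_embedding, score)
```

In this toy ray the dense middle sample dominates the weights, so the rendered pixel embedding points almost entirely along that sample's embedding and scores highly against a query in the same direction. A real system would use high-dimensional CLIP vectors and batched tensor operations rather than Python lists.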
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 3D Visual Grounding | ScanRefer (val) | Overall Accuracy @ IoU 0.50 | 0.9 | 155 |
| 3D Semantic Segmentation | 3D-OVS | Bed | 86.9 | 20 |
| Open-Vocabulary 3D Scene Segmentation | LeRF-mask | Figurines mIoU | 33.5 | 17 |
| 3D Semantic Segmentation | LERF (test) | mIoU | 37.4 | 13 |
| Novel View Synthesis | Mip-NeRF360 (novel views) | PSNR | 25.749 | 12 |
| 3D Scene Reconstruction | LERF average across four scenes | PSNR | 20.75 | 12 |
| Object Segmentation | LERF-Mask 1.0 (test) | mIoU (mean) | 37.2 | 10 |
| Semantic segmentation | ScanNet 12 scenes (val) | mIoU | 31.2 | 9 |
| 3D Open-vocabulary Segmentation | LERF-style Dataset bed scene (test) | mIoU | 73.5 | 8 |
| 3D Open-vocabulary Segmentation | LERF-style Dataset lawn scene (test) | mIoU | 73.7 | 8 |