Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding
About
Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language-embedded scene representations have made progress by incorporating language features into 3D spaces. However, their efficacy depends heavily on neural networks that are resource-intensive to train and render. Although recent 3D Gaussians offer efficient and high-quality novel view synthesis, directly embedding language features in them leads to prohibitive memory usage and decreased performance. In this work, we introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks. Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we propose a dedicated quantization scheme that drastically reduces the memory requirement, and a novel embedding procedure that yields smoother yet highly accurate queries, countering the multi-view feature inconsistencies and the high-frequency inductive bias in point-based representations. Our comprehensive experiments show that our representation achieves the best visual quality and language querying accuracy among current language-embedded representations, while maintaining real-time rendering frame rates on a single desktop GPU.
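To give a rough sense of why quantization reduces the memory cost of per-Gaussian language features, the sketch below uses generic k-means vector quantization in NumPy. It is not the dedicated scheme proposed in the paper. The feature dimension, codebook size, and the helper names `assign`, `build_codebook`, `raw_bytes`, and `quantized_bytes` are all assumptions made for illustration.

```python
import numpy as np


def assign(features, codebook):
    """Return the index of the nearest codebook entry for each feature."""
    # Squared Euclidean distance, expanded as |f|^2 - 2 f.c + |c|^2.
    dists = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return dists.argmin(axis=1)


def build_codebook(features, num_codes, num_iters=10, seed=0):
    """Plain k-means (a stand-in, not the paper's scheme)."""
    rng = np.random.default_rng(seed)
    # Initialise the codes from randomly chosen feature vectors.
    picks = rng.choice(len(features), size=num_codes, replace=False)
    codebook = features[picks].copy()
    for _ in range(num_iters):
        indices = assign(features, codebook)
        for k in range(num_codes):
            members = features[indices == k]
            if len(members) > 0:  # leave empty clusters unchanged
                codebook[k] = members.mean(axis=0)
    return codebook, assign(features, codebook)


def raw_bytes(num_points, dim, bytes_per_value=4):
    """Memory for one float32 feature vector per point."""
    return num_points * dim * bytes_per_value


def quantized_bytes(num_points, dim, num_codes,
                    bytes_per_index=2, bytes_per_value=4):
    """Memory for one integer index per point plus a shared codebook."""
    return num_points * bytes_per_index + num_codes * dim * bytes_per_value


if __name__ == "__main__":
    # Hypothetical sizes: 1,000,000 Gaussians, 512-dim features, 256 codes.
    print(raw_bytes(1_000_000, 512))
    print(quantized_bytes(1_000_000, 512, 256))
```

Each point stores only a small integer index into a shared codebook instead of a full feature vector, so the memory cost grows with the number of codes rather than with the feature dimension per point.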
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 3D Semantic Segmentation | ScanNet++ | mIoU (20 classes) | 2.93 | 31 |
| 3D Segmentation | Mip-NeRF 360 | mIoU | 29.1 | 31 |
| Novel View Reconstruction | HyperNeRF held-out 4D LangSplat (test) | Americano Score | 16.48 | 20 |
| Novel View Reconstruction | HyperNeRF 4D LangSplat (test) | Americano Score | 63 | 20 |
| 3D Semantic Segmentation | 3D-OVS | Bed | 84.9 | 20 |
| 3D Semantic Segmentation | ScanNet | mIoU (10 classes) | 9.84 | 17 |
| 3D object selection | LERF-OVS | mIoU (Mean) | 17.42 | 17 |
| Open-Vocabulary 3D Scene Segmentation | LeRF-mask | Figurines mIoU | 60.3 | 17 |
| Open-vocabulary 3D object selection | LERF | Ramen Score | 46 | 16 |
| 3D object selection | LERF figurines scene | Peak VRAM | 20 | 14 |