Tackling View-Dependent Semantics in 3D Language Gaussian Splatting
About
Recent advancements in 3D Gaussian Splatting (3D-GS) enable high-quality 3D scene reconstruction from RGB images. Many studies extend this paradigm for language-driven open-vocabulary scene understanding. However, most of them simply project 2D semantic features onto 3D Gaussians and overlook a fundamental gap between 2D and 3D understanding: a 3D object may exhibit different semantics from different viewpoints, a phenomenon we term view-dependent semantics. To address this challenge, we propose LaGa (Language Gaussians), which establishes cross-view semantic connections by decomposing the 3D scene into objects. It then constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves an improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset. Our code is available at: https://github.com/SJTU-DeepVisionLab/LaGa.
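The aggregation step described above (cluster an object's per-view semantic descriptors, then reweight the clusters) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`spherical_kmeans`, `cluster_weights`, `relevance`), the farthest-point initialization, and in particular the weighting rule (each cluster weighted by the share of views it covers) are assumptions chosen only to make the idea concrete. The descriptors are toy 3-dimensional vectors standing in for real language-aligned features.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit-norm."""
    return sum(x * y for x, y in zip(a, b))

def assign_clusters(descs, centers):
    """Index of the most similar center for every descriptor."""
    return [max(range(len(centers)), key=lambda j: cosine(d, centers[j]))
            for d in descs]

def spherical_kmeans(descs, k, iters=20):
    """Cluster unit-norm descriptors by cosine similarity.

    Assumption: deterministic farthest-point initialization, so the sketch
    needs no random seed.
    """
    centers = [descs[0]]
    while len(centers) < k:
        # Pick the descriptor least similar to its closest existing center.
        centers.append(min(descs, key=lambda d: max(cosine(d, c) for c in centers)))
    for _ in range(iters):
        assign = assign_clusters(descs, centers)
        new_centers = []
        for j in range(k):
            members = [d for d, a in zip(descs, assign) if a == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                new_centers.append(normalize(mean))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign_clusters(descs, centers)

def cluster_weights(assign, k):
    """Hypothetical reweighting: fraction of views that fall in each cluster."""
    return [assign.count(j) / len(assign) for j in range(k)]

def relevance(query, centers, weights):
    """Weighted similarity between a unit-norm text query and one object."""
    return sum(w * cosine(query, c) for c, w in zip(centers, weights))

# Toy object seen from four views: three views agree, one view differs.
views = [normalize(v) for v in ([1, 0.1, 0], [1, 0, 0.1], [1, -0.1, 0], [0, 1, 0])]
centers, assign = spherical_kmeans(views, k=2)
weights = cluster_weights(assign, k=2)
score = relevance(normalize([0, 1, 0]), centers, weights)
```

In this toy case the minority view forms its own cluster instead of being averaged into the dominant one, which is the point of keeping several descriptors per object rather than a single mean feature.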
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Semantic Segmentation | ScanNet 15 classes | mIoU | 35.5 | 17 |
| 3D Semantic Segmentation | ScanNet 10 classes | mIoU | 42.6 | 17 |
| Open-vocabulary 3D object selection | LERF | Ramen Score | 61.4 | 16 |
| 3D object selection | LERF figurines scene | Peak VRAM | 24 | 14 |
| Semantic segmentation | ScanNet 19 classes | mIoU | 32.5 | 13 |
| Open Vocabulary Semantic Segmentation | LERF-OVS | mIoU | 64 | 12 |
| Open-Vocabulary Segmentation | 3D-OVS corrected (test) | mIoU (Bed) | 96.8 | 5 |