ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting
About
Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. Mainstream methods, built on an embedding paradigm, suffer from three key flaws: (i) geometry-semantic inconsistency, where points, rather than objects, serve as the semantic basis, limiting semantic fidelity; (ii) semantic bloat from injecting gigabytes of feature data into the geometry; and (iii) semantic rigidity, as one feature per Gaussian struggles to capture rich polysemy. To overcome these limitations, we introduce ExtrinSplat, a framework built on the extrinsic paradigm that decouples geometry from semantics. Instead of embedding features, ExtrinSplat clusters Gaussians into multi-granularity, overlapping 3D object groups. A Vision-Language Model (VLM) then interprets these groups to generate lightweight textual hypotheses, creating an extrinsic index layer that natively supports complex polysemy. By replacing costly feature embedding with lightweight indices, ExtrinSplat reduces scene adaptation time from hours to minutes and lowers storage overhead by several orders of magnitude. On benchmark tasks for open-vocabulary 3D object selection and semantic segmentation, ExtrinSplat outperforms established embedding-based frameworks, validating the efficacy and efficiency of the proposed extrinsic paradigm.
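To make the extrinsic index layer concrete, here is a minimal sketch of how overlapping object groups and their textual hypotheses might be stored and queried. It is an illustration under stated assumptions, not the authors' implementation: the names `ObjectGroup`, `tokenize`, and `select_gaussians` are hypothetical, the sample groups are made-up toy data, and the token-overlap (Jaccard) scoring is a stand-in for whatever text matching ExtrinSplat actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectGroup:
    """Hypothetical 3D object group: Gaussian indices plus text hypotheses."""
    group_id: int
    gaussian_ids: frozenset                          # indices into the untouched 3DGS scene
    hypotheses: list = field(default_factory=list)   # short texts, e.g. produced by a VLM


def tokenize(text):
    """Lowercase whitespace tokenization (an assumed, deliberately simple choice)."""
    return set(text.lower().split())


def select_gaussians(groups, query):
    """Return the Gaussian ids of the group whose best hypothesis matches the query.

    Scoring is Jaccard overlap between query tokens and hypothesis tokens.
    This is an assumption made for illustration only.
    """
    query_tokens = tokenize(query)
    best_score, best_group = 0.0, None
    for group in groups:
        for hypothesis in group.hypotheses:
            hyp_tokens = tokenize(hypothesis)
            union = query_tokens | hyp_tokens
            score = len(query_tokens & hyp_tokens) / len(union) if union else 0.0
            if score > best_score:
                best_score, best_group = score, group
    return set(best_group.gaussian_ids) if best_group else set()


# Toy data: group 1 is a finer-grained subset of group 0, so the groups
# overlap, and each group carries several hypotheses (polysemy).
groups = [
    ObjectGroup(0, frozenset({0, 1, 2, 3, 4, 5}), ["bowl of ramen", "noodle soup"]),
    ObjectGroup(1, frozenset({2, 3}), ["boiled egg", "egg"]),
    ObjectGroup(2, frozenset({6, 7}), ["chopsticks"]),
]

egg_ids = select_gaussians(groups, "egg")
soup_ids = select_gaussians(groups, "noodle soup")
missing_ids = select_gaussians(groups, "sofa")
```

In this sketch the geometry is never modified: the index stores only integer ids and short strings, which is why an index of this kind can stay small compared with per-Gaussian feature vectors.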
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Open-vocabulary 3D object selection | LERF | Ramen Score | 45.6 | 16 |
| 3D object selection | LERF figurines scene | Peak VRAM | 8 | 14 |
| Open-Vocabulary 3D Semantic Segmentation | ScanNet 19 classes | mIoU | 45.5 | 12 |
| Open-Vocabulary 3D Semantic Segmentation | ScanNet 15 classes | mIoU | 47.2 | 12 |
| Open-Vocabulary 3D Semantic Segmentation | ScanNet 10 classes | mIoU | 53.7 | 12 |