ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting

About

Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. Mainstream methods, built on an embedding paradigm, suffer from three key flaws: (i) geometry-semantic inconsistency, where points, rather than objects, serve as the semantic basis, limiting semantic fidelity; (ii) semantic bloat from injecting gigabytes of feature data into the geometry; and (iii) semantic rigidity, as one feature per Gaussian struggles to capture rich polysemy. To overcome these limitations, we introduce ExtrinSplat, a framework built on the extrinsic paradigm that decouples geometry from semantics. Instead of embedding features, ExtrinSplat clusters Gaussians into multi-granularity, overlapping 3D object groups. A Vision-Language Model (VLM) then interprets these groups to generate lightweight textual hypotheses, creating an extrinsic index layer that natively supports complex polysemy. By replacing costly feature embedding with lightweight indices, ExtrinSplat reduces scene adaptation time from hours to minutes and lowers storage overhead by several orders of magnitude. On benchmark tasks for open-vocabulary 3D object selection and semantic segmentation, ExtrinSplat outperforms established embedding-based frameworks, validating the efficacy and efficiency of the proposed extrinsic paradigm.
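The extrinsic index layer described above can be pictured as a small lookup structure that sits beside the Gaussians rather than inside them. The following is a minimal sketch under stated assumptions, not the authors' implementation: `ObjectGroup`, `select_gaussians`, and the toy data are hypothetical names, and plain token overlap stands in for whatever text matching the actual system uses. It illustrates only two properties named in the abstract: groups may overlap (one Gaussian can belong to several groups), and each group carries several textual hypotheses.

```python
from dataclasses import dataclass


@dataclass
class ObjectGroup:
    """Hypothetical record: one 3D object group plus its textual hypotheses."""
    gaussian_ids: frozenset[int]   # indices into the (unmodified) Gaussian set
    hypotheses: list[str]          # short text descriptions, e.g. produced by a VLM


def tokenize(text: str) -> set[str]:
    # Assumption: whitespace tokenization is enough for this toy example.
    return set(text.lower().split())


def select_gaussians(index: list[ObjectGroup], query: str) -> set[int]:
    """Return the union of Gaussian ids over the groups that best match the query.

    Matching is plain token overlap, a stand-in for a real text-matching step.
    """
    query_tokens = tokenize(query)
    best_score = 0
    selected: set[int] = set()
    for group in index:
        # Score a group by its best-matching hypothesis.
        score = max(
            (len(query_tokens & tokenize(h)) for h in group.hypotheses),
            default=0,
        )
        if score == 0:
            continue
        if score > best_score:
            best_score, selected = score, set(group.gaussian_ids)
        elif score == best_score:
            selected |= group.gaussian_ids
    return selected


# Toy index: groups overlap (Gaussians 2 and 3 belong to two groups),
# and each group holds more than one hypothesis.
index = [
    ObjectGroup(frozenset({0, 1, 2, 3}), ["bowl of ramen", "noodle soup"]),
    ObjectGroup(frozenset({2, 3}), ["soft boiled egg", "ramen topping"]),
    ObjectGroup(frozenset({7, 8}), ["wooden chopsticks"]),
]

egg_ids = select_gaussians(index, "egg")
ramen_ids = select_gaussians(index, "ramen")
```

In this sketch the per-scene semantic data is only a few strings and integer sets, which is the intuition behind the storage claim; the geometry itself is never touched by a query.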

Jiayu Ding, Xinpeng Liu, Zhiyi Pan, Shiqiang Long, Ge Li • 2025

Related benchmarks

Task	Dataset	Result
Open-vocabulary 3D object selection	LERF	Ramen Score 45.6	16
3D object selection	LERF figurines scene	Peak VRAM 8	14
Open-Vocabulary 3D Semantic Segmentation	ScanNet 19 classes	mIoU 45.5	12
Open-Vocabulary 3D Semantic Segmentation	ScanNet 15 classes	mIoU 47.2	12
Open-Vocabulary 3D Semantic Segmentation	ScanNet 10 classes	mIoU 53.7	12
