Visibility-Aware Language Aggregation for Open-Vocabulary Segmentation in 3D Gaussian Splatting

About

Recently, distilling open-vocabulary language features from 2D images into 3D Gaussians has attracted significant attention. Although existing methods achieve impressive language-based interaction with 3D scenes, we observe two fundamental issues: background Gaussians that contribute negligibly to a rendered pixel receive the same feature as the dominant foreground ones, and view-specific noise in the language embeddings causes multi-view inconsistencies. We introduce Visibility-Aware Language Aggregation (VALA), a lightweight yet effective method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Moreover, we propose a streaming weighted geometric median in cosine space to merge noisy multi-view features. Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner. VALA improves open-vocabulary localization and segmentation across reference datasets, consistently surpassing existing works. More results are available at https://vala3d.github.io
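
The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-ray contribution uses the standard front-to-back alpha-compositing weight, the fixed gate threshold is a hypothetical stand-in for the paper's visibility-aware gate, and the streaming update is a Weiszfeld-style approximation of a weighted geometric median under cosine distance. All function names, the class name, and the `threshold` and `eps` parameters are assumptions made for illustration.

```python
import math


def ray_contributions(alphas):
    """Front-to-back compositing weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights, transmittance = [], 1.0
    for a in alphas:
        weights.append(a * transmittance)
        transmittance *= 1.0 - a
    return weights


def visibility_gate(weights, threshold=0.1):
    """Indices of Gaussians whose contribution reaches the (hypothetical) threshold."""
    return [i for i, w in enumerate(weights) if w >= threshold]


def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


class StreamingCosineMedian:
    """Online, Weiszfeld-style approximation of a weighted geometric median
    under cosine distance. Each sample is seen once and only a running sum
    is stored, so memory does not grow with the number of views."""

    def __init__(self, eps=1e-2):
        self.eps = eps          # assumed floor on the distance, avoids division by zero
        self.numerator = None   # running weighted sum of unit features
        self.estimate = None    # current unit-length estimate

    def update(self, feature, weight):
        f = normalize(feature)
        if self.estimate is None:
            scale = weight
        else:
            cos = sum(a * b for a, b in zip(self.estimate, f))
            # Samples far from the current estimate are down-weighted.
            scale = weight / max(1.0 - cos, self.eps)
        if self.numerator is None:
            self.numerator = [scale * x for x in f]
        else:
            self.numerator = [n + scale * x for n, x in zip(self.numerator, f)]
        self.estimate = normalize(self.numerator)
        return self.estimate


# One ray through three Gaussians: the first is nearly opaque, so the
# two behind it contribute little and are gated out.
weights = ray_contributions([0.9, 0.5, 0.5])
visible = visibility_gate(weights)

# Merge four per-view features for one Gaussian; the third view is an outlier.
median = StreamingCosineMedian()
for view_feature in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]):
    merged = median.update(view_feature, weight=1.0)
```

In this toy setup only the first Gaussian passes the gate, and the merged feature stays close to the three agreeing views rather than being pulled toward the outlier as a plain average would be.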

Sen Wang, Kunyi Li, Siyun Liang, Elena Alegret, Jing Ma, Nassir Navab, Stefano Gasperini • 2025

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ScanNet 19 classes	mIoU 32.11	23
3D Semantic Segmentation	ScanNet 10 classes	mIoU 46.21	17
3D Semantic Segmentation	ScanNet 15 classes	mIoU 35.1	17
3D Object Localization	LERF	Ramen Success Rate 75.6	14
Open Vocabulary Semantic Segmentation	LERF-OVS	mIoU 61.7	12
3D Semantic Segmentation	LERF	mIoU (Ramen) 51.5	9
3D Open-vocabulary Segmentation	ScanNet V2	mIoU (19 classes) 32.11	7
