
GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting

About

3D scene reconstruction and understanding have gained increasing popularity, yet existing methods still struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend to generalized language feature fields, we introduce the core contribution of GALA, a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries. It reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA's remarkable open-vocabulary performance on both 2D and 3D.
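To make the codebook idea concrete, here is a minimal sketch of cross-attention in which a low-dimensional per-Gaussian feature queries shared codebooks. The abstract does not say how the two codebooks are used; treating one as attention keys and the other as values is an assumption made here for illustration, and the names `cross_attend`, `key_codebook`, and `value_codebook` are hypothetical, not taken from the GALA code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, key_codebook, value_codebook):
    """Scaled dot-product cross-attention of one query over a codebook.

    query:          low-dimensional feature of one Gaussian (length d)
    key_codebook:   K entries of length d (ASSUMED role: attention keys)
    value_codebook: K entries of length D (ASSUMED role: language embeddings)
    Returns the attended D-dimensional feature and the attention weights.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in key_codebook
    ]
    weights = softmax(scores)
    out_dim = len(value_codebook[0])
    out = [
        sum(w * value[j] for w, value in zip(weights, value_codebook))
        for j in range(out_dim)
    ]
    return out, weights

# Toy example: 2 codebook entries, 2-D queries, 3-D output embeddings.
key_codebook = [[1.0, 0.0], [0.0, 1.0]]
value_codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
feature, weights = cross_attend([4.0, 0.0], key_codebook, value_codebook)
```

Under this reading, each Gaussian stores only the short query vector, while the high-dimensional embeddings live once in the shared codebook; Gaussians with similar queries receive similar output features.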

Elena Alegret, Kunyi Li, Sen Wang, Siyun Liang, Michael Niemeyer, Stefano Gasperini, Nassir Navab, Federico Tombari • 2025

Related benchmarks

Task | Dataset | Result | Rank
2D Open-Vocabulary Query | LERF-OVS | Mean mIoU: 55.49 | 7
3D Open-Vocabulary Query | LERF-OVS | Mean mIoU: 36.71 | 7
Open-vocabulary 3D Scene Understanding | Figurines scene | Memory (GB): 14 | 3
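The Mean mIoU results above average intersection-over-union scores between predicted and ground-truth masks. A minimal sketch of that metric follows, assuming binary masks flattened to sequences of 0/1. The function names `iou` and `mean_iou` are illustrative, the convention for two empty masks is an assumption, and the listed values appear to be percentages rather than the 0 to 1 fractions returned here.

```python
def iou(pred, gt):
    """Intersection over union of two equally sized binary masks."""
    if len(pred) != len(gt):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # ASSUMPTION: two empty masks count as a perfect match.
    return intersection / union if union else 1.0

def mean_iou(pairs):
    """Average IoU over a list of (predicted_mask, ground_truth_mask) pairs."""
    if not pairs:
        raise ValueError("need at least one mask pair")
    return sum(iou(pred, gt) for pred, gt in pairs) / len(pairs)

# Toy example with two text queries on a 4-pixel image.
pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # intersection 1, union 2
    ([0, 1, 1, 0], [0, 1, 1, 0]),  # exact match
]
score = mean_iou(pairs)
```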
