InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception

About

3D scene understanding has become an essential area of research with applications in autonomous driving, robotics, and augmented reality. Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful approach, combining explicit modeling with neural adaptability to provide efficient and detailed scene representations. However, three major challenges remain in leveraging 3DGS for scene understanding: 1) an imbalance between appearance and semantics, where dense Gaussian usage for fine-grained texture modeling does not align with the minimal requirements for semantic attributes; 2) inconsistencies between appearance and semantics, as purely appearance-based Gaussians often misrepresent object boundaries; and 3) reliance on top-down instance segmentation methods, which struggle with uneven category distributions, leading to over- or under-segmentation. In this work, we propose InstanceGaussian, a method that jointly learns appearance and semantic features while adaptively aggregating instances. Our contributions include: i) a novel Semantic-Scaffold-GS representation balancing appearance and semantics to improve feature representations and boundary delineation; ii) a progressive appearance-semantic joint training strategy to enhance stability and segmentation accuracy; and iii) a bottom-up, category-agnostic instance aggregation approach that addresses segmentation challenges through farthest point sampling and connected component analysis. Our approach achieves state-of-the-art performance in category-agnostic, open-vocabulary 3D point-level segmentation, highlighting the effectiveness of the proposed representation and training strategies. Project page: https://lhj-git.github.io/InstanceGaussian/
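The abstract names two standard building blocks for the bottom-up aggregation step: farthest point sampling and connected component analysis. The sketch below shows generic textbook versions of both on a toy point set. It is not the authors' implementation. The function names, the `radius` threshold, and the sample points are all assumptions made for illustration, and how InstanceGaussian actually combines the two steps is not specified here.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices so each new pick is as far as possible
    from the points already selected."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # The next sample is the point farthest from the current selection.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


def connected_components(points, radius):
    """Give two points the same label when a chain of neighbours,
    each closer than `radius` (hypothetical threshold), links them."""
    labels = [-1] * len(points)
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue  # already assigned to an earlier component
        labels[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            for j, q in enumerate(points):
                if labels[j] == -1 and math.dist(points[i], q) < radius:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy 3D points (illustrative only).
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0),
          (5.0, 5.0, 0.0), (0.0, 6.0, 0.0)]
seeds = farthest_point_sampling(points, 3)
labels = connected_components(points, radius=0.5)
```

Neither routine needs class labels, which is what makes a pipeline built from them category-agnostic: sampling spreads seeds over the scene, and connectivity decides which points belong together.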

Haijie Li, Yanmin Wu, Jiarui Meng, Qiankun Gao, Zhiyao Zhang, Ronggang Wang, Jian Zhang • 2024

Related benchmarks

Task	Dataset	Metric	Result
3D Semantic Segmentation	ScanNet++	mIoU (20 classes)	29.98	31
3D object selection	LERF-OVS	mIoU (Mean)	45.3	21
3D Semantic Segmentation	ScanNet	mIoU (10 classes)	0.2977	17
3D Semantic Segmentation	ScanNet V2	mIoU	34.14	16
Open-vocabulary 3D object selection	LERF	Ramen Score	24.6	16
3D object selection	LERF figurines scene	Peak VRAM	24	14
Open-Vocabulary 3D Semantic Segmentation	ScanNet 19 classes	mIoU	40.7	12
Open-Vocabulary 3D Semantic Segmentation	ScanNet 15 classes	mIoU	42.5	12
Open-Vocabulary 3D Semantic Segmentation	ScanNet 10 classes	mIoU	47.9	12
3D Semantic Segmentation	ScanNet200	mIoU (70 classes)	23.2	11

Showing 10 of 19 rows
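Most rows above report mIoU (mean intersection over union). As a reminder of what that number measures, here is a minimal sketch of the usual per-class definition over point labels. The function name and the toy labels are assumptions for illustration, and the exact evaluation protocol of each benchmark (class lists, ignored labels, averaging over scenes) may differ.

```python
def mean_iou(pred, target, num_classes):
    """Average the per-class IoU between predicted and ground-truth labels.

    Classes that appear in neither list are skipped so they do not
    distort the mean.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy per-point labels for three classes (illustrative only).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

The function returns a fraction in [0, 1]; multiply by 100 to express it as a percentage.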
