InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception
About
3D scene understanding has become an essential area of research with applications in autonomous driving, robotics, and augmented reality. Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful approach, combining explicit modeling with neural adaptability to provide efficient and detailed scene representations. However, three major challenges remain in leveraging 3DGS for scene understanding: 1) an imbalance between appearance and semantics, where dense Gaussian usage for fine-grained texture modeling does not align with the minimal requirements for semantic attributes; 2) inconsistencies between appearance and semantics, as purely appearance-based Gaussians often misrepresent object boundaries; and 3) reliance on top-down instance segmentation methods, which struggle with uneven category distributions, leading to over- or under-segmentation. In this work, we propose InstanceGaussian, a method that jointly learns appearance and semantic features while adaptively aggregating instances. Our contributions include: i) a novel Semantic-Scaffold-GS representation balancing appearance and semantics to improve feature representations and boundary delineation; ii) a progressive appearance-semantic joint training strategy to enhance stability and segmentation accuracy; and iii) a bottom-up, category-agnostic instance aggregation approach that addresses segmentation challenges through farthest point sampling and connected component analysis. Our approach achieves state-of-the-art performance in category-agnostic, open-vocabulary 3D point-level segmentation, highlighting the effectiveness of the proposed representation and training strategies. Project page: https://lhj-git.github.io/InstanceGaussian/
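The abstract names farthest point sampling and connected component analysis as the building blocks of the bottom-up instance aggregation. The following is a minimal, illustrative sketch of those two generic techniques on plain 3D points, not the authors' implementation: the function names, the `radius` threshold, and the use of raw Euclidean distance (rather than learned Gaussian features) are all assumptions made for the example.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k point indices that are spread far apart.

    `start` (hypothetical parameter) is the index of the first seed.
    """
    if not points or k <= 0:
        return []
    k = min(k, len(points))
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # The next sample is the point farthest from the current selection.
        nxt = max(range(len(points)), key=dist.__getitem__)
        selected.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return selected


def connected_components(points, radius):
    """Label points so that points within `radius` of each other share a label.

    `radius` is an assumed neighbourhood threshold; this brute-force version
    is O(n^2) and only meant to illustrate the idea.
    """
    labels = [-1] * len(points)
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:  # depth-first flood fill over the neighbourhood graph
            i = stack.pop()
            for j, q in enumerate(points):
                if labels[j] == -1 and math.dist(points[i], q) <= radius:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


pts = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0), (5.1, 0.0, 0.0)]
seeds = farthest_point_sampling(pts, 2)
groups = connected_components(pts, radius=0.5)
```

In this toy input the two seeds land in the two separate clusters, and the flood fill assigns one label per cluster. A category-agnostic pipeline could use such seeds and groupings without any class labels, which is the property the abstract emphasizes.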
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 3D Semantic Segmentation | ScanNet++ | mIoU (20 classes) | 29.98 | 31 |
| 3D object selection | LERF-OVS | mIoU (Mean) | 45.3 | 17 |
| 3D Semantic Segmentation | ScanNet | mIoU (10 classes) | 0.2977 | 17 |
| 3D Semantic Segmentation | ScanNet V2 | mIoU | 34.14 | 16 |
| Open-vocabulary 3D object selection | LERF | Ramen Score | 24.6 | 16 |
| 3D object selection | LERF figurines scene | Peak VRAM | 24 | 14 |
| Open-Vocabulary 3D Semantic Segmentation | ScanNet 19 classes | mIoU | 40.7 | 12 |
| Open-Vocabulary 3D Semantic Segmentation | ScanNet 15 classes | mIoU | 42.5 | 12 |
| Open-Vocabulary 3D Semantic Segmentation | ScanNet 10 classes | mIoU | 47.9 | 12 |
| 3D Semantic Segmentation | ScanNet200 | mIoU (70 classes) | 23.2 | 11 |
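
Most rows above report mIoU (mean intersection over union). As a reading aid, here is a minimal sketch of how per-point mIoU is commonly computed from predicted and ground-truth class labels. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions; individual benchmarks may average differently, and the table appears to mix fractional and percentage scales.

```python
def mean_iou(pred, gt, num_classes):
    """Average per-class IoU over classes that occur in pred or gt."""
    ious = []
    for c in range(num_classes):
        # Points where both agree on class c, and points where either says c.
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:  # assumed convention: ignore classes that never appear
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with four points and two classes.
score = mean_iou(pred=[0, 0, 1, 1], gt=[0, 1, 1, 1], num_classes=2)
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12; multiplying by 100 gives the percentage form used by most rows.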