GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding

About

3D Semantic Occupancy Prediction is fundamental for spatial understanding, yet existing approaches face challenges in scalability and generalization due to their reliance on extensive labeled data and computationally intensive voxel-wise representations. In this paper, we introduce GaussTR, a novel Gaussian-based Transformer framework that unifies sparse 3D modeling with foundation model alignment through Gaussian representations to advance 3D spatial understanding. GaussTR predicts sparse sets of Gaussians in a feed-forward manner to represent 3D scenes. By splatting the Gaussians into 2D views and aligning the rendered features with foundation models, GaussTR facilitates self-supervised 3D representation learning and enables open-vocabulary semantic occupancy prediction without requiring explicit annotations. Empirical experiments on the Occ3D-nuScenes dataset demonstrate GaussTR's state-of-the-art zero-shot performance of 12.27 mIoU, along with a 40% reduction in training time. These results highlight the efficacy of GaussTR for scalable and holistic 3D spatial understanding, with promising implications in autonomous driving and embodied agents. The code is available at https://github.com/hustvl/GaussTR.
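The alignment step described in the abstract (rendered 2D features pulled toward foundation-model features) can be illustrated with a small sketch. The abstract does not state which loss GaussTR uses, so the cosine-distance objective below is an assumption, and the names `cosine_similarity` and `alignment_loss` are hypothetical. Each feature map is flattened to a list of per-pixel vectors.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_a * norm_b, eps)


def alignment_loss(rendered, target):
    """Mean (1 - cosine similarity) over pixels.

    rendered: per-pixel features obtained by splatting Gaussians into a view.
    target:   per-pixel features from a frozen foundation model.
    Assumption: cosine distance is used only for illustration here; it is
    not confirmed to be the objective used by GaussTR.
    """
    if len(rendered) != len(target):
        raise ValueError("feature maps must have the same number of pixels")
    if not rendered:
        raise ValueError("feature maps must not be empty")
    losses = [1.0 - cosine_similarity(r, t) for r, t in zip(rendered, target)]
    return sum(losses) / len(losses)
```

Under this assumed objective, the loss is 0 when every rendered vector points in the same direction as its target and 1 when they are orthogonal, regardless of feature magnitude.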

Haoyi Jiang, Liu Liu, Tianheng Cheng, Xinjie Wang, Tianwei Lin, Zhizhong Su, Wenyu Liu, Xinggang Wang • 2024

Related benchmarks

Task	Dataset	Metric	Result	
3D Occupancy Prediction	Occ3D-nuScenes (val)	mIoU	12.27	215
3D Semantic Occupancy Prediction	Occ3D-nuScenes v1.0 (val)	mIoU	44.5	54
Semantic Occupancy Prediction	Occ3D-nuScenes	IoU (Semantic)	13.9	12
Occupancy Prediction	EmbodiedOcc-ScanNet 1.0 (test)	Overall IoU	15.63	10
Semantic Occupancy Estimation	Occ3D-nuScenes	mIoU	13.8	9
3D Occupancy Prediction	Occ3D-nuScenes (in-domain)	mIoU	45.19	5
Semantic Occupancy Prediction	ReplicaOcc (test)	IoU	15.01	5
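
Most rows above report mIoU (mean intersection over union across semantic classes). As a reminder of what the metric measures, here is a minimal sketch over flattened voxel labels. The function name `mean_iou` and the `ignore_index` parameter are hypothetical, and individual benchmarks may apply extra rules (visibility masks, excluded classes) that this sketch does not model.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean per-class IoU over flattened voxel labels.

    pred, target: equal-length sequences of integer class ids.
    Classes absent from both pred and target are skipped in the mean.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # voxel excluded from evaluation
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A mismatch adds one voxel to the union of both classes.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

For example, `mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)` gives per-class IoUs of 1/2 and 2/3, so the mean is 7/12. The table reports the metric as a percentage, i.e. this value multiplied by 100.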

