Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration
About
We introduce Dr. Splat, a novel approach for open-vocabulary 3D scene understanding that leverages 3D Gaussian Splatting (3DGS). Unlike existing language-embedded 3DGS methods, which rely on a rendering process, our method directly associates language-aligned CLIP embeddings with 3D Gaussians for holistic 3D scene understanding. The key to our method is a language feature registration technique in which CLIP embeddings are assigned to the dominant Gaussians intersected by each pixel ray. Moreover, we integrate Product Quantization (PQ) trained on general large-scale image data to compactly represent embeddings without per-scene optimization. Experiments demonstrate that our approach significantly outperforms existing approaches on 3D perception benchmarks, including open-vocabulary 3D semantic segmentation, 3D object localization, and 3D object selection. For video results, please visit: https://drsplat.github.io/
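The two ideas in the abstract, registering per-pixel embeddings onto the dominant Gaussians of each ray and compressing the result with Product Quantization, can be illustrated with a small sketch. It is not the authors' implementation: the function names (`register_features`, `pq_encode`, `pq_decode`), the weighted-sum aggregation, the `top_k` selection rule, and the toy codebooks are all assumptions made for illustration. The abstract does not specify how dominance is measured or how the PQ codebooks are trained beyond "general large-scale image data". Plain Python lists stand in for real CLIP embeddings and rasterizer output.

```python
from math import sqrt


def register_features(pixel_rays, pixel_embeddings, top_k=2):
    """Aggregate per-pixel embeddings onto the dominant Gaussians of each ray.

    pixel_rays: one entry per pixel, each a list of
        (gaussian_id, blending_weight) pairs for Gaussians hit by that ray.
    pixel_embeddings: one embedding (list of floats) per pixel.
    Returns {gaussian_id: unit-norm embedding}.
    Assumption: "dominant" means largest blending weight along the ray.
    """
    sums = {}
    for hits, emb in zip(pixel_rays, pixel_embeddings):
        # Keep only the top_k Gaussians with the largest blending weight.
        dominant = sorted(hits, key=lambda h: h[1], reverse=True)[:top_k]
        for gid, weight in dominant:
            acc = sums.setdefault(gid, [0.0] * len(emb))
            for i, value in enumerate(emb):
                acc[i] += weight * value
    registered = {}
    for gid, acc in sums.items():
        norm = sqrt(sum(v * v for v in acc))
        registered[gid] = [v / norm for v in acc] if norm > 0 else acc
    return registered


def pq_encode(vec, codebooks):
    """Encode vec as one codeword index per sub-vector.

    codebooks: list of M sub-codebooks; each holds codewords of
        length len(vec) // M (hypothetical, fixed in advance).
    """
    sub_dim = len(vec) // len(codebooks)
    codes = []
    for j, book in enumerate(codebooks):
        sub = vec[j * sub_dim:(j + 1) * sub_dim]
        # Squared Euclidean distance from the sub-vector to each codeword.
        dists = [sum((a - b) ** 2 for a, b in zip(sub, cw)) for cw in book]
        codes.append(dists.index(min(dists)))
    return codes


def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen codewords."""
    out = []
    for idx, book in zip(codes, codebooks):
        out.extend(book[idx])
    return out


# Toy example: 2 pixels, 3 Gaussians, 4-dimensional "embeddings".
rays = [[(0, 0.7), (1, 0.2), (2, 0.1)], [(1, 0.6), (2, 0.4)]]
embs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],  # sub-codebook for dimensions 0-1
    [[0.0, 0.0], [1.0, 1.0]],  # sub-codebook for dimensions 2-3
]

feats = register_features(rays, embs, top_k=1)
codes = {gid: pq_encode(f, codebooks) for gid, f in feats.items()}
```

Because the codebooks are fixed before any scene is seen, each Gaussian stores only a few small integer indices instead of a full embedding, and no per-scene training is needed to compress the features.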
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Open-vocabulary 3D Scene Understanding | LERF | Feature Distillation Time (h) | 10 | 7 |
| 3D object selection | LERF-OVS | mIoU (Waldo Kitchen) | 29.37 | 5 |
| Open-vocabulary point cloud understanding | ScanNet 19 classes | mIoU | 28.4 | 5 |
| Open-vocabulary point cloud understanding | ScanNet 15 classes | mIoU | 32.67 | 5 |
| 3D Referring Segmentation | ScanNet curated (test) | 3D mIoU | 10.56 | 5 |
| Novel-view Panoptic Segmentation | Neu3D coffee martini | mAcc (Pixel) | 88.37 | 5 |
| Novel-view Panoptic Segmentation | Neu3D flame salmon | mAcc (Pixel) | 81.22 | 5 |
| Open-Vocabulary 3D Semantic Segmentation | ScanNet 19 classes v2 | mIoU | 23.21 | 5 |
| Open-Vocabulary 3D Semantic Segmentation | ScanNet 15 classes v2 | mIoU | 25.33 | 5 |
| Open-Vocabulary 3D Semantic Segmentation | ScanNet 10 classes v2 | mIoU | 36.71 | 5 |