PanSt3R: Multi-view Consistent Panoptic Segmentation
About
Panoptic segmentation of 3D scenes, involving the segmentation and classification of object instances in a dense 3D reconstruction of a scene, is a challenging problem, especially when relying solely on unposed 2D images. Existing approaches typically leverage off-the-shelf models to extract per-frame 2D panoptic segmentations, before optimizing an implicit geometric representation (often based on NeRF) to integrate and fuse the 2D predictions. We argue that relying on 2D panoptic segmentation for an inherently 3D, multi-view problem is likely suboptimal, as it fails to leverage the full potential of spatial relationships across views. In addition to requiring camera parameters, these approaches also necessitate computationally expensive test-time optimization for each scene. Instead, in this work, we propose a unified and integrated approach, PanSt3R, which eliminates the need for test-time optimization by jointly predicting 3D geometry and multi-view panoptic segmentation in a single forward pass. Our approach builds upon recent advances in 3D reconstruction, specifically MUSt3R, a scalable multi-view version of DUSt3R, and enhances it with semantic awareness and multi-view panoptic segmentation capabilities. We additionally revisit the standard post-processing mask merging procedure and introduce a more principled approach for multi-view segmentation. We also introduce a simple method for generating novel-view predictions based on the predictions of PanSt3R and vanilla 3DGS. Overall, the proposed PanSt3R is conceptually simple, yet fast and scalable, and achieves state-of-the-art performance on several benchmarks, while being orders of magnitude faster than existing methods.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 3D Instance Segmentation | ScanNet V2 (val) | Average AP50 | 29.3 | 195 |
| 3D Instance Segmentation | ScanNet++ V1 (val) | AP50 | 15.9 | 12 |
| 3D Semantic Segmentation | ScanNet 3 (val) | mIoU | 42.6 | 11 |
| 3D Instance Segmentation | ScanNet200 v2 (val) | mAP (%) | 10.6 | 10 |
| 3D Semantic Segmentation | ScanNet200 42 (val) | mIoU | 13.3 | 9 |
| 3D Semantic Segmentation | ScanNet++ 57 (val) | mIoU | 21.6 | 5 |
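
For readers unfamiliar with the mIoU metric reported in the semantic segmentation rows, the following is a minimal sketch of how mean intersection-over-union is commonly computed from per-point labels. It is an illustration only, not the benchmark's or the authors' evaluation code: the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions, and official evaluation scripts may handle ignored labels and aggregation differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred, target: equal-length sequences of integer class labels (one per point).
    num_classes: number of classes, labeled 0 .. num_classes - 1 (assumed convention).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Points where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Points where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both (assumed convention)
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six points.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {100 * score:.1f}%")
```

In this toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. The values in the table above appear to be on a 0-100 scale.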