MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency
About
Masked Modeling (MM) has demonstrated widespread success in various vision challenges by reconstructing masked visual patches. Yet applying MM to large-scale 3D scenes remains an open problem due to data sparsity and scene complexity. The conventional random masking paradigm used for 2D images often leads to a high risk of ambiguity when recovering the masked regions of 3D scenes. To this end, we propose a novel informative-preserved reconstruction, which explores local statistics to discover and preserve representative structured points, effectively enhancing the pretext masking task for 3D scene understanding. Integrated with a progressive reconstruction scheme, our method can concentrate on modeling regional geometry and suffers less ambiguity in masked reconstruction. Besides, scenes with progressive masking ratios can also be used to self-distill their intrinsic spatial consistency, which requires learning consistent representations from unmasked areas. By combining informative-preserved reconstruction on masked areas with consistency self-distillation from unmasked areas, we obtain a unified framework called MM-3DScene. We conduct comprehensive experiments on a host of downstream tasks. The consistent improvements (e.g., +6.1 mAP@0.5 on object detection and +2.2% mIoU on semantic segmentation) demonstrate the superiority of our approach.
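The masking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which local statistic is used, so the score below (mean distance to the `k` nearest neighbours) is an assumed stand-in, and the names `local_difference`, `informative_preserved_mask`, `preserve_ratio`, and the ratios `0.3, 0.5, 0.7` are hypothetical. The sketch only shows the general pattern: score points with a local statistic, keep the top-scoring ones visible, and mask randomly among the rest at progressively larger ratios.

```python
import math
import random


def local_difference(points, k=8):
    """Score each point by its mean distance to its k nearest neighbours.

    ASSUMPTION: this statistic is a stand-in; the abstract only says
    "local statistics". Brute-force O(N^2), fine for a toy example.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest) if nearest else 0.0)
    return scores


def informative_preserved_mask(points, mask_ratio, preserve_ratio=0.1, k=8, seed=None):
    """Return (masked, preserved) index sets.

    The top `preserve_ratio` fraction of points by score is never masked;
    the masked set is sampled uniformly from the remaining points.
    """
    n = len(points)
    scores = local_difference(points, k)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    preserved = set(ranked[:round(preserve_ratio * n)])
    candidates = [i for i in range(n) if i not in preserved]
    n_mask = min(round(mask_ratio * n), len(candidates))
    masked = set(random.Random(seed).sample(candidates, n_mask))
    return masked, preserved


# Toy usage: one random "scene" masked at progressively larger ratios
# (hypothetical values), always keeping the same preserved points visible.
rng = random.Random(0)
scene = [(rng.random(), rng.random(), rng.random()) for _ in range(60)]
for ratio in (0.3, 0.5, 0.7):
    masked, preserved = informative_preserved_mask(scene, ratio, seed=0)
    print(ratio, len(masked), len(preserved))
```

Because the preserved set depends only on the scores, it is identical across all masking ratios, which is what lets the differently masked views of one scene share a common set of visible anchor points.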
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | S3DIS (Area 5) | mIoU 71.9 | 799 |
| Semantic segmentation | ScanNet V2 (val) | mIoU 26.7 | 288 |
| 3D Semantic Segmentation | ScanNet V2 (val) | mIoU 72.8 | 171 |
| 3D Visual Grounding | ScanRefer (val) | -- | 155 |
| 3D Object Detection | ScanNet | mAP@0.25 63.1 | 123 |
| 3D Object Detection | SUN RGB-D | mAP@0.25 60.6 | 104 |
| Semantic segmentation | S3DIS | mIoU 1.5 | 88 |
| Semantic segmentation | ScanNet (test) | mIoU 72.8 | 59 |
| Semantic segmentation | ScanNet | mIoU 26.7 | 59 |
| 3D Object Detection | ScanNet V2 | AP50 48.9 | 54 |