Fast Point Transformer
About
The recent success of neural networks enables a better interpretation of 3D point clouds, but processing a large-scale 3D scene remains a challenging problem. Most current approaches divide a large-scale scene into small regions and combine the local predictions. However, this scheme inevitably involves additional stages for pre- and post-processing and may also degrade the final output due to predictions made from a local perspective. This paper introduces Fast Point Transformer, which consists of a new lightweight self-attention layer. Our approach encodes continuous 3D coordinates, and the voxel hashing-based architecture boosts computational efficiency. The proposed method is demonstrated with 3D semantic segmentation and 3D detection. The accuracy of our approach is competitive with the best voxel-based method, and our network achieves an inference time 129 times faster than the state-of-the-art Point Transformer, with a reasonable accuracy trade-off, in 3D semantic segmentation on the S3DIS dataset.
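The abstract names two ingredients without spelling them out: hashing continuous 3D points into voxels, and retaining each point's continuous offset from its voxel center so that a coordinate-aware self-attention layer can consume it. The sketch below illustrates both in a toy form; it is not the authors' implementation, and `voxel_hash`, `voxelize`, the prime constants, and the table size are all illustrative assumptions.

```python
# Toy sketch of voxel hashing over continuous 3D coordinates.
# NOT the Fast Point Transformer code; names and constants are assumptions.
import numpy as np

def voxel_hash(ijk: np.ndarray, table_size: int = 2**20) -> np.ndarray:
    """Spatially hash integer voxel coordinates (N, 3) to keys in [0, table_size).

    Uses the classic XOR-of-primes scheme; collisions are possible and would
    need to be resolved (e.g., by chaining) in a real hash-table backend.
    """
    primes = np.array([73856093, 19349669, 83492791], dtype=np.int64)
    return np.bitwise_xor.reduce(ijk.astype(np.int64) * primes, axis=1) % table_size

def voxelize(points: np.ndarray, voxel_size: float):
    """Quantize continuous points (N, 3) to voxels, keeping continuous offsets.

    Returns integer voxel coordinates, their hash keys, and each point's
    normalized offset from its voxel center -- the continuous positional
    signal a coordinate-aware attention layer could encode.
    """
    ijk = np.floor(points / voxel_size).astype(np.int64)  # voxel indices
    centers = (ijk + 0.5) * voxel_size                    # voxel centers
    offsets = (points - centers) / voxel_size             # in [-0.5, 0.5)
    return ijk, voxel_hash(ijk), offsets

if __name__ == "__main__":
    pts = np.random.rand(1000, 3) * 10.0  # toy 10 m x 10 m x 10 m scene
    ijk, keys, offsets = voxelize(pts, voxel_size=0.05)
    print(f"{len(np.unique(keys))} occupied voxels for {len(pts)} points")
```

The design point this illustrates: hashing gives O(1) lookup of occupied voxels without allocating a dense grid, while the stored offsets preserve the sub-voxel geometry that plain voxelization would discard.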
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | S3DIS (Area 5) | mIoU | 71 | 799 |
| Semantic segmentation | ScanNet V2 (val) | mIoU | 72.4 | 288 |
| 3D Semantic Segmentation | ScanNet V2 (val) | mIoU | 72.1 | 171 |
| 3D Semantic Segmentation | ScanNet (val) | -- | -- | 100 |
| 3D Object Detection | ScanNet (val) | mAP@0.25 | 59.1 | 66 |
| 3D Semantic Segmentation | S3DIS Area 5 (test) | mIoU (%) | 70.3 | 32 |
| 3D Semantic Segmentation | ScanNet20 v2 (val) | mIoU | 72.1 | 13 |