Voxel Field Fusion for 3D Object Detection
About
In this work, we present a conceptually simple yet effective framework for cross-modality 3D object detection, named voxel field fusion. The proposed approach aims to maintain cross-modality consistency by representing and fusing augmented image features as a ray in the voxel field. To this end, a learnable sampler is first designed to sample vital features from the image plane; these features are projected to the voxel grid in a point-to-ray manner, which keeps the feature representation consistent with its spatial context. In addition, ray-wise fusion is conducted to fuse features with the supplemental context in the constructed voxel field. We further develop a mixed augmentor to align feature-variant transformations, which bridges the modality gap in data augmentation. The proposed framework achieves consistent gains on various benchmarks and outperforms previous fusion-based methods on the KITTI and nuScenes datasets. Code is available at https://github.com/dvlab-research/VFF.
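As a rough illustration of the point-to-ray idea described above, the sketch below back-projects one image pixel into a camera ray and lists the voxels that the ray passes through. It is not taken from the VFF code base. The function name `pixel_to_ray_voxels`, the pinhole camera model, the uniform depth sampling, and the assumption that the camera frame and the voxel frame coincide (no extrinsic transform) are all illustrative assumptions.

```python
import numpy as np

def pixel_to_ray_voxels(uv, fx, fy, cx, cy, depths, grid_min, voxel_size, grid_shape):
    """Return the unique voxel indices touched by the camera ray through pixel uv.

    Hypothetical helper: assumes a pinhole camera whose frame coincides
    with the voxel grid frame, and samples the ray at the given depths.
    """
    u, v = uv
    # Back-project the pixel to a ray direction with unit depth (z = 1).
    direction = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    # Points along the ray at each sampled depth, shape (D, 3).
    points = np.asarray(depths, dtype=float)[:, None] * direction[None, :]
    # Convert metric coordinates to integer voxel indices.
    idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(np.int64)
    # Keep only the samples that fall inside the grid.
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    # Several depth samples may land in the same voxel, so deduplicate.
    return np.unique(idx[inside], axis=0)

# Usage: the ray through the principal point runs straight along the z axis.
voxels = pixel_to_ray_voxels(
    uv=(50.0, 50.0), fx=100.0, fy=100.0, cx=50.0, cy=50.0,
    depths=np.arange(0.5, 4.0, 1.0),
    grid_min=(-2.0, -2.0, 0.0), voxel_size=1.0, grid_shape=(4, 4, 4),
)
print(voxels)
```

In a fusion setting, the image feature sampled at the pixel would then be associated with every voxel on this list rather than with a single projected point, which is the sense in which a feature is represented "as a ray".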
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Object Detection | nuScenes (test) | mAP 68.4 | 829 |
| 3D Instance Segmentation | ScanNet V2 (val) | Average AP50 64.3 | 195 |
| 3D Instance Segmentation | S3DIS (Area 5) | mAP@50% IoU 59.3 | 106 |
| 3D Object Detection | KITTI (test) | 3D AP Easy 89.58 | 61 |
| 3D Object Detection | KITTI (val) | -- | 24 |
| 3D Instance Segmentation | ScanNet (test) | mAP 50.6 | 15 |