MVFusion: Multi-View 3D Object Detection with Semantic-aligned Radar and Camera Fusion
About
Multi-view radar-camera fused 3D object detection provides a longer detection range and more informative features for autonomous driving, especially under adverse weather. Existing radar-camera fusion methods offer various designs for combining radar information with camera data. However, these approaches usually rely on straightforward concatenation of the multi-modal features, which neglects semantic alignment of the radar features and sufficient cross-modal correlation. In this paper, we present MVFusion, a novel Multi-View radar-camera Fusion method that achieves semantically aligned radar features and enhances cross-modal information interaction. To this end, we inject semantic alignment into the radar features via a semantic-aligned radar encoder (SARE), producing image-guided radar features. We then propose a radar-guided fusion transformer (RGFT) that fuses the radar and image features to strengthen the correlation between the two modalities at the global scope via a cross-attention mechanism. Extensive experiments show that MVFusion achieves state-of-the-art performance (51.7% NDS and 45.3% mAP) on the nuScenes dataset. We will release our code and trained networks upon publication.
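To make the cross-attention fusion idea concrete, below is a minimal NumPy sketch of single-head cross-attention between two feature sets. It is an illustration only, not the paper's RGFT: the choice of image tokens as queries and radar tokens as keys/values, the random projection matrices standing in for learned weights, and all function and variable names are assumptions for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(img_feats, radar_feats, d_k=16, seed=0):
    """Single-head cross-attention sketch: image tokens attend to radar tokens.

    img_feats:   (N_img, C) image feature tokens
    radar_feats: (N_radar, C) radar feature tokens
    Returns fused features of shape (N_img, C).
    Random projections stand in for learned weights (illustration only).
    """
    rng = np.random.default_rng(seed)
    C = img_feats.shape[1]
    Wq = rng.standard_normal((C, d_k)) / np.sqrt(C)
    Wk = rng.standard_normal((C, d_k)) / np.sqrt(C)
    Wv = rng.standard_normal((C, C)) / np.sqrt(C)
    Q = img_feats @ Wq            # queries from image features
    K = radar_feats @ Wk          # keys from radar features
    V = radar_feats @ Wv          # values from radar features
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (N_img, N_radar) attention map
    return img_feats + attn @ V              # residual fusion of radar context

# Toy example: 6 image tokens and 4 radar tokens, 32 channels each.
img = np.random.default_rng(1).standard_normal((6, 32))
radar = np.random.default_rng(2).standard_normal((4, 32))
fused = cross_attention_fuse(img, radar)
print(fused.shape)  # (6, 32)
```

Because every image token attends to every radar token, each fused feature can draw on radar context from the whole scene rather than only co-located positions, which is the global-scope interaction the concatenation baseline lacks.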
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Object Detection | nuScenes (val) | NDS 45.5 | 941 |
| 3D Object Detection | nuScenes (test) | mAP 45.3 | 829 |
| 3D Object Detection | NuScenes v1.0 (test) | mAP 45.3 | 210 |
| 3D Object Detection | nuScenes v1.0 (val) | mAP (Overall) 42.1 | 190 |
| 3D Object Detection | nuScenes v1.0-trainval (val) | NDS 45.5 | 87 |