RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network

About

Perceiving the surrounding environment is a fundamental task in autonomous driving. To obtain highly accurate perception results, modern autonomous driving systems typically employ multi-modal sensors to collect comprehensive environmental data. Among these, the radar-camera multi-modal perception system is especially favored for its excellent sensing capabilities and cost-effectiveness. However, the substantial modality differences between radar and camera sensors make it challenging to fuse their information. To address this problem, this paper presents RCBEVDet, a radar-camera fusion 3D object detection framework. Specifically, RCBEVDet is developed from an existing camera-based 3D object detector, supplemented by a specially designed radar feature extractor, RadarBEVNet, and a Cross-Attention Multi-layer Fusion (CAMF) module. First, RadarBEVNet encodes sparse radar points into a dense bird's-eye-view (BEV) feature using a dual-stream radar backbone and a Radar Cross Section aware BEV encoder. Second, the CAMF module uses a deformable attention mechanism to align radar and camera BEV features and adopts channel and spatial fusion layers to fuse them. To further enhance RCBEVDet's capabilities, we introduce RCBEVDet++, which advances the CAMF through sparse fusion, supports query-based multi-view camera perception models, and adapts to a broader range of perception tasks. Extensive experiments on the nuScenes dataset show that our method integrates seamlessly with existing camera-based 3D perception models and improves their performance across various perception tasks. Furthermore, our method achieves state-of-the-art radar-camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi-object tracking tasks. Notably, with ViT-L as the image backbone, RCBEVDet++ achieves 72.73 NDS and 67.34 mAP in 3D object detection without test-time augmentation or model ensembling.
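
To make the "sparse radar points to dense BEV feature" step concrete, here is a minimal sketch in plain Python. It is not RadarBEVNet: the learned dual-stream backbone and the Radar Cross Section aware encoder are replaced by simple per-cell mean pooling of the RCS value. The function name `radar_points_to_bev`, the grid extent, and the cell size are all assumptions made for illustration.

```python
from collections import defaultdict


def radar_points_to_bev(points, x_range=(-51.2, 51.2), y_range=(-51.2, 51.2), cell=0.8):
    """Scatter sparse radar points into a dense BEV grid.

    Hypothetical helper, not the paper's encoder: each cell stores the
    mean RCS of the points that fall inside it; empty cells hold 0.0.

    points: iterable of (x, y, rcs) tuples in the ego frame (metres).
    Returns a list of rows indexed as grid[iy][ix].
    """
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    sums = defaultdict(float)
    counts = defaultdict(int)

    for x, y, rcs in points:
        # Drop points outside the assumed BEV extent.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        # Clamp to guard against float rounding at the upper edge.
        ix = min(int((x - x_range[0]) // cell), nx - 1)
        iy = min(int((y - y_range[0]) // cell), ny - 1)
        sums[(iy, ix)] += rcs
        counts[(iy, ix)] += 1

    # Dense output: most cells stay at 0.0 because radar returns are sparse.
    grid = [[0.0] * nx for _ in range(ny)]
    for (iy, ix), total in sums.items():
        grid[iy][ix] = total / counts[(iy, ix)]
    return grid
```

In a real pipeline each cell would carry a learned feature vector rather than a single scalar, and the resulting radar BEV map would then be aligned and fused with the camera BEV map, which is the role the abstract assigns to CAMF.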

Zhiwei Lin, Zhe Liu, Yongtao Wang, Le Zhang, Ce Zhu • 2024

Related benchmarks

Task	Dataset	Result
3D Object Detection	nuScenes (val)	NDS 60.4	981
3D Object Detection	nuScenes (test)	mAP 67.3	903
3D Object Detection	nuScenes (val)	NDS 60.4	217
3D Multi-Object Tracking	nuScenes (test)	--	139
BEV Semantic Segmentation	nuScenes (val)	Drivable Area IoU 82.7	55
BEV Segmentation	nuScenes (val)	--	16
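
For readers unfamiliar with the NDS metric reported above, the sketch below shows how the nuScenes Detection Score combines mAP with the five mean true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE): NDS = (5 * mAP + sum of (1 - min(1, error))) / 10. The function name and the numbers in the usage note are hypothetical and are not results from this paper. Inputs are fractions in [0, 1] for mAP, so a reported NDS of 60.4 corresponds to 0.604 on this scale.

```python
def nuscenes_detection_score(m_ap, tp_errors):
    """Combine mAP and the five mean TP errors into NDS.

    m_ap: mean average precision as a fraction in [0, 1].
    tp_errors: dict with the five error metrics
        mATE, mASE, mAOE, mAVE, mAAE (lower is better).
    """
    expected = {"mATE", "mASE", "mAOE", "mAVE", "mAAE"}
    if set(tp_errors) != expected:
        raise ValueError(f"tp_errors must have exactly the keys {sorted(expected)}")

    # Each error is clipped at 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]

    # mAP carries weight 5; the five TP scores carry weight 1 each.
    return (5.0 * m_ap + sum(tp_scores)) / 10.0
```

With made-up inputs `m_ap=0.5` and errors `{"mATE": 0.5, "mASE": 0.25, "mAOE": 0.5, "mAVE": 1.5, "mAAE": 0.25}`, the velocity error is clipped to 1 and contributes nothing, and the function returns 0.5.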

