
Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving

About

Accurate 3D object detection for autonomous driving requires complementary sensors. Cameras provide dense semantics but unreliable depth, while millimeter-wave radar offers precise range and velocity measurements with sparse geometry. We propose MMF-BEV, a radar-camera BEV fusion framework that leverages deformable attention for cross-modal feature alignment on the View-of-Delft (VoD) 4D radar dataset [1]. MMF-BEV combines a BEVDepth [2] camera branch and a RadarBEVNet [3] radar branch, each enhanced with Deformable Self-Attention, and fuses them via a Deformable Cross-Attention module. We evaluate three configurations: camera-only, radar-only, and hybrid fusion. A sensor contribution analysis quantifies per-distance modality weighting, providing interpretable evidence of sensor complementarity. A two-stage training strategy, which first pre-trains the camera branch with depth supervision and then jointly trains the radar and fusion modules, stabilizes learning. Experiments on VoD show that MMF-BEV consistently outperforms unimodal baselines and achieves competitive results against prior fusion methods across all object classes, in both the full annotated area and the near-range Region of Interest.
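The Deformable Cross-Attention fusion described above can be illustrated with a minimal PyTorch sketch. This is a hypothetical, single-scale, single-head simplification, not the paper's implementation: the module name, projection layers, offset scaling, and residual connection are all assumptions made for illustration. Each camera BEV query predicts a small set of sampling offsets into the radar BEV map, gathers radar features there by bilinear interpolation, and aggregates them with learned attention weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttentionFusion(nn.Module):
    """Hypothetical sketch of deformable cross-attention radar-camera BEV fusion."""

    def __init__(self, dim: int = 64, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, 2 * num_points)  # (dx, dy) per sample point
        self.weight_proj = nn.Linear(dim, num_points)      # one attention weight per point
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, cam_bev: torch.Tensor, radar_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev, radar_bev: (B, C, H, W) BEV feature maps on the same grid.
        B, C, H, W = cam_bev.shape
        queries = cam_bev.flatten(2).transpose(1, 2)  # (B, H*W, C)

        # Reference grid in normalized [-1, 1] coordinates, as grid_sample expects.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
        ref = torch.stack([xs, ys], dim=-1).reshape(1, H * W, 1, 2)

        # Each query predicts K small local offsets around its reference point.
        offsets = self.offset_proj(queries).view(B, H * W, self.num_points, 2)
        offsets = offsets.tanh() * (2.0 / max(H, W))
        sample_locs = ref + offsets  # (B, H*W, K, 2)

        # Bilinearly sample radar values at the predicted locations.
        values = self.value_proj(radar_bev)
        sampled = F.grid_sample(values, sample_locs, align_corners=True)  # (B, C, H*W, K)

        # Weighted aggregation over the K sampled points, then residual fusion.
        attn = self.weight_proj(queries).softmax(dim=-1)       # (B, H*W, K)
        fused = (sampled * attn.unsqueeze(1)).sum(dim=-1)      # (B, C, H*W)
        fused = self.out_proj(fused.transpose(1, 2))           # (B, H*W, C)
        return fused.transpose(1, 2).view(B, C, H, W) + cam_bev

fusion = DeformableCrossAttentionFusion(dim=64, num_points=4)
cam = torch.randn(2, 64, 8, 8)
radar = torch.randn(2, 64, 8, 8)
fused_bev = fusion(cam, radar)  # (2, 64, 8, 8), same grid as the inputs
```

The key design point mirrored here is that attention is sparse: each query attends to only a few sampled radar locations rather than the full BEV map, which keeps the cost linear in the number of queries.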

Mayank Mayank, Bharanidhar Duraisamy, Florian Geiß, Abhinav Valada • 2026

Related benchmarks

Task                 Dataset                                           Result             Rank
3D Object Detection  View-of-Delft (VoD) Entire Annotated Area (val)   mAP3D: 48.92       115
3D Object Detection  View-of-Delft (VoD) In Driving Corridor (val)     AP3D (Car): 72.21  81
