
Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving

About

Accurate 3D object detection for autonomous driving requires complementary sensors. Cameras provide dense semantics but unreliable depth, while millimeter-wave radar offers precise range and velocity measurements with sparse geometry. We propose MMF-BEV, a radar-camera BEV fusion framework that leverages deformable attention for cross-modal feature alignment on the View-of-Delft (VoD) 4D radar dataset [1]. MMF-BEV combines a BEVDepth [2] camera branch and a RadarBEVNet [3] radar branch, each enhanced with Deformable Self-Attention, and fuses them via a Deformable Cross-Attention module. We evaluate three configurations: camera-only, radar-only, and hybrid fusion. A sensor contribution analysis quantifies per-distance modality weighting, providing interpretable evidence of sensor complementarity. A two-stage training strategy, which first pre-trains the camera branch with depth supervision and then jointly trains the radar and fusion modules, stabilizes learning. Experiments on VoD show that MMF-BEV consistently outperforms unimodal baselines and achieves competitive results against prior fusion methods across all object classes, in both the full annotated area and the near-range Region of Interest.
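The Deformable Cross-Attention fusion described above can be illustrated with a minimal PyTorch sketch. This is a hypothetical, single-scale, single-head simplification, not the paper's implementation: the module name, projection layers, offset scaling, and residual connection are all assumptions made for illustration. Each camera BEV query predicts a small set of sampling offsets into the radar BEV map, gathers radar features there by bilinear interpolation, and aggregates them with learned attention weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttentionFusion(nn.Module):
    """Hypothetical sketch of deformable cross-attention radar-camera BEV fusion."""

    def __init__(self, dim: int = 64, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, 2 * num_points)  # (dx, dy) per sample point
        self.weight_proj = nn.Linear(dim, num_points)      # one attention weight per point
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, cam_bev: torch.Tensor, radar_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev, radar_bev: (B, C, H, W) BEV feature maps on the same grid.
        B, C, H, W = cam_bev.shape
        queries = cam_bev.flatten(2).transpose(1, 2)  # (B, H*W, C)

        # Reference grid in normalized [-1, 1] coordinates, as grid_sample expects.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
        ref = torch.stack([xs, ys], dim=-1).reshape(1, H * W, 1, 2)

        # Each query predicts K small local offsets around its reference point.
        offsets = self.offset_proj(queries).view(B, H * W, self.num_points, 2)
        offsets = offsets.tanh() * (2.0 / max(H, W))
        sample_locs = ref + offsets  # (B, H*W, K, 2)

        # Bilinearly sample radar values at the predicted locations.
        values = self.value_proj(radar_bev)
        sampled = F.grid_sample(values, sample_locs, align_corners=True)  # (B, C, H*W, K)

        # Weighted aggregation over the K sampled points, then residual fusion.
        attn = self.weight_proj(queries).softmax(dim=-1)       # (B, H*W, K)
        fused = (sampled * attn.unsqueeze(1)).sum(dim=-1)      # (B, C, H*W)
        fused = self.out_proj(fused.transpose(1, 2))           # (B, H*W, C)
        return fused.transpose(1, 2).view(B, C, H, W) + cam_bev

fusion = DeformableCrossAttentionFusion(dim=64, num_points=4)
cam = torch.randn(2, 64, 8, 8)
radar = torch.randn(2, 64, 8, 8)
fused_bev = fusion(cam, radar)  # (2, 64, 8, 8), same grid as the inputs
```

The key design point mirrored here is that attention is sparse: each query attends to only a few sampled radar locations rather than the full BEV map, which keeps the cost linear in the number of queries.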

Mayank Mayank, Bharanidhar Duraisamy, Florian Geiß, Abhinav Valada • 2026

Related benchmarks

Task                 Dataset                                           Result             Rank
3D Object Detection  View-of-Delft (VoD) Entire Annotated Area (val)   mAP3D: 48.92       115
3D Object Detection  View-of-Delft (VoD) In Driving Corridor (val)     AP3D (Car): 72.21  81
