MambaFusion: Adaptive State-Space Fusion for Multimodal 3D Object Detection
About
Reliable 3D object detection is fundamental to autonomous driving, yet fusing camera and LiDAR data into a single multimodal representation remains a persistent challenge. Cameras provide dense visual cues but ill-posed depth; LiDAR provides precise 3D structure but sparse coverage. Existing BEV-based fusion frameworks have made solid progress, but they suffer from inefficient context modeling, spatially invariant fusion, and limited reasoning under uncertainty. We introduce MambaFusion, a unified multimodal detection framework for efficient, adaptive, and physically grounded 3D perception. MambaFusion interleaves selective state-space models (SSMs) with windowed transformers to propagate global context in linear time while preserving local geometric fidelity. A multi-modal token alignment (MTA) module and reliability-aware fusion gates dynamically re-weight camera and LiDAR features based on spatial confidence and calibration consistency. Finally, a structure-conditioned diffusion head integrates graph-based reasoning with uncertainty-aware denoising, enforcing physical plausibility and calibrated confidence. MambaFusion establishes new state-of-the-art performance on nuScenes benchmarks while operating with linear-time complexity, demonstrating that coupling SSM-based efficiency with reliability-driven fusion yields robust, temporally stable, and interpretable 3D perception for real-world autonomous driving.
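The abstract mentions reliability-aware fusion gates that re-weight camera and LiDAR BEV features per spatial location, but no reference implementation is shown here. Below is only a minimal PyTorch sketch of that gating idea; the class name `ReliabilityFusionGate`, the single-convolution confidence heads, and all tensor shapes are hypothetical illustrations, not the authors' actual module.

```python
import torch
import torch.nn as nn

class ReliabilityFusionGate(nn.Module):
    """Hypothetical sketch of reliability-aware camera-LiDAR fusion.

    Predicts a spatial confidence map for each modality's BEV features
    and blends them with a per-cell softmax gate, so fusion weights
    vary across the BEV grid instead of being globally fixed.
    """

    def __init__(self, channels: int):
        super().__init__()
        # One lightweight confidence head per modality (assumed design).
        self.cam_conf = nn.Conv2d(channels, 1, kernel_size=1)
        self.lidar_conf = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        # Predict one confidence logit per BEV cell for each modality.
        logits = torch.cat([self.cam_conf(cam_bev), self.lidar_conf(lidar_bev)], dim=1)
        weights = torch.softmax(logits, dim=1)  # (B, 2, H, W)
        # Spatially adaptive blend: cells with unreliable camera depth
        # lean on LiDAR, and vice versa in sparse LiDAR regions.
        return weights[:, 0:1] * cam_bev + weights[:, 1:2] * lidar_bev

# Usage on dummy BEV grids (batch 2, 128 channels, 180x180 cells):
gate = ReliabilityFusionGate(channels=128)
fused = gate(torch.randn(2, 128, 180, 180), torch.randn(2, 128, 180, 180))
print(fused.shape)  # torch.Size([2, 128, 180, 180])
```

Because the gate applies a softmax over the two modality logits, the per-cell weights sum to one, keeping the fused map on the same scale as either input.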
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 3D Object Detection | nuScenes (val) | NDS | 77.9 | 941 |
| 3D Object Detection | nuScenes v1.0 (test) | mAP | 74.7 | 210 |
| 3D Object Detection | Argoverse 2 (val) | mAP | 51.2 | 62 |
| 3D Object Detection | nuScenes-C (val) | mAP (Clean) | 74.9 | 9 |