Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

About

In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. It achieves 74.1% NDS (state of the art for a single model) on the nuScenes test set while maintaining fast inference speed. Moreover, CMT remains strongly robust even if the LiDAR is missing. Code is released at https://github.com/junjie18/CMT.
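The alignment idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the released code for that). It assumes each token, whether from the image or the point cloud, comes with one associated 3D point, and that a shared position encoding of that point is added to the token feature before the two token sets are concatenated. The names `encode_points` and `fuse_tokens`, the sinusoidal encoding, and the random projection standing in for a learned layer are all hypothetical choices made for illustration.

```python
import numpy as np


def encode_points(points, num_freqs=4):
    """Sinusoidal encoding of (N, 3) coordinates -> (N, 3 * 2 * num_freqs).

    Hypothetical stand-in for a learned coordinate encoder.
    """
    freqs = 2.0 ** np.arange(num_freqs)               # (num_freqs,)
    angles = points[:, :, None] * freqs               # (N, 3, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(points.shape[0], -1)


def fuse_tokens(img_feats, img_points, pc_feats, pc_points, num_freqs=4, seed=0):
    """Add a shared 3D position encoding to both token sets, then concatenate.

    img_feats: (Ni, C) image token features; img_points: (Ni, 3) associated 3D points.
    pc_feats:  (Np, C) point cloud token features; pc_points: (Np, 3) associated 3D points.
    """
    rng = np.random.default_rng(seed)
    enc_dim = 3 * 2 * num_freqs
    channels = img_feats.shape[1]
    # Assumption: one projection shared by both modalities (random here, learned in practice).
    proj = rng.standard_normal((enc_dim, channels)) / np.sqrt(enc_dim)
    img_tokens = img_feats + encode_points(img_points, num_freqs) @ proj
    pc_tokens = pc_feats + encode_points(pc_points, num_freqs) @ proj
    # A transformer decoder could attend over this joint sequence without a view transform.
    return np.concatenate([img_tokens, pc_tokens], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    tokens = fuse_tokens(
        rng.standard_normal((4, 16)), rng.uniform(-1, 1, (4, 3)),
        rng.standard_normal((6, 16)), rng.uniform(-1, 1, (6, 3)),
    )
    print(tokens.shape)
```

Because both modalities share the same encoder in this sketch, two tokens tied to the same 3D location receive the same positional term regardless of which sensor produced them, which is the sense in which the tokens are spatially aligned here.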

Junjie Yan, Yingfei Liu, Jianjian Sun, Fan Jia, Shuailin Li, Tiancai Wang, Xiangyu Zhang • 2023

Related benchmarks

Task	Dataset	Metric	Value	
3D Object Detection	nuScenes (val)	NDS	72.9	981
3D Object Detection	nuScenes (test)	mAP	72	903
3D Object Detection	nuScenes v1.0 (test)	mAP	72	230
3D Object Detection	nuScenes (val)	NDS	46	217
3D Object Detection	nuScenes v1.0 (val)	mAP (Overall)	70.3	207
3D Object Detection	nuScenes v1.0-trainval (val)	NDS	72.9	182
3D Object Detection	Argoverse 2 (val)	mAP	36.1	101
3D Object Detection	nuScenes LiDAR Beamsreduce	NDS	60.1	41
3D Object Detection	nuScenes Night (val)	mAP	42.8	26
3D Object Detection	nuScenes LiDAR Motionblur	NDS	63.93	24
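
For readers unfamiliar with the NDS metric in the table, the sketch below shows how the nuScenes detection score combines mAP with the five mean true-positive errors (translation, scale, orientation, velocity, attribute). The function name `nds` and the input values are illustrative assumptions and are not CMT's reported error breakdown.

```python
def nds(map_score, tp_errors):
    """nuScenes detection score from mAP and the five mean TP errors.

    map_score: mAP in [0, 1].
    tp_errors: iterable of five values (mATE, mASE, mAOE, mAVE, mAAE).
    """
    tp_errors = list(tp_errors)
    if len(tp_errors) != 5:
        raise ValueError("expected five TP error values")
    # Each error is clipped to 1 and converted to a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    # mAP carries weight 5 and each TP score weight 1, normalized by 10.
    return (5.0 * map_score + sum(tp_scores)) / 10.0


if __name__ == "__main__":
    # Illustrative inputs only.
    print(nds(0.5, [0.5, 0.5, 0.5, 0.5, 0.5]))
```

NDS therefore rewards both detection precision and the quality of the predicted box attributes, which is why a model's NDS and mAP values in the table differ.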

