MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer
About
Monocular 3D object detection is an important yet challenging task in autonomous driving. Some existing methods leverage depth information from an off-the-shelf depth estimator to assist 3D detection, but suffer from the additional computational burden and achieve limited performance caused by inaccurate depth priors. To alleviate this, we propose MonoDTR, a novel end-to-end depth-aware transformer network for monocular 3D object detection. It mainly consists of two components: (1) the Depth-Aware Feature Enhancement (DFE) module that implicitly learns depth-aware features with auxiliary supervision without requiring extra computation, and (2) the Depth-Aware Transformer (DTR) module that globally integrates context- and depth-aware features. Moreover, different from conventional pixel-wise positional encodings, we introduce a novel depth positional encoding (DPE) to inject depth positional hints into transformers. Our proposed depth-aware modules can be easily plugged into existing image-only monocular 3D object detectors to improve the performance. Extensive experiments on the KITTI dataset demonstrate that our approach outperforms previous state-of-the-art monocular-based methods and achieves real-time detection. Code is available at https://github.com/kuanchihhuang/MonoDTR
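To make the depth positional encoding idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (see the linked repository for that). It assumes that each pixel gets a discrete depth bin, that each bin has its own embedding vector, and that this vector is added to the pixel's feature before the transformer. The uniform binning, the depth range, and the names `depth_to_bin`, `make_embedding_table`, and `add_depth_positional_encoding` are all hypothetical choices made for illustration.

```python
import random


def depth_to_bin(depth, d_min=2.0, d_max=60.0, num_bins=8):
    """Map a metric depth to a bin index (assumed uniform discretization)."""
    depth = min(max(depth, d_min), d_max)  # clamp into the assumed range
    idx = int((depth - d_min) / (d_max - d_min) * num_bins)
    return min(idx, num_bins - 1)  # d_max itself falls into the last bin


def make_embedding_table(num_bins, channels, seed=0):
    """One vector per depth bin; random here, learned in a real model."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(channels)]
            for _ in range(num_bins)]


def add_depth_positional_encoding(features, depth_bins, table):
    """Add the embedding of each pixel's depth bin to its feature vector.

    features:   H x W x C nested lists
    depth_bins: H x W nested lists of bin indices
    """
    return [
        [[f + e for f, e in zip(feat, table[b])]
         for feat, b in zip(feat_row, bin_row)]
        for feat_row, bin_row in zip(features, depth_bins)
    ]


# Toy 2 x 3 depth map (metres) and all-zero 4-channel features.
depth_map = [[5.0, 5.0, 40.0],
             [5.0, 12.0, 40.0]]
depth_bins = [[depth_to_bin(d) for d in row] for row in depth_map]
features = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
table = make_embedding_table(num_bins=8, channels=4)
encoded = add_depth_positional_encoding(features, depth_bins, table)
```

In this sketch, two pixels at the same depth receive the same encoding regardless of where they sit in the image, which is the contrast with pixel-wise positional encodings drawn in the abstract.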
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 3D Object Detection | KITTI car (test) | AP3D (Easy) | 21.99 | 195 |
| 3D Object Detection | KITTI Pedestrian (test) | AP3D (Easy) | 15.33 | 63 |
| 3D Object Detection | KITTI car (val) | AP 3D Easy | 24.52 | 62 |
| 3D Object Detection | KITTI (test) | -- | -- | 60 |
| Bird's Eye View Object Detection (Car) | KITTI (test) | APBEV (Easy) @IoU=0.7 | 28.59 | 59 |
| Bird's eye view object detection | KITTI (test) | APBEV@0.7 (Easy) | 28.59 | 53 |
| 3D Object Detection | KITTI Cyclist (test) | AP3D Easy | 5.05 | 49 |
| 3D Object Detection | KITTI official (test) | 3D AP (Easy) | 21.99 | 43 |
| 3D Object Detection | KITTI (test) | 3D AP (Easy) | 21.99 | 43 |
| Monocular 3D Object Detection | KITTI (test) | AP3D R40 (Mod.) | 15.39 | 38 |