Leveraging Transformer Decoder for Automotive Radar Object Detection
About
In this paper, we present a Transformer-based architecture for 3D radar object detection that uses a novel Transformer Decoder as the prediction head to directly regress 3D bounding boxes and class scores from radar feature representations. To bridge multi-scale radar features and the decoder, we propose Pyramid Token Fusion (PTF), a lightweight module that converts a feature pyramid into a unified, scale-aware token sequence. By formulating detection as a set prediction problem with learnable object queries and positional encodings, our design models long-range spatio-temporal correlations and cross-feature interactions. This approach eliminates dense proposal generation and heuristic post-processing such as extensive non-maximum suppression (NMS) tuning. We evaluate the proposed framework on the RADDet dataset, where it achieves significant improvements over state-of-the-art radar-only baselines.
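The abstract describes PTF as converting a multi-scale feature pyramid into a single scale-aware token sequence for the decoder. The sketch below illustrates that idea only in outline: each pyramid level is projected to a common channel width, flattened into tokens, and tagged with a per-level embedding before concatenation. The function name, shapes, and the random stand-ins for learned projections are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def pyramid_token_fusion(feature_maps, d_model=32, seed=0):
    """Flatten a multi-scale feature pyramid into one token sequence.

    Each level of shape (C, H, W) is projected to d_model channels,
    flattened to H*W tokens, and offset by a per-level scale embedding
    so the decoder can distinguish levels. The random projections here
    stand in for learned 1x1 convolutions.
    """
    rng = np.random.default_rng(seed)
    tokens = []
    for fmap in feature_maps:
        c, h, w = fmap.shape
        proj = rng.standard_normal((c, d_model)) / np.sqrt(c)  # 1x1-conv stand-in
        flat = fmap.reshape(c, h * w).T @ proj                 # (H*W, d_model)
        scale_emb = rng.standard_normal(d_model) * 0.02        # per-level embedding
        tokens.append(flat + scale_emb)
    # Unified token sequence: (sum of H*W over all levels, d_model)
    return np.concatenate(tokens, axis=0)

# Hypothetical three-level pyramid from a radar backbone
pyramid = [np.ones((64, 32, 32)), np.ones((128, 16, 16)), np.ones((256, 8, 8))]
seq = pyramid_token_fusion(pyramid)
print(seq.shape)  # (32*32 + 16*16 + 8*8, 32) = (1344, 32)
```

A decoder's learnable object queries would then cross-attend over `seq`, which is what lets one prediction head consume all scales jointly.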
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 2D Object Detection | RADDet Range-Doppler map | AP@0.5: 55.91 | 7 |
| 3D Object Detection | RADDet (test) | AP@0.4: 53.75 | 7 |
| 2D Object Detection | RADDet Range-Azimuth map | AP@0.5: 0.5538 | 7 |