PillarDETR: YOLO-Backbone and RT-DETR Head for Real-Time 3D Object Detection

About

Real-time 3D object detection is a critical component for the safe operation of autonomous driving systems and robotics. While LiDAR point clouds provide accurate spatial information, processing them efficiently remains a significant challenge. Traditional methods rely on complex 3D convolutions or anchor-based paradigms that struggle to balance detection accuracy with inference speed. In this paper, we propose PillarDETR, a novel end-to-end 3D object detection architecture that combines the efficiency of pillar-based LiDAR encoding with the representational power of modern 2D vision models. Specifically, PillarDETR replaces standard convolutional backbones with a Cross Stage Partial (CSP) network derived from YOLOv8, enabling richer feature extraction from pseudo-images. Furthermore, we discard conventional anchor-based or center-based detection heads in favor of a Real-Time Detection Transformer (RT-DETR) decoder. This hybrid design allows the network to capture global context and directly predict 3D bounding boxes without relying on non-maximum suppression (NMS). Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that PillarDETR achieves a compelling trade-off between mean Average Precision (mAP) and inference latency. Our ablation studies confirm that integrating the YOLOv8 backbone and RT-DETR head yields substantial improvements over the PointPillars baseline, establishing PillarDETR as a highly effective solution for real-time 3D perception.
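As a rough illustration of the pillar-based encoding mentioned in the abstract, the following minimal Python sketch groups LiDAR points into vertical pillars on a bird's-eye-view grid and scatters per-pillar features into a 2D pseudo-image. The function names, grid ranges, and the two hand-crafted features (point count and maximum height) are assumptions made for illustration only. A PointPillars-style model would use a learned per-pillar encoder instead, and nothing here is taken from the PillarDETR implementation.

```python
from collections import defaultdict


def pillarize(points, x_range=(0.0, 4.0), y_range=(0.0, 4.0), pillar_size=1.0):
    """Group (x, y, z) points into vertical pillars on a bird's-eye-view grid.

    Hypothetical helper: ranges and pillar size are illustrative defaults.
    Returns a dict mapping (row, col) -> list of points, plus the grid shape.
    """
    nx = int(round((x_range[1] - x_range[0]) / pillar_size))
    ny = int(round((y_range[1] - y_range[0]) / pillar_size))
    pillars = defaultdict(list)
    for x, y, z in points:
        # Drop points that fall outside the detection range.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        # Clamp to guard against floating-point rounding at the upper edge.
        ix = min(int((x - x_range[0]) / pillar_size), nx - 1)
        iy = min(int((y - y_range[0]) / pillar_size), ny - 1)
        pillars[(iy, ix)].append((x, y, z))
    return pillars, (ny, nx)


def scatter_to_pseudo_image(pillars, grid_shape):
    """Scatter per-pillar features back onto a dense H x W x C grid.

    Assumption: two hand-crafted channels (point count, max z) stand in
    for the learned pillar features a real network would produce.
    """
    ny, nx = grid_shape
    image = [[[0.0, 0.0] for _ in range(nx)] for _ in range(ny)]
    for (iy, ix), pts in pillars.items():
        image[iy][ix] = [float(len(pts)), max(p[2] for p in pts)]
    return image
```

Calling `pillarize` on a list of `(x, y, z)` tuples and passing the result to `scatter_to_pseudo_image` yields a nested list indexed as `image[row][col][channel]`. Empty pillars stay at zero, which is what lets a 2D backbone treat the output like an ordinary image.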

Smit Kadvani, Shriya Gumber, Kriti Faujdar, Harsh Dave • 2026

Related benchmarks

Task	Dataset	Result
3D Object Detection	nuScenes (val)	NDS 56.2	249
3D Object Detection	Waymo Open Dataset (val)	3D APH Vehicle L2 55.8	219
3D Object Detection	SUN RGB-D	mAP@0.25 64.8	107
3D Object Detection	KITTI (val)	Pedestrian AP (Easy) 55.9	24

