Point Virtual Transformer
About
LiDAR-based 3D object detectors often struggle to detect far-field objects due to the sparsity of point clouds at long ranges, which limits the availability of reliable geometric cues. To address this, prior approaches augment LiDAR data with depth-completed virtual points derived from RGB images; however, directly incorporating all virtual points leads to increased computational cost and introduces challenges in effectively fusing real and virtual information. We present Point Virtual Transformer (PointViT), a transformer-based 3D object detection framework that jointly reasons over raw LiDAR points and selectively sampled virtual points. The framework examines multiple fusion strategies, ranging from early point-level fusion to BEV-based gated fusion, and analyses their trade-offs in terms of accuracy and efficiency. The fused point cloud is voxelized and encoded using sparse convolutions to form a BEV representation, from which a compact set of high-confidence object queries is initialised and refined through a transformer-based context aggregation module. Experiments on the KITTI benchmark report 91.16% 3D AP, 95.94% BEV AP, and 99.36% AP on the KITTI 2D detection benchmark for the Car class.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| 3D Object Detection | KITTI car (test) | AP3D (Easy)91.16 | 195 | |
| Bird's Eye View Detection | KITTI Car class official (test) | AP (Easy)95.94 | 62 | |
| Object Detection | KITTI (test) | AP Overall Easy99.36 | 35 |