Point Virtual Transformer

About

LiDAR-based 3D object detectors often struggle to detect far-field objects because point clouds become sparse at long range, which limits the availability of reliable geometric cues. To address this, prior approaches augment LiDAR data with depth-completed virtual points derived from RGB images; however, directly incorporating all virtual points increases computational cost and makes it harder to fuse real and virtual information effectively. We present Point Virtual Transformer (PointViT), a transformer-based 3D object detection framework that jointly reasons over raw LiDAR points and selectively sampled virtual points. The framework examines multiple fusion strategies, ranging from early point-level fusion to BEV-based gated fusion, and analyses their trade-offs in terms of accuracy and efficiency. The fused point cloud is voxelized and encoded using sparse convolutions to form a BEV representation, from which a compact set of high-confidence object queries is initialised and refined through a transformer-based context aggregation module. On the KITTI test set, PointViT reaches 91.16% 3D AP, 95.94% BEV AP, and 99.36% 2D detection AP for the Car class at the Easy difficulty level.
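
The abstract names three steps that a small example can make concrete: selective sampling of virtual points, BEV-based gated fusion, and initialising object queries from high-confidence BEV cells. The NumPy sketch below is illustrative only and is not the authors' implementation. Every function name, the uniform random sampling rule, the sigmoid gate over concatenated channels, and the top-k selection over a score heatmap are assumptions made for this example.

```python
import numpy as np


def sample_virtual_points(virtual_xyz, keep_ratio, rng):
    """Keep a random fraction of virtual points.

    Assumption: uniform random sampling stands in for whatever
    selection rule the paper actually uses.
    """
    n_keep = max(1, int(round(keep_ratio * len(virtual_xyz))))
    idx = rng.choice(len(virtual_xyz), size=n_keep, replace=False)
    return virtual_xyz[idx]


def gated_bev_fusion(bev_lidar, bev_virtual, weight, bias):
    """Blend two (C, H, W) BEV feature maps with a learned gate.

    Assumption: the gate is a 1x1 linear map over the concatenated
    channels followed by a sigmoid; weight has shape (C, 2C).
    """
    stacked = np.concatenate([bev_lidar, bev_virtual], axis=0)  # (2C, H, W)
    logits = np.einsum("oc,chw->ohw", weight, stacked) + bias[:, None, None]
    gate = 1.0 / (1.0 + np.exp(-logits))  # values in (0, 1)
    return gate * bev_lidar + (1.0 - gate) * bev_virtual


def top_k_queries(heatmap, k):
    """Return (row, col) indices of the k highest-scoring BEV cells."""
    flat = heatmap.ravel()
    top = np.argsort(flat)[::-1][:k]  # indices sorted by descending score
    rows, cols = np.unravel_index(top, heatmap.shape)
    return np.stack([rows, cols], axis=1)


# Small usage example with random stand-in data.
rng = np.random.default_rng(0)
virtual = rng.normal(size=(1000, 3))
kept = sample_virtual_points(virtual, keep_ratio=0.1, rng=rng)

C, H, W = 4, 8, 8
bev_lidar = rng.normal(size=(C, H, W))
bev_virtual = rng.normal(size=(C, H, W))
weight = rng.normal(size=(C, 2 * C)) * 0.1
bias = np.zeros(C)
fused = gated_bev_fusion(bev_lidar, bev_virtual, weight, bias)

# A per-cell score would normally come from a detection head;
# here the channel-wise maximum is used as a placeholder.
queries = top_k_queries(fused.max(axis=0), k=5)
```

With a gate of this form, cells where the LiDAR features are reliable can keep them largely unchanged, while sparse far-field cells can lean on the virtual-point features. Selecting only the top-k cells keeps the number of queries passed to the transformer small.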

Veerain Sood, Bnalin, Gaurav Pandey • 2026

Related benchmarks

Task	Dataset	Metric	Result	
3D Object Detection	KITTI car (test)	AP3D (Easy)	91.16	226
Bird's Eye View Detection	KITTI Car class official (test)	AP (Easy)	95.94	62
Object Detection	KITTI (test)	AP Overall Easy	99.36	35
