Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

About

Modern methods for vision-centric autonomous driving perception widely adopt the bird's-eye-view (BEV) representation to describe a 3D scene. Although BEV is more efficient than a voxel representation, a single plane has difficulty describing the fine-grained 3D structure of a scene. To address this, we propose a tri-perspective view (TPV) representation, which accompanies BEV with two additional perpendicular planes. We model each point in 3D space by summing its projected features on the three planes. To lift image features to the 3D TPV space, we further propose a transformer-based TPV encoder (TPVFormer) to obtain the TPV features effectively. We employ the attention mechanism to aggregate the image features corresponding to each query in each TPV plane. Experiments show that our model, trained with sparse supervision, effectively predicts the semantic occupancy for all voxels. We demonstrate for the first time that using only camera inputs can achieve performance comparable to that of LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code: https://github.com/wzzheng/TPVFormer.

Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, Jiwen Lu • 2023
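
The abstract's statement that each 3D point is modeled by summing its projected features on the three planes can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the array layouts, and the use of nearest-cell lookup (in place of any interpolation the released code may use) are assumptions made for brevity.

```python
import numpy as np


def tpv_point_features(plane_xy, plane_yz, plane_zx, points):
    """Sum the features a 3D point projects onto in three perpendicular planes.

    Hypothetical layout (an assumption, not taken from the paper):
      plane_xy: (X, Y, C) top-down (BEV-like) plane
      plane_yz: (Y, Z, C) first additional perpendicular plane
      plane_zx: (Z, X, C) second additional perpendicular plane
      points:   (N, 3) float coordinates in grid units
    Returns an (N, C) array of per-point features.
    """
    X, Y, _ = plane_xy.shape
    Z = plane_yz.shape[1]

    # Nearest-cell lookup: floor each coordinate and clamp it to the grid.
    idx = np.floor(points).astype(int)
    ix = np.clip(idx[:, 0], 0, X - 1)
    iy = np.clip(idx[:, 1], 0, Y - 1)
    iz = np.clip(idx[:, 2], 0, Z - 1)

    # Projecting onto a plane drops the axis perpendicular to that plane;
    # the point feature is the sum of the three looked-up plane features.
    return plane_xy[ix, iy] + plane_yz[iy, iz] + plane_zx[iz, ix]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, Y, Z, C = 8, 8, 4, 16
    planes = (
        rng.normal(size=(X, Y, C)),
        rng.normal(size=(Y, Z, C)),
        rng.normal(size=(Z, X, C)),
    )
    pts = rng.uniform(0, [X, Y, Z], size=(5, 3))
    print(tpv_point_features(*planes, pts).shape)
```

In this layout the three planes store X·Y + Y·Z + Z·X feature cells, whereas a dense voxel grid stores X·Y·Z, which is the kind of saving the abstract attributes to plane-based representations.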

Related benchmarks

Task	Dataset	Metric	Result	
3D Object Detection	nuScenes (val)	NDS	33.5	217
3D Occupancy Prediction	Occ3D-nuScenes (val)	mIoU	2.83e+3	213
3D Object Detection	nuScenes v1.0 (val)	mAP (Overall)	31	207
LiDAR Semantic Segmentation	nuScenes official (test)	mIoU	69.4	196
Semantic Scene Completion	SemanticKITTI (val)	mIoU (Mean IoU)	11.36	84
Semantic Scene Completion	SemanticKITTI (test)	SSC mIoU	11.26	67
3D Semantic Occupancy Prediction	SurroundOcc-nuScenes (val)	mIoU	17.1	59
Semantic Scene Completion	SemanticKITTI official (test)	mIoU	11.26	50
Semantic Occupancy Prediction	SemanticKITTI (test)	mIoU	34.25	47
Semantic Scene Completion	SSCBench-KITTI-360 (test)	IoU	40.22	43

Showing 10 of 38 rows
