DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets

About

Designing an efficient yet deployment-friendly 3D backbone to handle sparse point clouds is a fundamental problem in 3D perception. Compared with customized sparse convolution, the attention mechanism in Transformers is better suited to flexibly modeling long-range relationships and is easier to deploy in real-world applications. However, due to the sparse characteristics of point clouds, it is non-trivial to apply a standard Transformer to sparse points. In this paper, we present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception. To process sparse points efficiently in parallel, we propose Dynamic Sparse Window Attention, which partitions a series of local regions in each window according to its sparsity and then computes the features of all regions in a fully parallel manner. To allow cross-set connections, we design a rotated set partitioning strategy that alternates between two partitioning configurations in consecutive self-attention layers. To support effective downsampling and better encode geometric information, we also propose an attention-style 3D pooling module on sparse points, which is powerful and deployment-friendly without using any customized CUDA operations. Our model achieves state-of-the-art performance on a broad range of 3D perception tasks. More importantly, DSVT can be easily deployed with TensorRT at real-time inference speed (27 Hz). Code will be available at https://github.com/Haiyang-W/DSVT.
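The set partitioning idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `partition_window`, the 2D `(x, y)` voxel coordinates, and the even-split rule are assumptions made for illustration. It groups the non-empty voxels of one window into sets of at most `set_size` voxels, so the number of sets depends on how sparse the window is, and it orders voxels x-major in one layer and y-major in the next to mimic the two alternating configurations.

```python
def partition_window(coords, set_size, major_axis):
    """Split the non-empty voxels of one window into sets (illustrative only).

    coords     : list of (x, y) integer voxel coordinates inside the window
    set_size   : maximum number of voxels allowed in one set
    major_axis : 0 to sort x-major, 1 to sort y-major
    Returns a list of sets, each a list of indices into coords.
    """
    n = len(coords)
    if n == 0:
        return []
    # The number of sets grows with the number of non-empty voxels.
    num_sets = -(-n // set_size)  # ceiling division
    minor_axis = 1 - major_axis
    # Order voxels along the major axis, breaking ties with the minor axis.
    order = sorted(
        range(n),
        key=lambda i: (coords[i][major_axis], coords[i][minor_axis]),
    )
    # Spread voxels as evenly as possible so no set exceeds set_size.
    base, extra = divmod(n, num_sets)
    sets, start = [], 0
    for s in range(num_sets):
        size = base + (1 if s < extra else 0)
        sets.append(order[start:start + size])
        start += size
    return sets


# Five non-empty voxels in one window, at most three voxels per set.
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)]
x_sets = partition_window(coords, set_size=3, major_axis=0)
y_sets = partition_window(coords, set_size=3, major_axis=1)
print(x_sets)  # [[0, 1, 2], [3, 4]]
print(y_sets)  # [[0, 2, 4], [1, 3]]
```

In this toy window, voxels 0 and 4 fall into different sets under the x-major ordering but share a set under the y-major ordering, which is the kind of cross-set connection that alternating the two configurations is meant to provide.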

Haiyang Wang, Chen Shi, Shaoshuai Shi, Meng Lei, Sen Wang, Di He, Bernt Schiele, Liwei Wang • 2023

Related benchmarks

Task	Dataset	Result
3D Object Detection	nuScenes (val)	NDS 71.1	981
3D Object Detection	nuScenes (test)	mAP 68.4	903
3D Object Detection	Waymo Open Dataset (val)	3D APH Vehicle L2 71	219
3D Object Detection	nuScenes (val)	NDS 71.1	217
3D Object Detection	nuScenes v1.0-trainval (val)	NDS 71.1	182
3D Object Detection	ONCE (val)	Overall mAP 62.7	63
3D Object Detection	Waymo Open Dataset LEVEL_2 (val)	3D AP (Overall) 74	60
3D Object Detection	Waymo Open Dataset LEVEL_1 (val)	--	60
3D Object Detection	KITTI (val)	mAP3D - Car (Easy) 87.78	45
3D Object Detection	Waymo (val)	--	38

Showing 10 of 19 rows
