SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition
About
Point cloud-based large-scale place recognition is fundamental to many applications such as Simultaneous Localization and Mapping (SLAM). Although many models have achieved good performance by learning short-range local features, long-range contextual properties have often been neglected, and model size has become a bottleneck for wide deployment. To overcome these challenges, we propose a super lightweight network model, termed SVT-Net, for large-scale place recognition. Specifically, on top of the highly efficient 3D Sparse Convolution (SP-Conv), an Atom-based Sparse Voxel Transformer (ASVT) and a Cluster-based Sparse Voxel Transformer (CSVT) are proposed to learn both short-range local features and long-range contextual features. Combining ASVT and CSVT, SVT-Net achieves state-of-the-art accuracy and speed on benchmark datasets with a super-light model size (0.9M parameters). Two simplified versions of SVT-Net are also introduced; they likewise achieve state-of-the-art performance while further reducing the model size to 0.8M and 0.4M respectively.
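The core idea above is that self-attention is applied only over the occupied voxels of a sparse grid, so long-range context can be captured without materialising empty space. The following is a minimal NumPy sketch of generic scaled dot-product self-attention over sparse voxel features; it is an illustration of the mechanism, not the paper's exact ASVT/CSVT design, and all weights and dimensions here are made-up placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_voxel_self_attention(feats, d_k=8, seed=0):
    """Scaled dot-product self-attention over N occupied voxels.

    feats: (N, C) features of occupied voxels only -- empty voxels are
    never materialised. Projection weights are random placeholders here;
    a real model (e.g. ASVT/CSVT) would learn them.
    """
    rng = np.random.default_rng(seed)
    N, C = feats.shape
    Wq, Wk, Wv = (rng.standard_normal((C, d_k)) * C ** -0.5 for _ in range(3))
    q, k, v = feats @ Wq, feats @ Wk, feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k))  # (N, N): each voxel attends to all others
    return attn @ v                          # (N, d_k) context-enriched features

# 50 occupied voxels with 16-dim local features (e.g. from a sparse conv backbone)
local = np.random.default_rng(1).standard_normal((50, 16))
ctx = sparse_voxel_self_attention(local)
print(ctx.shape)  # (50, 8)
```

Because attention runs only over the N occupied voxels rather than the full dense grid, its cost scales with scene occupancy, which is what makes transformer-style context aggregation affordable on large outdoor point clouds.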
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Place Recognition | Oxford RobotCar | Avg Recall @ 1% | 98 | 43 |
| Place Recognition | Oxford | AR@1% | 98.4 | 42 |
| Place Recognition | R.A. | AR@1 (%) | 99.5 | 40 |
| Place Recognition | B.D. | AR@1% | 97.2 | 40 |
| Place Recognition | University Sectors (U.S.) | Recall@1% | 99.9 | 30 |
| Place Recognition | U.S. | AR@1% | 99.9 | 20 |
| Place Recognition | Residential Area (R.A.) | Avg Recall @ 1% | 92.7 | 10 |
| Place Recognition | Business District (B.D.) | Recall@1% | 90.7 | 10 |
| Place Recognition | Oxford (test) | Recall@1% | 98.6 | 10 |
| Place Recognition | U.S. University Sector (test) | Avg Recall @ 1% | 99.9 | 10 |
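The recall metrics reported above (AR@1%, Recall@1%) count a query as correctly localised when at least one of its top-k retrieved database entries is a true match, with k set to 1% of the database size. A minimal sketch of that metric, assuming precomputed global descriptors and ground-truth match indices (all names and data below are illustrative):

```python
import numpy as np

def recall_at_percent(query_desc, db_desc, gt_matches, percent=1.0):
    """Percentage of queries with >= 1 true match among the top-k nearest
    database descriptors, where k = ceil(percent/100 * database size)."""
    k = max(1, int(np.ceil(percent / 100.0 * len(db_desc))))
    hits = 0
    for q, gt in zip(query_desc, gt_matches):
        d = np.linalg.norm(db_desc - q, axis=1)  # Euclidean distance to every entry
        topk = np.argsort(d)[:k]                  # indices of the k nearest entries
        hits += bool(set(topk.tolist()) & set(gt))
    return 100.0 * hits / len(query_desc)

# Toy example: 4 queries against a 200-entry database, so 1% -> k = 2.
rng = np.random.default_rng(0)
db = rng.standard_normal((200, 32))
queries = db[[3, 50, 120, 7]] + 0.01 * rng.standard_normal((4, 32))
gt = [[3], [50], [120], [999]]  # the last query's true match is absent
print(recall_at_percent(queries, db, gt))  # 75.0
```

Averaging this percentage over all evaluation query sets yields the Avg Recall @ 1% figures shown in the table.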