SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition
About
Point cloud-based large-scale place recognition is fundamental to many applications such as Simultaneous Localization and Mapping (SLAM). Although many models have achieved good performance by learning short-range local features, long-range contextual properties have often been neglected, and model size has become a bottleneck for wide deployment. To overcome these challenges, we propose a super lightweight network model, termed SVT-Net, for large-scale place recognition. Specifically, on top of the highly efficient 3D Sparse Convolution (SP-Conv), an Atom-based Sparse Voxel Transformer (ASVT) and a Cluster-based Sparse Voxel Transformer (CSVT) are proposed to learn both short-range local features and long-range contextual features. Combining ASVT and CSVT, SVT-Net achieves state-of-the-art accuracy and speed on benchmark datasets with a super-light model size (0.9M parameters). Two simplified versions of SVT-Net are also introduced; they likewise achieve state-of-the-art performance while further reducing the model size to 0.8M and 0.4M respectively.
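The core idea above is that self-attention is applied only over the occupied voxels of a sparse grid, so long-range context can be captured without materialising empty space. The following is a minimal NumPy sketch of generic scaled dot-product self-attention over sparse voxel features; it is an illustration of the mechanism, not the paper's exact ASVT/CSVT design, and all weights and dimensions here are made-up placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_voxel_self_attention(feats, d_k=8, seed=0):
    """Scaled dot-product self-attention over N occupied voxels.

    feats: (N, C) features of occupied voxels only -- empty voxels are
    never materialised. Projection weights are random placeholders here;
    a real model (e.g. ASVT/CSVT) would learn them.
    """
    rng = np.random.default_rng(seed)
    N, C = feats.shape
    Wq, Wk, Wv = (rng.standard_normal((C, d_k)) * C ** -0.5 for _ in range(3))
    q, k, v = feats @ Wq, feats @ Wk, feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k))  # (N, N): each voxel attends to all others
    return attn @ v                          # (N, d_k) context-enriched features

# 50 occupied voxels with 16-dim local features (e.g. from a sparse conv backbone)
local = np.random.default_rng(1).standard_normal((50, 16))
ctx = sparse_voxel_self_attention(local)
print(ctx.shape)  # (50, 8)
```

Because attention runs only over the N occupied voxels rather than the full dense grid, its cost scales with scene occupancy, which is what makes transformer-style context aggregation affordable on large outdoor point clouds.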
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Place Recognition | Oxford RobotCar | Avg Recall @ 1% | 98 | 43 |
| Place Recognition | Oxford | AR@1% | 98.4 | 42 |
| Place Recognition | R.A. | AR@1 (%) | 99.5 | 40 |
| Place Recognition | B.D. | AR@1% | 97.2 | 40 |
| Place Recognition | University Sectors (U.S.) | Recall@1% | 99.9 | 30 |
| Place Recognition | U.S. | AR@1% | 99.9 | 20 |
| Place Recognition | Residential Area (R.A.) | Avg Recall @ 1% | 92.7 | 10 |
| Place Recognition | Business District (B.D.) | Recall@1% | 90.7 | 10 |
| Place Recognition | Oxford (test) | Recall@1% | 98.6 | 10 |
| Place Recognition | U.S. University Sector (test) | Avg Recall @ 1% | 99.9 | 10 |
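The recall metrics reported above (AR@1%, Recall@1%) count a query as correctly localised when at least one of its top-k retrieved database entries is a true match, with k set to 1% of the database size. A minimal sketch of that metric, assuming precomputed global descriptors and ground-truth match indices (all names and data below are illustrative):

```python
import numpy as np

def recall_at_percent(query_desc, db_desc, gt_matches, percent=1.0):
    """Percentage of queries with >= 1 true match among the top-k nearest
    database descriptors, where k = ceil(percent/100 * database size)."""
    k = max(1, int(np.ceil(percent / 100.0 * len(db_desc))))
    hits = 0
    for q, gt in zip(query_desc, gt_matches):
        d = np.linalg.norm(db_desc - q, axis=1)  # Euclidean distance to every entry
        topk = np.argsort(d)[:k]                  # indices of the k nearest entries
        hits += bool(set(topk.tolist()) & set(gt))
    return 100.0 * hits / len(query_desc)

# Toy example: 4 queries against a 200-entry database, so 1% -> k = 2.
rng = np.random.default_rng(0)
db = rng.standard_normal((200, 32))
queries = db[[3, 50, 120, 7]] + 0.01 * rng.standard_normal((4, 32))
gt = [[3], [50], [120], [999]]  # the last query's true match is absent
print(recall_at_percent(queries, db, gt))  # 75.0
```

Averaging this percentage over all evaluation query sets yields the Avg Recall @ 1% figures shown in the table.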