
Flexible and Efficient Spatio-Temporal Transformer for Sequential Visual Place Recognition

About

Sequential Visual Place Recognition (Seq-VPR) leverages transformers to capture spatio-temporal features effectively. In practice, a transformer-based Seq-VPR model should be flexible to the number of frames per sequence (seq-length), deliver fast inference, and have low memory usage to meet real-time constraints. However, existing approaches prioritize performance at the expense of flexibility and efficiency. To address this gap, we propose Adapt-STformer, a Seq-VPR method built around our novel Recurrent Deformable Transformer Encoder (Recurrent-DTE), which uses an iterative recurrent mechanism to fuse information from multiple sequential frames. This design naturally supports variable seq-lengths, fast inference, and low memory usage. Experiments on the Nordland, Oxford, and NuScenes datasets show that Adapt-STformer boosts recall by up to 17% while reducing sequence extraction time by 36% and lowering memory usage by 35% relative to our best comparable baseline. Our code is released at https://ai4ce.github.io/Adapt-STFormer/.
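To make the recurrent fusion idea concrete, below is a minimal PyTorch-style sketch of how an encoder layer can be applied iteratively to fold each new frame into a fixed-size running state. This is an illustrative assumption, not the authors' implementation: the class name RecurrentFusionEncoder, the tensor shapes, and the use of a standard self-attention layer in place of the paper's deformable attention are all hypothetical.

```python
# Hypothetical sketch of recurrent frame fusion (not the authors' Recurrent-DTE code).
import torch
import torch.nn as nn

class RecurrentFusionEncoder(nn.Module):
    """Toy stand-in for a recurrent transformer encoder: one encoder layer
    (plain self-attention here; the paper uses deformable attention) is
    applied iteratively, fusing each new frame into a running state."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, seq_len, tokens, dim) -- per-frame token features.
        state = frames[:, 0]                    # initialize state with frame 0
        for t in range(1, frames.size(1)):      # loop works for any seq-length
            fused = torch.cat([state, frames[:, t]], dim=1)
            state = self.layer(fused)[:, : state.size(1)]  # keep state fixed-size
        return state.mean(dim=1)                # pooled sequence descriptor

enc = RecurrentFusionEncoder()
seq = torch.randn(2, 5, 49, 256)   # 2 sequences, 5 frames, 49 tokens per frame
desc = enc(seq)                    # (2, 256) place descriptor
print(desc.shape)
```

Because the running state has a fixed size, memory stays constant regardless of seq-length and frames can be fused as they arrive, which is consistent with the flexibility and efficiency properties claimed above.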

Yu Kiu (Idan) Lau, Chao Chen, Ge Jin, Chen Feng • 2025

Related benchmarks

Task                                 Dataset        Metric    Result  Rank
Visual Place Recognition             Nordland       Recall@1  97.63   123
Sequential Visual Place Recognition  nuScenes       Recall@1  55.51   11
Sequential Visual Place Recognition  Oxford (Hard)  Recall@1  69.4    11
Sequential Visual Place Recognition  Oxford (Easy)  Recall@1  88.82   11
Visual Place Recognition             Nordland       Recall@5  99.23   4
Visual Place Recognition             Oxford (Easy)  Recall@5  96.18   4
Visual Place Recognition             Oxford (Hard)  Recall@5  76.59   4
Visual Place Recognition             nuScenes       Recall@5  70.31   4
