
LocalMamba: Visual State Space Model with Windowed Selective Scan

About

Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: https://github.com/hunto/LocalMamba.
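The windowed scan described above can be illustrated with a small sketch. It is not taken from the LocalMamba repository. The function names (`raster_scan_order`, `local_scan_order`, `scan_distance`) and the choice of row-major traversal both across and within windows are assumptions made for illustration. The sketch compares how far apart two vertically adjacent tokens end up in the 1D sequence under a plain flatten and under a windowed scan.

```python
def raster_scan_order(height, width):
    """Row-major flatten of an H x W token grid (hypothetical helper)."""
    return list(range(height * width))


def local_scan_order(height, width, window):
    """Visit tokens window by window (hypothetical helper).

    Windows are traversed in row-major order, and tokens inside each
    window are also traversed in row-major order. Assumes the grid
    divides evenly into window x window blocks.
    """
    if height % window or width % window:
        raise ValueError("grid size must be a multiple of the window size")
    order = []
    for win_row in range(0, height, window):
        for win_col in range(0, width, window):
            for r in range(win_row, win_row + window):
                for c in range(win_col, win_col + window):
                    order.append(r * width + c)
    return order


def scan_distance(order, token_a, token_b):
    """Number of sequence steps separating two tokens in a scan order."""
    position = {token: i for i, token in enumerate(order)}
    return abs(position[token_a] - position[token_b])


if __name__ == "__main__":
    h, w, win = 4, 4, 2
    raster = raster_scan_order(h, w)
    local = local_scan_order(h, w, win)
    print(local)
    # Token 0 is at (row 0, col 0); token 4 sits directly below it.
    print(scan_distance(raster, 0, 4), scan_distance(local, 0, 4))
```

On a 4 x 4 grid with 2 x 2 windows, the two vertically adjacent tokens are 4 steps apart in the flattened sequence but only 2 steps apart in the windowed one. The gap for a plain flatten grows with image width, while the windowed gap is bounded by the window size. The per-layer search over scan choices mentioned in the abstract is not covered by this sketch.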

Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, Chang Xu • 2024

Related benchmarks

Task                          Dataset                   Metric          Result  Rank
Semantic segmentation         ADE20K (val)              mIoU            47.9    2888
Object Detection              COCO 2017 (val)           --              --      2643
Image Classification          ImageNet-1K 1.0 (val)     Top-1 Accuracy  83.7    1952
Image Classification          ImageNet-1K               Top-1 Accuracy  83.7    1239
Instance Segmentation         COCO 2017 (val)           APm             0.422   1201
Automatic Speech Recognition  LibriSpeech clean (test)  WER             6.8     1156
Automatic Speech Recognition  LibriSpeech (test-other)  WER             12.9    1151
Semantic segmentation         ADE20K                    --              --      1024
Image Classification          ImageNet 1k (test)        Top-1 Accuracy  83.7    450
Object Detection              COCO 2017                 AP (Box)        48.4    321
Showing 10 of 23 rows
