
Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model

About

Despite the significant achievements of Vision Transformers (ViTs) in various vision tasks, they are constrained by the quadratic complexity of self-attention with respect to the input length. Recently, State Space Models (SSMs) have garnered widespread attention due to their global receptive field and linear complexity with respect to the input length, demonstrating substantial potential across fields including natural language processing and computer vision. To improve the performance of SSMs in vision tasks, a multi-scan strategy is widely adopted, which introduces significant redundancy. For a better trade-off between efficiency and performance, we analyze the underlying reasons behind the success of the multi-scan strategy and find that long-range dependency plays an important role. Based on this analysis, we introduce Multi-Scale Vision Mamba (MSVMamba) to preserve the superiority of SSMs in vision tasks with limited parameters. It employs a multi-scale 2D scanning technique on both original and downsampled feature maps, which not only benefits long-range dependency learning but also reduces computational costs. Additionally, we integrate a Convolutional Feed-Forward Network (ConvFFN) to address the lack of channel mixing. Our experiments demonstrate that MSVMamba is highly competitive: the MSVMamba-Tiny model achieves 82.8% top-1 accuracy on ImageNet, 46.9% box mAP and 42.2% instance mAP on COCO with the Mask R-CNN framework and a 1x training schedule, and 47.6% mIoU on ADE20K with single-scale testing. Code is available at https://github.com/YuHengsss/MSVMamba.
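
To make the cost argument concrete, the following is a rough sketch of how many tokens the scan routes process when some of them run on a downsampled feature map. It is not the authors' implementation. The function name `scan_tokens`, the 56x56 map size, the split of one full-resolution scan plus three downsampled scans, and the stride of 2 are all assumptions chosen for illustration; the abstract only states that scanning is applied to both original and downsampled feature maps.

```python
def scan_tokens(height, width, full_res_scans, down_scans, stride):
    """Count tokens processed across all scan routes for one feature map.

    Hypothetical helper: each scan route flattens a 2D map into a sequence,
    and an SSM's cost grows linearly with that sequence length.
    """
    full = full_res_scans * height * width
    # Ceil division gives the size of the map after strided downsampling.
    down_h = -(-height // stride)
    down_w = -(-width // stride)
    return full + down_scans * down_h * down_w


# Assumed baseline: four scan routes, all on the full-resolution map.
baseline = scan_tokens(56, 56, full_res_scans=4, down_scans=0, stride=2)

# Assumed multi-scale split: one full-resolution scan, three on a 2x downsampled map.
multi_scale = scan_tokens(56, 56, full_res_scans=1, down_scans=3, stride=2)

print(baseline, multi_scale, multi_scale / baseline)
```

Under these assumed settings the multi-scale split processes fewer than half as many tokens as the four full-resolution scans, which is the kind of saving the abstract refers to. The real ratio depends on the configuration used in the released code.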

Yuheng Shi, Minjing Dong, Chang Xu • 2024

Related benchmarks

Task | Dataset | Result | Rank
Semantic segmentation | ADE20K (val) | -- | 2888
Object Detection | COCO 2017 (val) | -- | 2643
Image Classification | ImageNet-1K | Top-1 Acc 83 | 1239
Instance Segmentation | COCO 2017 (val) | APm 0.422 | 1201
Semantic segmentation | ADE20K | mIoU 40.7 | 1024
Semantic segmentation | Cityscapes | mIoU 78.4 | 658
Image Classification | ImageNet 1k (test) | Top-1 Accuracy 82.8 | 450
Semantic segmentation | COCO Stuff | mIoU 36.63 | 379
Image Classification | ImageNet-1k 1.0 (test) | Top-1 Accuracy 79.8 | 229
Image Classification | ImageNet-100 (val) | Top-1 Accuracy 88.44 | 205
Showing 10 of 15 rows
