
VMamba: Visual State Space Model

About

Designing computationally efficient network architectures remains an ongoing necessity in computer vision. In this paper, we adapt Mamba, a state-space language model, into VMamba, a vision backbone with linear time complexity. At the core of VMamba is a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D bridges the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the collection of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments demonstrate VMamba's promising performance across diverse visual perception tasks, highlighting its superior input scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.
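As a rough illustration of the four scanning routes mentioned above, the sketch below flattens an H×W grid of patch indices in four orders: row-major, column-major, and the reverse of each. The abstract does not say which four routes SS2D uses, so that choice is an assumption, as are the helper names `four_scan_routes` and `merge_routes` and the use of NumPy. This is not the VMamba implementation; the linked repository is the reference for that.

```python
import numpy as np

def four_scan_routes(grid: np.ndarray) -> np.ndarray:
    """Flatten a 2D grid (H, W) into four 1D sequences, shape (4, H*W).

    Assumed routes (illustrative, not taken from the abstract):
      0: row-major, starting at the top-left
      1: column-major, starting at the top-left
      2: reverse of route 0
      3: reverse of route 1
    """
    row_major = grid.reshape(-1)
    col_major = grid.T.reshape(-1)
    return np.stack([row_major, col_major, row_major[::-1], col_major[::-1]])

def merge_routes(seqs: np.ndarray, h: int, w: int) -> np.ndarray:
    """Undo each route's ordering and sum the four results on the (H, W) grid."""
    r0 = seqs[0].reshape(h, w)
    r1 = seqs[1].reshape(w, h).T
    r2 = seqs[2][::-1].reshape(h, w)
    r3 = seqs[3][::-1].reshape(w, h).T
    return r0 + r1 + r2 + r3

grid = np.arange(6).reshape(2, 3)   # patch indices of a hypothetical 2x3 image
routes = four_scan_routes(grid)
merged = merge_routes(routes, 2, 3)
print(routes)
print(merged)
```

In this sketch the sequences pass through unchanged, so `merged` is simply four times the input grid. In a real model each of the four sequences would be processed by a 1D selective scan before merging, which is how every patch can receive context from several directions.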

Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, Yunfan Liu • 2024

Related benchmarks

Task                          Dataset                   Result               Rank
Semantic segmentation         ADE20K (val)              mIoU 50.6            3069
Object Detection              COCO 2017 (val)           --                   2843
Image Classification          ImageNet-1K 1.0 (val)     Top-1 Accuracy 83.9  2238
Instance Segmentation         COCO 2017 (val)           APm 0.442            1275
Image Classification          ImageNet-1K               Top-1 Acc 84.5       1239
Automatic Speech Recognition  LibriSpeech clean (test)  WER 6.9              1207
Automatic Speech Recognition  LibriSpeech (test-other)  WER 13.1             1206
Semantic segmentation         ADE20K                    mIoU 41.68           1028
Image Classification          ImageNet-1k (val)         Top-1 Accuracy 83.9  708
Semantic segmentation         Cityscapes                mIoU 79.03           668
Showing 10 of 89 rows