MambaVision: A Hybrid Mamba-Transformer Vision Backbone

About

We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones. Code: https://github.com/NVlabs/MambaVision
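
The placement of self-attention in the final layers can be illustrated with a small layout sketch. This is not the authors' implementation: the abstract does not say how many final layers use self-attention, so the half-and-half default below is an assumption, and the names stage_layout and attention_fraction are hypothetical. The sketch only plans which block type goes at each depth of one stage; it does not build any network.

```python
from typing import List


def stage_layout(depth: int, attention_fraction: float = 0.5) -> List[str]:
    """Return the block type for each layer of one stage, in order.

    Assumption (not stated in the abstract): the trailing
    `attention_fraction` of the layers use self-attention, and all
    earlier layers use a Mamba-style mixer.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if not 0.0 <= attention_fraction <= 1.0:
        raise ValueError("attention_fraction must be in [0, 1]")

    # Round down, so odd depths give the extra layer to the Mamba mixer.
    n_attention = int(depth * attention_fraction)
    n_mamba = depth - n_attention

    # Mamba mixers first, self-attention in the final layers.
    return ["mamba"] * n_mamba + ["attention"] * n_attention


if __name__ == "__main__":
    for depth in (1, 3, 8):
        print(depth, stage_layout(depth))
```

Keeping the layout as plain data makes the ordering easy to check before wiring it to real modules, and a different split can be tried by changing attention_fraction.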

Ali Hatamizadeh, Jan Kautz • 2024

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 49.1	3069
Object Detection	COCO 2017 (val)	--	2843
Image Classification	ImageNet-1K 1.0 (val)	Top-1 Accuracy 80.1	2238
Instance Segmentation	COCO 2017 (val)	APm 0.457	1275
Image Classification	ImageNet-1K	Top-1 Acc 82.3	1239
Automatic Speech Recognition	LibriSpeech clean (test)	WER 2.6	1207
Automatic Speech Recognition	LibriSpeech (test-other)	WER 5.8	1206
Semantic segmentation	ADE20K	mIoU 49.1	1028
Image Classification	ImageNet-1k (val)	Top-1 Accuracy 84.2	708
Image Classification	Food-101	--	570

Showing 10 of 41 rows

