MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining

About

Hybrid Mamba-Transformer networks have recently garnered broad attention. These networks can leverage the scalability of Transformers while capitalizing on Mamba's strengths in long-context modeling and computational efficiency. However, how to pretrain such hybrid networks effectively remains an open question. Existing methods, such as Masked Autoencoders (MAE) or autoregressive (AR) pretraining, primarily target single-type network architectures. In contrast, a pretraining strategy for hybrid architectures must be effective for both the Mamba and the Transformer components. Based on this, we propose Masked Autoregressive Pretraining (MAP) to pretrain a hybrid Mamba-Transformer vision backbone network. This strategy combines the strengths of both MAE and AR pretraining, improving the performance of the Mamba and Transformer modules within a unified paradigm. Experimental results show that the hybrid Mamba-Transformer vision backbone pretrained with MAP significantly outperforms the same backbone pretrained with other strategies, achieving state-of-the-art performance. We validate the method's effectiveness on both 2D and 3D datasets and provide detailed ablation studies to support the design choices for each component. The code and checkpoints are available at https://github.com/yunzeliu/MAP.
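For readers unfamiliar with the two ingredients named in the abstract, the sketch below shows how each one, in its generic textbook form, decides what the model sees and what it must predict. It is not the MAP procedure: the abstract does not say how masking and autoregression are combined, and the sketch does not try to reproduce that. The function names `mae_split` and `ar_pairs`, the 75% mask ratio, and the raster patch order are assumptions chosen only for illustration.

```python
import random


def mae_split(num_patches, mask_ratio, seed=0):
    """Generic MAE-style random masking (illustrative, not the MAP code).

    Returns (visible, masked) patch indices. The encoder would see only the
    visible patches and the decoder would reconstruct the masked ones.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def ar_pairs(num_patches):
    """Generic AR-style targets (illustrative, not the MAP code).

    Returns (context, target) pairs: patch t is predicted from patches
    0..t-1, assuming a fixed raster order over the patches.
    """
    return [(list(range(t)), t) for t in range(1, num_patches)]


# Hypothetical settings: 16 patches, 75% of them masked.
visible, masked = mae_split(num_patches=16, mask_ratio=0.75)
pairs = ar_pairs(num_patches=4)
```

The contrast is that masking hides a random subset and predicts it from whatever remains visible, while the autoregressive setup always predicts the next patch from a strictly earlier prefix.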

Yunze Liu, Li Yi • 2024

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 46.9	3069
Object Detection	COCO 2017 (val)	--	2843
Instance Segmentation	COCO 2017 (val)	--	1275
3D Object Classification	ModelNet40 (test)	Accuracy 95.9	321
Part Segmentation	ShapeNetPart	mIoU (Instance) 86.3	246
Image Classification	ImageNet-1k (val)	Accuracy 86.4	199
Few-shot 3D Classification	ModelNet40 (test)	Accuracy 98.7	92
3D Object Classification	ScanObjectNN OBJ-ONLY (test)	Accuracy 94.97	49
3D Classification	ScanObjectNN PB-T50-RS official	Accuracy 93.87	42
3D Classification	ScanObjectNN OBJ_BG (test)	Accuracy 95.84	36

