MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining
About
Hybrid Mamba-Transformer networks have recently garnered broad attention. These networks can leverage the scalability of Transformers while capitalizing on Mamba's strengths in long-context modeling and computational efficiency. However, how to pretrain such hybrid networks effectively remains an open question. Existing methods, such as Masked Autoencoders (MAE) or autoregressive (AR) pretraining, primarily target a single type of network architecture. In contrast, a pretraining strategy for hybrid architectures must be effective for both the Mamba and the Transformer components. Based on this, we propose Masked Autoregressive Pretraining (MAP) to pretrain a hybrid Mamba-Transformer vision backbone network. This strategy combines the strengths of both MAE and AR pretraining, improving the performance of the Mamba and Transformer modules within a unified paradigm. Experimental results show that a hybrid Mamba-Transformer vision backbone pretrained with MAP significantly outperforms the same backbone pretrained with other strategies, achieving state-of-the-art performance. We validate the method's effectiveness on both 2D and 3D datasets and provide detailed ablation studies to support the design choices for each component. The code and checkpoints are available at https://github.com/yunzeliu/MAP.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K (val) | mIoU | 46.9 | 2731 |
| Object Detection | COCO 2017 (val) | -- | -- | 2454 |
| Instance Segmentation | COCO 2017 (val) | -- | -- | 1144 |
| 3D Object Classification | ModelNet40 (test) | Accuracy | 95.9 | 302 |
| Part Segmentation | ShapeNetPart | mIoU (Instance) | 86.3 | 198 |
| Image Classification | ImageNet-1k (val) | Accuracy | 86.4 | 189 |
| Few-shot 3D Classification | ModelNet40 (test) | Accuracy | 98.7 | 92 |
| 3D Object Classification | ScanObjectNN OBJ-ONLY (test) | Accuracy | 94.97 | 49 |
| 3D Classification | ScanObjectNN PB-T50-RS official | Accuracy | 93.87 | 42 |
| 3D Classification | ScanObjectNN OBJ_BG (test) | Accuracy | 95.84 | 36 |