ConvMAE: Masked Convolution Meets Masked Autoencoders

About

Vision Transformers (ViT) have become widely adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT, leading to state-of-the-art performance on image classification, detection, and semantic segmentation. In this paper, our ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme. However, directly using the original masking strategy leads to heavy computational cost and a pretraining-finetuning discrepancy. To tackle these issues, we adopt masked convolution to prevent information leakage in the convolution blocks. A simple block-wise masking strategy is proposed to ensure computational efficiency. We also propose to supervise the multi-scale features of the encoder more directly in order to strengthen them. Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection, ConvMAE-Base fine-tuned for only 25 epochs surpasses MAE-Base fine-tuned for 100 epochs by 2.9% box AP and 2.2% mask AP, respectively. Code and pretrained models are available at https://github.com/Alpha-VL/ConvMAE.
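
The two ideas named in the abstract, block-wise masking and masked convolution, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names `blockwise_masks` and `masked_conv2d`, the 14x14 coarse grid, the 75% masking ratio, the 4x/2x upsampling factors, and the single-channel convolution are all assumptions chosen for illustration. The sketch samples one random mask on the coarsest token grid, upsamples it so that every finer stage masks the same image regions, and zeroes masked positions before convolving so that their content cannot reach visible positions.

```python
import random


def blockwise_masks(grid=14, mask_ratio=0.75, scales=(4, 2, 1), seed=0):
    """Sample one mask on the coarsest grid and upsample it to finer grids.

    Returns a dict mapping each upsampling factor to a 2D list of 0/1 values,
    where 1 marks a visible position and 0 marks a masked one.
    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    num_tokens = grid * grid
    num_masked = int(num_tokens * mask_ratio)
    masked = set(rng.sample(range(num_tokens), num_masked))
    coarse = [
        [0 if r * grid + c in masked else 1 for c in range(grid)]
        for r in range(grid)
    ]
    masks = {}
    for s in scales:
        # Nearest-neighbour upsampling: each coarse cell becomes an s x s block,
        # so the masked regions line up across all stages.
        size = grid * s
        masks[s] = [
            [coarse[r // s][c // s] for c in range(size)] for r in range(size)
        ]
    return masks


def masked_conv2d(feature, mask, kernel):
    """Single-channel 2D convolution (zero padding) over a masked feature map.

    Masked positions are zeroed before the convolution, so whatever values
    they held cannot influence any output position.
    """
    h, w = len(feature), len(feature[0])
    k = len(kernel)
    pad = k // 2
    x = [[feature[r][c] * mask[r][c] for c in range(w)] for r in range(h)]
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i][j] * x[rr][cc]
            out[r][c] = acc
    return out
```

Because the mask is drawn once at the coarsest resolution, only the visible tokens need to be passed to the transformer stage, which is the efficiency argument the abstract makes. Changing the values stored at masked positions leaves the output of `masked_conv2d` unchanged, which is the "no information leakage" property in miniature.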

Peng Gao, Teli Ma, Hongsheng Li, Ziyi Lin, Jifeng Dai, Yu Qiao • 2022

Related benchmarks

Task | Dataset | Result | Rank
Image Classification | RESISC45 | -- | 472
Infrared Small Target Detection | IRSTD-1K | Pd 77.9 | 188
Semantic segmentation | MSRS | mIoU 79 | 93
Image Classification | ImageNet-1K (fine-tuning) | Accuracy (FT) 85 | 57
Remote Sensing Scene Classification | RESISC45 | Accuracy 95 | 48
Object Detection | M3FD-inf (test) | mAP 56.8 | 13
Object Detection | M3FD-IR (test) | mAP 56.6 | 11
Semantic segmentation | MSRS Infrared (test) | mIoU 74.9 | 11
Semantic segmentation | SODA-IR (test) | mIoU 69.57 | 8
Semantic segmentation | MFNet-IR (val) | mIoU 50.29 | 8
Showing 10 of 14 rows
