Rethinking Patch Dependence for Masked Autoencoders
About
In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io
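The core idea — each masked patch is reconstructed independently by cross-attending from a mask query to the encoder's visible-token outputs, with no self-attention among mask tokens — can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the authors' implementation; the function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_readout(mask_queries, enc_out, d):
    """Reconstruct masked patches by cross-attention only.

    mask_queries: (num_masked, d) queries for the sampled masked patches
    enc_out:      (num_visible, d) encoder outputs for visible patches

    Each row of the result depends only on the encoder outputs and its
    own query -- masked patches never attend to one another, mirroring
    the CrossMAE decoder's design.
    """
    scores = mask_queries @ enc_out.T / np.sqrt(d)  # (num_masked, num_visible)
    attn = softmax(scores, axis=-1)
    return attn @ enc_out                           # (num_masked, d)

rng = np.random.default_rng(0)
d = 16
queries = rng.standard_normal((5, d))    # decode only a small subset of masked patches
visible = rng.standard_normal((12, d))   # encoder outputs for visible patches
recon = cross_attention_readout(queries, visible, d)
```

Because reconstructions are independent, only a small subset of masked patches needs to be decoded each step, which is where the computational savings over a full self-attention decoder come from.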
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K | mIoU | 30.9 | 1024 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy | 84.3 | 848 |
| Image Classification | ImageNet-1K | Top-1 Accuracy | 78.7 | 600 |
| Image Classification | ImageNet-1K | Accuracy | 84.9 | 193 |
| Image Classification | iNaturalist 18 | Overall Accuracy | 77.7 | 125 |
| Image Classification | VTAB | Overall Accuracy | 67.3 | 103 |
| Image Classification | iNaturalist 2021 | Top-1 Accuracy | 75.1 | 70 |
| Image Classification | VTAB-6 | Accuracy | 81.2 | 29 |
| Image Classification | ImageNet-1K 1.0 (val) | 1-shot Accuracy | 16.8 | 25 |
| Semantic segmentation | ADE20K (train) | mIoU | 49.6 | 15 |