Rethinking Patch Dependence for Masked Autoencoders

About

In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io
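The core architectural idea is that each mask token queries the encoder outputs through cross-attention only, with no self-attention among mask tokens, so each masked patch is decoded independently. Below is a minimal sketch of such a cross-attention-only decoder block in PyTorch; the class and parameter names are illustrative assumptions, not the authors' released implementation (see the linked code for that).

```python
# Sketch of a cross-attention-only decoder block (assumed names, not the official code):
# mask-token queries attend to encoder outputs; no self-attention among mask tokens.
import torch
import torch.nn as nn


class CrossAttentionDecoderBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, mask_queries: torch.Tensor, visible_tokens: torch.Tensor) -> torch.Tensor:
        # Queries: mask tokens (with positional embeddings); keys/values: encoder outputs.
        q = self.norm_q(mask_queries)
        kv = self.norm_kv(visible_tokens)
        attn_out, _ = self.cross_attn(q, kv, kv, need_weights=False)
        x = mask_queries + attn_out          # residual over cross-attention
        x = x + self.mlp(self.norm_mlp(x))   # feed-forward; no interaction between mask tokens
        return x


# Usage: reconstructions are read out for only a small subset of masked patches.
if __name__ == "__main__":
    B, num_visible, num_decoded, dim = 2, 49, 16, 512
    visible = torch.randn(B, num_visible, dim)   # encoder outputs for visible patches
    queries = torch.randn(B, num_decoded, dim)   # mask tokens for the patches to reconstruct
    out = CrossAttentionDecoderBlock(dim)(queries, visible)
    print(out.shape)  # torch.Size([2, 16, 512])
```

Because each mask token only reads from the encoder outputs, decoding a subset of masked patches costs a fraction of a full self-attention decoder pass, which is where the reported computational savings come from.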

Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, Xudong Wang, Adam Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K | mIoU | 30.9 | 1024 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy | 84.3 | 848 |
| Image Classification | ImageNet-1K | Top-1 Acc | 78.7 | 600 |
| Image Classification | ImageNet-1K | Accuracy | 84.9 | 193 |
| Image Classification | iNaturalist 18 | Overall Accuracy | 77.7 | 125 |
| Image Classification | VTAB | Overall Accuracy | 67.3 | 103 |
| Image Classification | iNaturalist 2021 | Top-1 Accuracy | 75.1 | 70 |
| Image Classification | VTAB-6 | Accuracy | 81.2 | 29 |
| Image Classification | ImageNet-1K 1.0 (val) | 1-shot Acc | 16.8 | 25 |
| Semantic segmentation | ADE20K (train) | mIoU | 49.6 | 15 |
