Rethinking Patch Dependence for Masked Autoencoders
About
In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io
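The core idea — each masked patch is reconstructed independently by cross-attending from a mask query to the encoder's visible-token outputs, with no self-attention among mask tokens — can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the authors' implementation; the function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_readout(mask_queries, enc_out, d):
    """Reconstruct masked patches by cross-attention only.

    mask_queries: (num_masked, d) queries for the sampled masked patches
    enc_out:      (num_visible, d) encoder outputs for visible patches

    Each row of the result depends only on the encoder outputs and its
    own query -- masked patches never attend to one another, mirroring
    the CrossMAE decoder's design.
    """
    scores = mask_queries @ enc_out.T / np.sqrt(d)  # (num_masked, num_visible)
    attn = softmax(scores, axis=-1)
    return attn @ enc_out                           # (num_masked, d)

rng = np.random.default_rng(0)
d = 16
queries = rng.standard_normal((5, d))    # decode only a small subset of masked patches
visible = rng.standard_normal((12, d))   # encoder outputs for visible patches
recon = cross_attention_readout(queries, visible, d)
```

Because reconstructions are independent, only a small subset of masked patches needs to be decoded each step, which is where the computational savings over a full self-attention decoder come from.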
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K | mIoU | 30.9 | 1024 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy | 84.3 | 848 |
| Image Classification | ImageNet-1K | Top-1 Accuracy | 78.7 | 600 |
| Image Classification | ImageNet-1K | Accuracy | 84.9 | 193 |
| Image Classification | iNaturalist 18 | Overall Accuracy | 77.7 | 125 |
| Image Classification | VTAB | Overall Accuracy | 67.3 | 103 |
| Image Classification | iNaturalist 2021 | Top-1 Accuracy | 75.1 | 70 |
| Image Classification | VTAB-6 | Accuracy | 81.2 | 29 |
| Image Classification | ImageNet-1K 1.0 (val) | 1-shot Accuracy | 16.8 | 25 |
| Semantic segmentation | ADE20K (train) | mIoU | 49.6 | 15 |