
MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

About

Generative modeling and representation learning are two key tasks in computer vision. However, these models are typically trained independently, which ignores the potential for each task to help the other, and leads to training and model maintenance overheads. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training can allow generative training (very high masking ratio) and representation learning (lower masking ratio) under the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs, combining this with masking. We can further improve the representation by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation and 78.9% top-1 accuracy for linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage.
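The variable-masking idea described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state how the masking ratio is sampled, so the uniform range of 0.5 to 1.0, the 256-token sequence length, the 1024-entry codebook, and the names `MASK_ID`, `sample_mask_ratio`, and `mask_tokens` are all assumptions made for illustration.

```python
import random

MASK_ID = -1  # hypothetical placeholder id for a masked token (real ids are >= 0)


def sample_mask_ratio(low=0.5, high=1.0, rng=random):
    """Draw a masking ratio uniformly from [low, high].

    The range and the uniform distribution are assumptions; the abstract only
    says the ratio varies between "lower" and "very high".
    """
    return rng.uniform(low, high)


def mask_tokens(tokens, ratio, rng=random):
    """Replace round(ratio * len(tokens)) randomly chosen positions with MASK_ID.

    Returns the masked copy and the sorted list of masked positions.
    """
    n_mask = min(len(tokens), max(1, round(ratio * len(tokens))))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK_ID
    return masked, positions


rng = random.Random(0)
# Stand-in for the token ids a vector-quantized GAN tokenizer would produce
# (assumed here: 256 tokens drawn from a 1024-entry codebook).
tokens = [rng.randrange(1024) for _ in range(256)]
ratio = sample_mask_ratio(rng=rng)
masked, positions = mask_tokens(tokens, ratio, rng=rng)
print(f"ratio={ratio:.2f}, masked {len(positions)} of {len(tokens)} tokens")
```

In a training loop of this shape, each batch would draw a fresh ratio, so the same model sees both nearly fully masked inputs (closer to generation) and partially masked inputs (closer to representation learning).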

Tianhong Li, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, Dilip Krishnan • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | ImageNet (val) | Top-1 Acc 84.3 | 1206 |
| Conditional Image Generation | ImageNet-1K 256x256 (val) | gFID 6.93 | 86 |
| Image Generation | ImageNet 256x256 (test val) | FID 6.93 | 35 |
| Class-unconditional image generation | ImageNet 256x256 | FID 9.1 | 25 |
| Conditional Image Generation | ImageNet 256x256 (train val) | -- | 24 |
| Conditional Image Generation | ImageNet 256x256 1.0 (train val) | FID 6.93 | 23 |
| Unconditional Image Generation | ImageNet 256x256 (train) | FID 7.04 | 21 |
| Unconditional Image Generation | ImageNet (val) | FID 7.04 | 12 |
| Image Representation Learning | ImageNet 1k (train) | Accuracy 78.9 | 6 |
