PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

About

This paper explores a better prediction target for BERT pre-training of vision transformers. We observe that current prediction targets disagree with human perception judgment. This contradiction motivates us to learn a perceptual prediction target. We argue that perceptually similar images should stay close to each other in the prediction target space. We surprisingly find one simple yet effective idea: enforcing perceptual similarity during the dVAE training. Moreover, we adopt a self-supervised transformer model for deep feature extraction and show that it works well for calculating perceptual similarity. We demonstrate that such learned visual tokens indeed exhibit better semantic meanings, and help pre-training achieve superior transfer performance in various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with ViT-B backbone, outperforming the competitive method BEiT by +1.3% under the same pre-training epochs. Our approach also gets significant improvement on object detection and segmentation on COCO and semantic segmentation on ADE20K. Equipped with a larger backbone ViT-H, we achieve the state-of-the-art ImageNet accuracy (88.3%) among methods using only ImageNet-1K data.
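
The idea of enforcing perceptual similarity during tokenizer training can be illustrated with a small sketch. The abstract does not give the exact loss, so everything here is an assumption made for illustration: the feature extractor is a hypothetical stand-in (fixed random linear layers with ReLU, not the self-supervised transformer the authors use), the perceptual term is taken to be a sum of per-layer mean squared feature differences, and `weight` is an invented balancing parameter.

```python
import math
import random


def make_feature_extractor(in_dim, layer_dims, seed=0):
    """Hypothetical stand-in for a frozen deep network.

    Builds fixed random linear layers followed by ReLU and returns a
    function that yields the activations of every layer.
    """
    rng = random.Random(seed)
    layers = []
    prev = in_dim
    for dim in layer_dims:
        scale = 1.0 / math.sqrt(prev)
        layers.append([[rng.gauss(0.0, scale) for _ in range(prev)]
                       for _ in range(dim)])
        prev = dim

    def extract(x):
        feats = []
        h = x
        for weights in layers:
            h = [max(0.0, sum(w * v for w, v in zip(row, h)))
                 for row in weights]
            feats.append(h)
        return feats

    return extract


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def perceptual_loss(extract, x, x_rec):
    """Sum of per-layer feature MSEs between input and reconstruction (assumed form)."""
    return sum(mse(fa, fb) for fa, fb in zip(extract(x), extract(x_rec)))


def tokenizer_loss(extract, x, x_rec, weight=1.0):
    """Pixel reconstruction term plus a weighted perceptual term (assumed form)."""
    return mse(x, x_rec) + weight * perceptual_loss(extract, x, x_rec)


if __name__ == "__main__":
    extract = make_feature_extractor(in_dim=16, layer_dims=[8, 4])
    rng = random.Random(1)
    x = [rng.random() for _ in range(16)]
    x_close = [v + 0.01 for v in x]   # small perturbation of the input
    x_far = [1.0 - v for v in x]      # inverted input
    print("identical:", tokenizer_loss(extract, x, x))
    print("close:    ", tokenizer_loss(extract, x, x_close))
    print("far:      ", tokenizer_loss(extract, x, x_far))
```

In a real setup the reconstruction would come from the dVAE decoder and the features from a pretrained network; the sketch only shows how a feature-space term sits alongside the pixel term in the training objective.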

Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu, Baining Guo • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 48.5 | 2731 |
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy 84.5 | 1866 |
| Instance Segmentation | COCO 2017 (val) | -- | 1144 |
| Semantic segmentation | ADE20K | mIoU 48.5 | 936 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy 84.5 | 798 |
| Image Classification | ImageNet-1K | Top-1 Acc 88.3 | 524 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy 86.5 | 512 |
| Instance Segmentation | COCO | APmask 42.6 | 279 |
| Object Detection | MS-COCO 2017 (val) | -- | 237 |
| Image Classification | ImageNet 1K (train val) | Top-1 Accuracy 84.5 | 107 |
Showing 10 of 12 rows
