iBOT: Image BERT Pre-Training with Online Tokenizer

About

The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework, iBOT, that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pre-trained beforehand. We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which help the models obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.
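The two self-distillation terms described above can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names, the temperature values, and the momentum (EMA) teacher update are assumptions made for illustration, and details such as multi-view cropping and teacher output centering are left out. The sketch only shows the idea that the teacher's soft distribution over each masked patch serves as the prediction target ("online tokenizer") for the student, alongside a matching loss on the class token.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]

def cross_entropy(teacher_logits, student_logits, tau_t=0.04, tau_s=0.1):
    """-sum_k p_teacher[k] * log p_student[k]; temperatures are illustrative."""
    p_t = [math.exp(v) for v in log_softmax(teacher_logits, tau_t)]
    log_p_s = log_softmax(student_logits, tau_s)
    return -sum(t * ls for t, ls in zip(p_t, log_p_s))

def mim_loss(teacher_patch_logits, student_patch_logits, mask):
    """Average distillation loss over masked patch positions only.

    The teacher sees the unmasked image, so its per-patch distribution
    plays the role of the token target for the masked student input.
    """
    masked = [i for i, m in enumerate(mask) if m]
    if not masked:
        return 0.0
    total = sum(
        cross_entropy(teacher_patch_logits[i], student_patch_logits[i])
        for i in masked
    )
    return total / len(masked)

def cls_loss(teacher_cls_logits, student_cls_logits):
    """Distillation on the class token (single-view simplification)."""
    return cross_entropy(teacher_cls_logits, student_cls_logits)

def ema_update(teacher_params, student_params, momentum=0.996):
    """Assumed momentum update that keeps the teacher 'online'."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example: 3 patches, 3 output dimensions, patches 0 and 2 masked.
teacher_patches = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 2.2]]
student_patches = [[1.0, 0.8, 0.2], [0.3, 0.9, 0.4], [0.5, 0.1, 0.6]]
mask = [True, False, True]

total_loss = (mim_loss(teacher_patches, student_patches, mask)
              + cls_loss([1.2, 0.3, 0.1], [0.9, 0.4, 0.2]))
print(f"total loss: {total_loss:.4f}")
```

Because the target distribution comes from a teacher that is updated alongside the student, no separately pre-trained tokenizer is required, which is the property the abstract emphasizes.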

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, Tao Kong • 2021

Related benchmarks

Task                       Dataset                Result                   Rank
Semantic segmentation      ADE20K (val)           mIoU 52.3                2731
Object Detection           COCO 2017 (val)        AP 51.2                  2454
Image Classification       ImageNet-1K 1.0 (val)  Top-1 Accuracy 85.2      1866
Image Classification       ImageNet-1k (val)      Top-1 Accuracy 79.5      1453
Image Classification       ImageNet (val)         --                       1206
Classification             ImageNet-1K 1.0 (val)  Top-1 Accuracy (%) 75.1  1155
Instance Segmentation      COCO 2017 (val)        APm 0.442                1144
Video Object Segmentation  DAVIS 2017 (val)       J mean 61.7              1130
Semantic segmentation      ADE20K                 mIoU 50                  936
Image Classification       ImageNet-1K            Top-1 Acc 84.8           836
Showing 10 of 135 rows
...
