
Optimizing Relevance Maps of Vision Transformers Improves Robustness

About

It has been observed that visual classification models often rely mostly on the image background, neglecting the foreground, which hurts their robustness to distribution changes. To alleviate this shortcoming, we propose to monitor the model's relevancy signal and manipulate it such that the model is focused on the foreground object. This is done as a finetuning step, involving relatively few samples consisting of pairs of images and their associated foreground masks. Specifically, we encourage the model's relevancy map (i) to assign lower relevance to background regions, (ii) to consider as much information as possible from the foreground, and (iii) we encourage the decisions to have high confidence. When applied to Vision Transformer (ViT) models, a marked improvement in robustness to domain shifts is observed. Moreover, the foreground masks can be obtained automatically, from a self-supervised variant of the ViT model itself; therefore no additional supervision is required.
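The three objectives in the abstract can be sketched as a combined finetuning loss. This is a minimal illustrative sketch, not the paper's exact formulation: the term definitions, names, and weighting coefficients (`lambda_bg`, `lambda_fg`, `lambda_conf`) are assumptions, and the relevance map here is just an array of per-patch relevance scores with a matching binary foreground mask.

```python
import numpy as np

def relevance_loss(relevance, fg_mask, logits,
                   lambda_bg=1.0, lambda_fg=1.0, lambda_conf=1.0):
    """Hypothetical combined loss over a relevance map and class logits.

    relevance : per-patch relevance scores (non-negative array)
    fg_mask   : binary foreground mask, same shape as `relevance`
    logits    : classifier logits for the image
    """
    # (i) background term: push relevance outside the foreground mask toward zero
    bg_area = max((1 - fg_mask).sum(), 1)
    loss_bg = (relevance * (1 - fg_mask)).sum() / bg_area

    # (ii) foreground term: reward relevance that falls inside the mask
    # (negated so that higher foreground relevance lowers the loss)
    fg_area = max(fg_mask.sum(), 1)
    loss_fg = -(relevance * fg_mask).sum() / fg_area

    # (iii) confidence term: low entropy of the softmax distribution
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss_conf = -(p * np.log(p + 1e-12)).sum()

    return lambda_bg * loss_bg + lambda_fg * loss_fg + lambda_conf * loss_conf
```

Under this sketch, a relevance map concentrated on the foreground yields a lower loss than one concentrated on the background, which is the behavior the finetuning step is meant to encourage.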

Hila Chefer, Idan Schwartz, Lior Wolf • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet (val) | Top-1 Accuracy | 85.4 | 188 |
| Image Classification | ImageNet-A (test) | Top-1 Acc | 42.4 | 154 |
| Image Classification | ImageNet-Sketch (test) | Top-1 Acc | 0.542 | 132 |
| Image Classification | ImageNet-R (test) | Accuracy | 54 | 105 |
| Image Classification | ImageNet-W | IN-W Gap | -7.3 | 74 |
| Image Classification | ImageNet matched frequency V2 (test) | Top-1 Accuracy | 76.1 | 62 |
| Image Classification | ImageNet-1K | IN-1k Acc | 80.3 | 51 |
| Image Classification | ObjectNet (test) | R@1 | 52 | 43 |
| Robustness Evaluation | SI-Score location synthetic | R@1 | 48.3 | 31 |
| Robustness Evaluation | SI-Score rotation synthetic | R@1 | 58 | 31 |
(Showing 10 of 11 benchmark rows.)

Other info

Code
