Optimizing Relevance Maps of Vision Transformers Improves Robustness

About

It has been observed that visual classification models often rely mostly on the image background, neglecting the foreground, which hurts their robustness to distribution changes. To alleviate this shortcoming, we propose to monitor the model's relevancy signal and manipulate it such that the model is focused on the foreground object. This is done as a finetuning step, involving relatively few samples consisting of pairs of images and their associated foreground masks. Specifically, we encourage (i) the model's relevancy map to assign lower relevance to background regions, (ii) the relevancy map to consider as much information as possible from the foreground, and (iii) the model's decisions to have high confidence. When applied to Vision Transformer (ViT) models, a marked improvement in robustness to domain shifts is observed. Moreover, the foreground masks can be obtained automatically, from a self-supervised variant of the ViT model itself; therefore no additional supervision is required.
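
The three objectives listed in the abstract can be sketched as a single finetuning loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `relevance_loss`, the squared-error form of the background and foreground terms, the use of the model's own top prediction as the confidence target, and the weights `w_bg`, `w_fg`, `w_conf` are all hypothetical choices made here. The relevance map is assumed to be normalized to [0, 1] per patch, and the mask is assumed to be binary (1 for foreground).

```python
import math


def relevance_loss(relevance, mask, logits, w_bg=1.0, w_fg=1.0, w_conf=1.0):
    """Sketch of a three-term loss over a flattened relevance map.

    relevance: per-patch relevance values, assumed to lie in [0, 1]
    mask:      per-patch foreground mask, 1 = foreground, 0 = background
    logits:    unnormalized class scores for the same image
    The weights are hypothetical; the abstract does not specify them.
    """
    n = len(relevance)
    if n == 0 or n != len(mask):
        raise ValueError("relevance and mask must be non-empty and equal length")
    if not logits:
        raise ValueError("logits must be non-empty")

    # (i) Background term: push relevance on background patches toward 0.
    bg = sum((r * (1 - m)) ** 2 for r, m in zip(relevance, mask)) / n

    # (ii) Foreground term: push relevance on foreground patches toward 1.
    fg = sum((r * m - m) ** 2 for r, m in zip(relevance, mask)) / n

    # (iii) Confidence term: cross-entropy against the model's own top class,
    # computed with a numerically stable log-sum-exp.
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    conf = log_z - top  # equals -log(softmax(logits)[argmax])

    total = w_bg * bg + w_fg * fg + w_conf * conf
    return {"bg": bg, "fg": fg, "conf": conf, "total": total}


# A map that sits entirely on the foreground incurs no relevance penalty;
# a map that sits entirely on the background is penalized by both terms.
good = relevance_loss([1.0, 1.0, 0.0, 0.0], [1, 1, 0, 0], [10.0, 0.0, 0.0])
bad = relevance_loss([0.0, 0.0, 1.0, 1.0], [1, 1, 0, 0], [0.0, 0.0])
```

In an actual finetuning setup the relevance map would have to be computed differentiably from the ViT, so that gradients of the first two terms reach the model weights; the plain-Python version above only shows how the three terms combine.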

Hila Chefer, Idan Schwartz, Lior Wolf • 2022

Related benchmarks

Task	Dataset	Metric	Result
Image Classification	ImageNet (val)	Top-1 Accuracy	85.4	188
Image Classification	ImageNet-A (test)	Top-1 Acc	42.4	177
Image Classification	ImageNet-R (test)	Accuracy	54	170
Image Classification	ImageNet-Sketch (test)	Top-1 Acc	0.542	153
Image Classification	ImageNet-W	IN-W Gap	-7.3	74
Image Classification	ImageNet matched frequency V2 (test)	Top-1 Accuracy	76.1	62
Image Classification	ImageNet-1K	IN-1k Acc	80.3	51
Image Classification	ObjectNet (test)	R@1	52	43
Robustness Evaluation	SI-Score location synthetic	R@1	48.3	31
Robustness Evaluation	SI-Score rotation synthetic	R@1	58	31

Showing 10 of 11 rows
