Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers

About

Transformers are increasingly dominating multi-modal reasoning tasks, such as visual question answering, achieving state-of-the-art results thanks to their ability to contextualize information using the self-attention and co-attention mechanisms. These attention modules also play a role in other computer vision tasks including object detection and image segmentation. Unlike Transformers that only use self-attention, Transformers with co-attention require to consider multiple attention maps in parallel in order to highlight the information that is relevant to the prediction in the model's input. In this work, we propose the first method to explain prediction by any Transformer-based architecture, including bi-modal Transformers and Transformers with co-attentions. We provide generic solutions and apply these to the three most commonly used of these architectures: (i) pure self-attention, (ii) self-attention combined with co-attention, and (iii) encoder-decoder attention. We show that our method is superior to all existing methods which are adapted from single modality explainability.

Hila Chefer, Shir Gur, Lior Wolf• 2021

Related benchmarks

Task	Dataset	Result
Image-to-Text Retrieval	MS-COCO (test)	R@120.97	127
Anomaly Segmentation	MVTec-AD (test)	--	85
Text-to-Image Retrieval	MS-COCO (test)	R@115.37	82
Localization	ImageNet-1k (val)	EHR0.297	79
Feature Importance Assessment	ImageNet-1k (val)	Insertion Score34.71	78
Weakly Supervised Object Localization	CUB-200-2011 (test)	Accuracy68.01	38
Perturbation-based evaluation	VQA v2	Positive Perturbation AUC1.05	34
Perturbation-based evaluation	ImageNet	Pos AUC11.15	34
Feature Attribution Evaluation	ImageNet-1k (val)	MoRF Score30.06	33
Faithfulness evaluation of image explanation	ImageNet (val)	Deletion59.4	32

Showing 10 of 68 rows

Other info

Follow for update

@wizwand_team Discord