Top-down Neural Attention by Excitation Backprop
About
We aim to model the top-down attention of a Convolutional Neural Network (CNN) classifier for generating task-specific attention maps. Inspired by a top-down human visual attention model, we propose a new backpropagation scheme, called Excitation Backprop, to propagate top-down signals downward through the network hierarchy via a probabilistic Winner-Take-All process. Furthermore, we introduce the concept of contrastive attention to make the top-down attention maps more discriminative. In experiments, we demonstrate the accuracy and generalizability of our method in weakly supervised localization tasks on the MS COCO, PASCAL VOC07, and ImageNet datasets. The usefulness of our method is further validated in the text-to-region association task. On the Flickr30k Entities dataset, we achieve promising performance in phrase localization by leveraging the top-down attention of a CNN model trained on weakly labeled web images.
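The probabilistic Winner-Take-All process can be sketched for a single fully connected layer: each upper-layer neuron distributes its winning probability to lower-layer neurons in proportion to their activation times the positive (excitatory) weight connecting them. The sketch below is a minimal NumPy illustration of one such propagation step; the function name and array shapes are our own assumptions, not the authors' reference implementation.

```python
import numpy as np

def excitation_backprop_step(p_upper, a_lower, W):
    """One layer of the probabilistic Winner-Take-All propagation (sketch).

    p_upper : (m,)   marginal winning probabilities of upper-layer neurons
    a_lower : (n,)   lower-layer activations (assumed non-negative, e.g. post-ReLU)
    W       : (m, n) weights, where upper_j = sum_i W[j, i] * a_lower[i]
    Returns (n,) marginal winning probabilities of the lower-layer neurons.
    """
    W_pos = np.maximum(W, 0.0)              # only excitatory connections compete
    contrib = W_pos * a_lower[None, :]      # (m, n): a_i * w_ji^+ for each pair
    Z = contrib.sum(axis=1, keepdims=True)  # per-upper-neuron normalizer
    # Conditional probability P(lower_i wins | upper_j won); zero if no excitation
    cond = np.divide(contrib, Z, out=np.zeros_like(contrib), where=Z > 0)
    # Marginalize over upper-layer neurons: P(i) = sum_j P(i | j) * P(j)
    return cond.T @ p_upper
```

Applying this step layer by layer from the class output down to an early convolutional layer yields the attention map; note that the total probability mass is conserved at each step whenever every upper neuron receives some excitation.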
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Pointing localization | VOC 2007 (test) | Mean Accuracy (All) | 90.7 | 44 |
| Pointing game | MSCOCO 2014 (val) | Mean Accuracy (All) | 58.5 | 42 |
| Phrase Localization | Flickr30K Entities (test) | -- | -- | 35 |
| Phrase Localization | VisualGenome (VG) (test) | Pointing Accuracy | 19.31 | 29 |
| Pointing localization | VOC Difficult 2007 (test) | Accuracy | 72.3 | 21 |
| Phrase grounding | Flickr30K | -- | -- | 20 |
| Phrase grounding | ReferIt (test) | Pointing Accuracy | 31.97 | 18 |
| Visual Grounding | ReferIt | Pointing Game Accuracy | 31.97 | 16 |
| Weakly Supervised Grounding | Visual Genome (VG) (test) | Accuracy (Pointing Game) | 19.31 | 15 |
| Weakly Supervised Grounding | Flickr30k (test) | Accuracy (Pointing Game) | 42.4 | 14 |