Top-down Neural Attention by Excitation Backprop
About
We aim to model the top-down attention of a Convolutional Neural Network (CNN) classifier for generating task-specific attention maps. Inspired by a top-down human visual attention model, we propose a new backpropagation scheme, called Excitation Backprop, to propagate top-down signals downward through the network hierarchy via a probabilistic Winner-Take-All process. Furthermore, we introduce the concept of contrastive attention to make the top-down attention maps more discriminative. In experiments, we demonstrate the accuracy and generalizability of our method in weakly supervised localization tasks on the MS COCO, PASCAL VOC07, and ImageNet datasets. The usefulness of our method is further validated in the text-to-region association task. On the Flickr30k Entities dataset, we achieve promising performance in phrase localization by leveraging the top-down attention of a CNN model trained on weakly labeled web images.
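The probabilistic Winner-Take-All process can be sketched for a single fully connected layer: each upper-layer neuron distributes its winning probability to lower-layer neurons in proportion to their activation times the positive (excitatory) weight connecting them. The sketch below is a minimal NumPy illustration of one such propagation step; the function name and array shapes are our own assumptions, not the authors' reference implementation.

```python
import numpy as np

def excitation_backprop_step(p_upper, a_lower, W):
    """One layer of the probabilistic Winner-Take-All propagation (sketch).

    p_upper : (m,)   marginal winning probabilities of upper-layer neurons
    a_lower : (n,)   lower-layer activations (assumed non-negative, e.g. post-ReLU)
    W       : (m, n) weights, where upper_j = sum_i W[j, i] * a_lower[i]
    Returns (n,) marginal winning probabilities of the lower-layer neurons.
    """
    W_pos = np.maximum(W, 0.0)              # only excitatory connections compete
    contrib = W_pos * a_lower[None, :]      # (m, n): a_i * w_ji^+ for each pair
    Z = contrib.sum(axis=1, keepdims=True)  # per-upper-neuron normalizer
    # Conditional probability P(lower_i wins | upper_j won); zero if no excitation
    cond = np.divide(contrib, Z, out=np.zeros_like(contrib), where=Z > 0)
    # Marginalize over upper-layer neurons: P(i) = sum_j P(i | j) * P(j)
    return cond.T @ p_upper
```

Applying this step layer by layer from the class output down to an early convolutional layer yields the attention map; note that the total probability mass is conserved at each step whenever every upper neuron receives some excitation.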
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Pointing localization | VOC 2007 (test) | Mean Accuracy (All) | 90.7 | 44 |
| Pointing game | MSCOCO 2014 (val) | Mean Accuracy (All) | 58.5 | 42 |
| Phrase Localization | Flickr30K Entities (test) | -- | -- | 35 |
| Phrase Localization | VisualGenome (VG) (test) | Pointing Accuracy | 19.31 | 29 |
| Pointing localization | VOC Difficult 2007 (test) | Accuracy | 72.3 | 21 |
| Phrase grounding | Flickr30K | -- | -- | 20 |
| Phrase grounding | ReferIt (test) | Pointing Accuracy | 31.97 | 18 |
| Visual Grounding | ReferIt | Pointing Game Accuracy | 31.97 | 16 |
| Weakly Supervised Grounding | Visual Genome (VG) (test) | Accuracy (Pointing Game) | 19.31 | 15 |
| Weakly Supervised Grounding | Flickr30k (test) | Accuracy (Pointing Game) | 42.4 | 14 |