X-Linear Attention Networks for Image Captioning

About

Recent progress on fine-grained visual recognition and visual question answering has featured Bilinear Pooling, which effectively models the 2nd order interactions across multi-modal inputs. Nevertheless, there has been no evidence in support of building such interactions concurrently with an attention mechanism for image captioning. In this paper, we introduce a unified attention block, the X-Linear attention block, that fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning. Technically, the X-Linear attention block simultaneously exploits both the spatial and channel-wise bilinear attention distributions to capture the 2nd order interactions between the input single-modal or multi-modal features. Higher order feature interactions are readily modeled by stacking multiple X-Linear attention blocks, and even infinity order interactions by equipping the block with Exponential Linear Unit (ELU) in a parameter-free fashion. Furthermore, we present X-Linear Attention Networks (dubbed X-LAN) that integrate X-Linear attention block(s) into the image encoder and sentence decoder of an image captioning model to leverage higher order intra- and inter-modal interactions. Experiments on the COCO benchmark demonstrate that our X-LAN obtains the best published CIDEr performance to date, 132.0%, on the COCO Karpathy test split. When further endowing Transformer with X-Linear attention blocks, CIDEr is boosted up to 132.8%. Source code is available at https://github.com/Panda-Peter/image-captioning.

Yingwei Pan, Ting Yao, Yehao Li, Tao Mei • 2020
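
The abstract describes an attention block that combines a spatial and a channel-wise attention distribution over second-order (bilinear) feature interactions. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. Everything specific here is an assumption made for illustration: the function name `bilinear_attend`, the omission of learned projections (so bilinear pooling reduces to an element-wise product), the softmax over regions, and the sigmoid gate over channels.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bilinear_attend(query, keys, values):
    """Toy second-order attention over N regions with D channels.

    query: list of D floats; keys, values: N lists of D floats each.
    Assumption: learned projections are left out, so the bilinear
    interaction is just an element-wise product with the query.
    """
    dims = len(query)
    # Second-order query-key and query-value interactions.
    bk = [[k_d * q_d for k_d, q_d in zip(k, query)] for k in keys]
    bv = [[v_d * q_d for v_d, q_d in zip(v, query)] for v in values]

    # Spatial attention: one score per region, normalised with softmax.
    spatial = softmax([sum(row) for row in bk])

    # Channel-wise attention: mean over regions, then a sigmoid gate per channel.
    channel = [sigmoid(sum(row[d] for row in bk) / len(bk)) for d in range(dims)]

    # Attention-weighted sum over regions, then gate each channel.
    pooled = [sum(w * row[d] for w, row in zip(spatial, bv)) for d in range(dims)]
    attended = [c * p for c, p in zip(channel, pooled)]
    return attended, spatial, channel


# Hypothetical inputs: 3 regions, 2 channels.
query = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, spatial, channel = bilinear_attend(query, keys, values)
```

In this toy setup the third region receives the largest spatial weight because its key overlaps with the query in both channels. Stacking several such blocks, as the abstract suggests, would compose these second-order interactions into higher-order ones.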

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr 1.372 | 682 |
| Image Captioning | MS-COCO (test) | CIDEr 74.1 | 117 |
| Image Captioning | MS COCO (Karpathy) | CIDEr-D 132 | 56 |
| Image Captioning | MS-COCO online (test) | BLEU-4 (c5) 40.3 | 49 |
| Image Captioning | MS-COCO 2014 (test) | BLEU-4 72.4 | 43 |
| Image Captioning | Conceptual Captions (test) | CIDEr 39.5 | 34 |
| Image Captioning | COCO c5 references online (test) | BLEU-1 81.9 | 24 |
| Image Captioning | MS-COCO Karpathy 2014 (test) | BLEU-4 39.5 | 24 |
| Image Captioning | MSCOCO (test server) | BLEU-4 (c5) 40.3 | 22 |
| Image Captioning | MS COCO 40,775 images (test) | CIDEr 133.5 | 15 |
Showing 10 of 14 rows
