Causal Attention for Vision-Language Tasks

About

We present a novel attention mechanism, Causal Attention (CATT), to remove the ever-elusive confounding effect in existing attention-based vision-language models. This effect causes a harmful bias that misleads the attention module into focusing on spurious correlations in the training data, which damages model generalization. As the confounder is generally unobserved, we use the front-door adjustment to realize the causal intervention, which does not require any knowledge of the confounder. Specifically, CATT is implemented as a combination of 1) In-Sample Attention (IS-ATT) and 2) Cross-Sample Attention (CS-ATT), where the latter forcibly brings other samples into every IS-ATT, mimicking the causal intervention. CATT abides by the Q-K-V convention and hence can replace any attention module, such as top-down attention and self-attention in Transformers. CATT improves various popular attention-based vision-language models by considerable margins. In particular, we show that CATT has great potential in large-scale pre-training: for example, it can lift the lighter LXMERT (Tan and Bansal, 2019), which uses less data and less computational power, to performance comparable to the heavier UNITER (Chen et al., 2020). Code is available at https://github.com/yangxuntu/catt.
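The abstract describes CATT as two attention branches that share the Q-K-V convention, so a small sketch may help make that structure concrete. The code below is an illustrative approximation only, not the authors' implementation. It assumes that CS-ATT attends over a hypothetical `global_dict` of feature vectors summarizing other training samples, and that the two branch outputs are simply concatenated. The function names, shapes, and the combination rule are all assumptions made for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (Q-K-V convention)."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # One weight per key, from the scaled query-key dot products.
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        )
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def causal_attention(queries, sample_feats, global_dict):
    """Hypothetical CATT-style block: IS-ATT plus CS-ATT.

    IS-ATT attends over features of the current sample. CS-ATT attends
    over `global_dict`, an assumed stand-in for features of other samples.
    Concatenating the two outputs is an assumption of this sketch.
    """
    is_out = attention(queries, sample_feats, sample_feats)  # in-sample
    cs_out = attention(queries, global_dict, global_dict)    # cross-sample
    return [a + b for a, b in zip(is_out, cs_out)]           # per-query concat


def random_matrix(rows, cols, rng):
    """Toy feature matrix with standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
queries = random_matrix(4, 8, rng)       # 4 query vectors of width 8
sample_feats = random_matrix(10, 8, rng)  # features of the current sample
global_dict = random_matrix(32, 8, rng)   # assumed cross-sample dictionary
out = causal_attention(queries, sample_feats, global_dict)
print(len(out), len(out[0]))  # 4 16
```

Because both branches take the same queries and return one vector per query, a block of this shape could in principle stand in wherever a single Q-K-V attention call is used, provided the layer that follows accepts the wider output.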

Xu Yang, Hanwang Zhang, Guojun Qi, Jianfei Cai • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr 1.317 | 682 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy 73.63 | 466 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy 73.54 | 337 |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy 77.23 | 327 |
| Visual Question Answering | GQA (test-dev) | Accuracy 61.87 | 178 |
| Image Captioning | MS-COCO (test) | -- | 117 |
| Visual Question Answering | GQA (test-std) | Accuracy 62.07 | 62 |
| Image Captioning | MS COCO (Karpathy) | CIDEr-D 131.7 | 56 |
| Visual Question Answering | VQA loc v2.0 (val) | Accuracy 67.33 | 7 |
| Visual Reasoning | NLVR2 loc (val) | Accuracy 77.27 | 5 |
