
Distilled Dual-Encoder Model for Vision-Language Understanding

About

We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. Dual-encoder models have a faster inference speed than fusion-encoder models and enable the pre-computation of image and text representations during inference. However, the shallow interaction module used in dual-encoder models is insufficient for complex vision-language understanding tasks. To learn deep interactions between images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying cross-modal attention distillation in both the pre-training and fine-tuning stages achieves further improvements. Experimental results demonstrate that the distilled dual-encoder model achieves competitive performance on visual reasoning, visual entailment, and visual question answering tasks while enjoying a much faster inference speed than fusion-encoder models. Our code and models will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.
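The abstract describes matching the teacher's image-to-text and text-to-image attention distributions. The paper's exact loss is not given here; the following is a minimal numpy sketch of the idea, where a KL divergence between teacher and student cross-attention maps is computed in both directions. The projection matrices standing in for the fusion-encoder teacher and dual-encoder student are hypothetical placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys):
    # scaled dot-product attention distribution; each row sums to 1
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1)

def attention_kl(teacher_attn, student_attn, eps=1e-9):
    # KL(teacher || student), averaged over query positions
    kl = teacher_attn * (np.log(teacher_attn + eps) - np.log(student_attn + eps))
    return float(np.mean(kl.sum(axis=-1)))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 16))   # 4 image patches, hidden size 16
txt = rng.normal(size=(6, 16))   # 6 text tokens, hidden size 16

# hypothetical projections standing in for the teacher (fusion encoder)
# and student (dual encoder); in practice these come from two models
W_teacher = rng.normal(size=(16, 16)) * 0.1
W_student = rng.normal(size=(16, 16)) * 0.1

# image-to-text and text-to-image attention maps for both models
i2t_teacher = cross_attention(img @ W_teacher, txt)
i2t_student = cross_attention(img @ W_student, txt)
t2i_teacher = cross_attention(txt @ W_teacher, img)
t2i_student = cross_attention(txt @ W_student, img)

# distillation objective: match both directions of cross-modal attention
loss = attention_kl(i2t_teacher, i2t_student) + attention_kl(t2i_teacher, t2i_student)
```

In training, this loss would be minimized with respect to the student's parameters alongside the task objective, pulling the dual encoder's shallow interaction toward the teacher's deep cross-modal attention.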

Zekun Wang, Wenhui Wang, Haichao Zhu, Ming Liu, Bing Qin, Furu Wei • 2021

Related benchmarks

Task | Dataset | Result | Rank
Visual Question Answering | VQA 2.0 (test-dev) | Accuracy: 68.05 | 337
Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy: 75.6 | 327
Visual Entailment | SNLI-VE (test) | Overall Accuracy: 76.3 | 197
Visual Question Answering | VQA (test-std) | Accuracy: 69.2 | 110
Visual Reasoning | NLVR2 (test) | Accuracy: 74.3 | 44
Natural Language Visual Reasoning | NLVR2 (val) | Accuracy: 74.16 | 12
