Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation

About

Recent large-scale vision-language pre-training (VLP) of dual-stream architectures (e.g., CLIP) on a tremendous amount of image-text pair data has shown its superiority on various multimodal alignment tasks. Despite this success, the resulting models are not capable of multimodal generative tasks because their text encoder is weak. To tackle this problem, we propose to augment the dual-stream VLP model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD), enabling multimodal generation. VLKD is data- and computation-efficient compared to pre-training from scratch. Experimental results show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning. For example, it achieves 44.5% zero-shot accuracy on the VQAv2 dataset, surpassing the previous state-of-the-art zero-shot model with $7\times$ fewer parameters. Furthermore, the original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.

Wenliang Dai, Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu, Pascale Fung • 2022

Related benchmarks

Task	Dataset	Metric	Result
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy	69.8	712
Image Captioning	MS COCO Karpathy (test)	CIDEr	0.583	706
Visual Question Answering	OK-VQA (test)	Accuracy	13.3	327
Visual Question Answering	VQA 2.0 (val)	Accuracy (Overall)	42.6	183
Visual Question Answering	VQA v2 (val)	Accuracy	42.6	158
Visual Question Answering	VQAv2 (test)	VQA Accuracy	44.5	82
Visual Question Answering	OK-VQA (val)	Accuracy	13.3	47
Visual Question Answering	VQA 2.0 (test)	Accuracy	38.6	24
Visual Question Answering	VQA Karpathy (test)	Overall Accuracy	69.2	21
Visual Question Answering	VQAv2 (val)	Accuracy (Overall)	42.6	21
