OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

About

This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and thus can perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language to help video-language). To this end, we propose a decoupled joint pretraining of image-language and video-language to effectively decompose the vision-language modeling into spatial and temporal dimensions and obtain a performance boost on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.
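To make the idea of a contrastive loss that mixes paired and labeled data more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the UniVLC formula, so the sketch assumes a label-aware bidirectional contrastive form in which samples sharing a label are treated as positives, and image-text or video-text pairs each receive a unique label. The function name `unified_contrastive_loss`, its parameters, and the temperature value are all hypothetical.

```python
import math


def unified_contrastive_loss(visual, text, labels, temperature=0.07):
    """Label-aware bidirectional contrastive loss (illustrative assumption).

    visual, text: lists of L2-normalised embedding vectors (lists of floats).
    labels: one integer per sample; samples sharing a label are positives.
            For image-text / video-text pairs, assign each pair a unique label.
    """
    n = len(labels)
    # Dot product of normalised vectors = cosine similarity, scaled by temperature.
    sim = [
        [sum(a * b for a, b in zip(visual[i], text[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    def one_direction(logits):
        total = 0.0
        for i in range(n):
            # Numerically stable log-sum-exp over row i.
            row_max = max(logits[i])
            log_z = row_max + math.log(
                sum(math.exp(x - row_max) for x in logits[i])
            )
            positives = [j for j in range(n) if labels[j] == labels[i]]
            # Average log-probability over all positives of sample i.
            total -= sum(logits[i][j] - log_z for j in positives) / len(positives)
        return total / n

    # Visual-to-text uses sim; text-to-visual uses its transpose.
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(transposed))
```

Under this assumption, image-label and video-label data would supply the embedding of the class name (or a prompt built from it) as the text side, so several samples in a batch can share one label, while paired data falls back to the usual one-positive-per-sample contrastive objective.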

Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 133.9 | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 78.33 | 664 |
| Video Question Answering | MSRVTT-QA | Accuracy | 44.1 | 481 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 78.4 | 466 |
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 97.3 | 439 |
| Image Classification | DTD | Accuracy | 76.5 | 419 |
| Text-to-Video Retrieval | DiDeMo (test) | R@1 | 52.4 | 376 |
| Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 87.9 | 375 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy | 44.1 | 371 |
| Text-to-Video Retrieval | DiDeMo | R@1 | 0.524 | 360 |
Showing 10 of 60 rows
