COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
About
Due to the limited scale and quality of video-text training corpora, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visual semantic representations while disregarding temporal semantic representations and correlations. To address this issue, we propose COSA, a COncatenated SAmple pretrained vision-language foundation model. COSA jointly models visual contents and event-level temporal cues using only image-text corpora. We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining. This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus, enabling richer scene transformations and explicit event-description correspondence. Extensive experiments demonstrate that COSA consistently improves performance across a broad range of downstream tasks, including long-form/short-form video-text tasks and image-text tasks such as retrieval, captioning, and question answering. Notably, COSA achieves state-of-the-art results on various competitive benchmarks. Code and model are released at https://github.com/TXH-mercury/COSA.
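The concatenation step described in the abstract can be illustrated with a minimal sketch. This is not the released implementation: the function name `concatenate_samples`, the `group_size` parameter, the random grouping of pairs, and the use of a single space to join captions are all assumptions made for illustration. The sketch only shows how several image-text pairs might be stacked into one pseudo video with a matching pseudo paragraph, keeping frame order and sentence order aligned.

```python
import numpy as np


def concatenate_samples(images, captions, group_size=4, seed=0):
    """Group image-text pairs into pseudo video-paragraph samples.

    Hypothetical helper (not from the COSA codebase).
    images:   array-like of shape (N, H, W, C)
    captions: list of N strings
    Returns (pseudo_videos, paragraphs), where pseudo_videos has shape
    (N // group_size, group_size, H, W, C) and paragraphs is a list of
    N // group_size strings. Leftover pairs that do not fill a group
    are dropped.
    """
    images = np.asarray(images)
    if len(images) != len(captions):
        raise ValueError("images and captions must have the same length")

    # Assumption: pairs are grouped at random.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(captions))
    n_groups = len(captions) // group_size
    order = order[: n_groups * group_size].reshape(n_groups, group_size)

    # Each group of images becomes the frames of one pseudo video.
    pseudo_videos = images[order]

    # Captions are joined in the same order as the frames, so that
    # sentence k of the paragraph describes frame k of the pseudo video.
    paragraphs = [" ".join(captions[i] for i in group) for group in order]
    return pseudo_videos, paragraphs
```

With eight image-text pairs and `group_size=4`, this yields two pseudo videos of four frames each, each paired with a four-sentence paragraph. The frames of one pseudo video come from unrelated images, so consecutive frames differ far more than in a real video.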
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30K | R@1 | 90.2 | 559 |
| Video Question Answering | MSRVTT-QA | Accuracy | 49.2 | 505 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 80.54 | 486 |
| Video Question Answering | ActivityNet-QA | Accuracy | 49.9 | 418 |
| Text-to-Video Retrieval | DiDeMo (test) | R@1 | 70.5 | 407 |
| Video Question Answering | MSVD-QA | Accuracy | 60 | 393 |
| Text-to-Video Retrieval | LSMDC (test) | R@5 | 60.4 | 245 |
| Text-to-Video Retrieval | MSRVTT (test) | Recall@5 | 0.796 | 178 |
| Video Question Answering | TGIF-QA | Accuracy | 79.5 | 156 |
| Image-to-Text Retrieval | MSCOCO | R@1 | 68.5 | 152 |