VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

About

Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this paper, we set out to establish connections between multi-modality video tracks (vision, audio, and subtitle) and text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA). Extensive experiments demonstrate the effectiveness of the proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model, and dataset will be released at https://github.com/TXH-mercury/VAST.
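
The caption-integration step described in the abstract can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: the `ClipAnnotations` container, the `build_omni_prompt` helper, and the wording of `INSTRUCTION` are hypothetical, since the abstract does not give the actual prompt or the LLM used. The sketch only assembles the text that would be handed to an LLM; it does not call one.

```python
from dataclasses import dataclass


@dataclass
class ClipAnnotations:
    """Hypothetical container for the per-clip text inputs named in the abstract."""
    vision_caption: str  # output of the trained vision captioner
    audio_caption: str   # output of the trained audio captioner
    subtitle: str        # subtitle text of the clip; may be empty


# Assumed instruction text; the authors' real prompt is not stated in the abstract.
INSTRUCTION = "Combine the clues below into one fluent caption describing the video clip."


def build_omni_prompt(clip: ClipAnnotations) -> str:
    """Assemble one prompt asking an LLM to merge the captions and subtitle."""
    parts = [
        INSTRUCTION,
        f"Vision caption: {clip.vision_caption}",
        f"Audio caption: {clip.audio_caption}",
    ]
    # Include the subtitle only when the clip actually has one.
    if clip.subtitle.strip():
        parts.append(f"Subtitle: {clip.subtitle.strip()}")
    return "\n".join(parts)
```

The LLM's response to such a prompt would play the role of the omni-modality caption for that clip.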

Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu • 2023

Related benchmarks

Task                       Dataset            Result          Rank
Visual Question Answering  VQA v2             Accuracy 80.2   1165
Video Question Answering   MSRVTT-QA          Accuracy 50.1   481
Text-to-Video Retrieval    DiDeMo (test)      R@1 55.5        376
Text-to-Video Retrieval    DiDeMo             R@1 0.72        360
Video Question Answering   MSVD-QA            Accuracy 60.2   340
Video Question Answering   ActivityNet-QA     Accuracy 50.4   319
Text-to-Video Retrieval    MSR-VTT            Recall@1 56.6   313
Text-to-Video Retrieval    MSR-VTT (test)     R@1 63.9        234
Text-to-Video Retrieval    LSMDC (test)       R@1 23.2        225
Text-to-Video Retrieval    MSVD (test)        R@1 50.6        204
Showing 10 of 72 rows
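
For readers unfamiliar with the R@1 (Recall@1) metric reported for the retrieval benchmarks, the following is a minimal sketch of how Recall@K is commonly computed for text-to-video retrieval. It is not the evaluation code behind the numbers above. It assumes a square similarity matrix in which the correct video for text query `i` sits at index `i`, and it reports the result as a percentage; the function name `recall_at_k` is illustrative.

```python
def recall_at_k(similarity, k=1):
    """Percentage of text queries whose correct video is ranked in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        # Ties are therefore resolved in favour of the correct video.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example with three queries and three videos.
sim = [
    [0.9, 0.1, 0.2],  # query 0: correct video ranked first
    [0.8, 0.3, 0.1],  # query 1: correct video ranked second
    [0.2, 0.1, 0.7],  # query 2: correct video ranked first
]
r1 = recall_at_k(sim, k=1)  # two of three queries hit
r2 = recall_at_k(sim, k=2)  # all three queries hit
```

Note that a percentage scale is an assumption; some leaderboards report the same metric as a fraction between 0 and 1.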
