VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

About

Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this paper, we resort to establish connections between multi-modality video tracks, including Vision, Audio, and Subtitle, and Text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA). Extensive experiments have been conducted to demonstrate the effectiveness of our proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model and dataset will be released at https://github.com/TXH-mercury/VAST.

Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu• 2023

Related benchmarks

Task	Dataset	Result
Visual Question Answering	VQA v2	Accuracy80.2	1429
Video Question Answering	MSRVTT-QA	Accuracy50.1	505
Text-to-Video Retrieval	DiDeMo	R@10.72	465
Video Question Answering	ActivityNet-QA	Accuracy50.4	418
Text-to-Video Retrieval	DiDeMo (test)	R@155.5	407
Text-to-Video Retrieval	MSR-VTT	Recall@164.4	406
Video Question Answering	MSVD-QA	Accuracy60.2	393
Text-to-Video Retrieval	MSR-VTT (test)	R@163.9	265
Text-to-Video Retrieval	ActivityNet	R@168.1	245
Text-to-Video Retrieval	LSMDC (test)	R@540.9	245

Showing 10 of 82 rows

...

Other info

Code

Follow for update

@wizwand_team Discord