Valley: Video Assistant with Large Language model Enhanced abilitY

About

Large Language Models (LLMs), with remarkable conversational capability, have emerged as AI assistants that can handle both visual and textual modalities. However, their effectiveness in joint video and language understanding has not been extensively explored. In this paper, we introduce Valley, a multi-modal foundation model designed to enable enhanced video comprehension and instruction-following capabilities. To this end, we construct two datasets, Valley-702k and Valley-instruct-73k, to cover a diverse range of video-text alignment and video-based instruction tasks, such as multi-shot captions, long video descriptions, action recognition, and causal inference. We then adopt ViT-L/14 as the vision encoder and explore three different temporal modeling modules to learn multifaceted features for enhanced video understanding. In addition, we implement a two-phase training approach for Valley: the first phase trains only the projection module so that the LLM can understand visual input, and the second phase jointly trains the projection module and the LLM to improve their instruction-following ability. Extensive experiments demonstrate that Valley has the potential to serve as an effective video assistant, simplifying complex video-understanding scenarios. Our code and data are published anonymously at https://github.com/valley-vl/Valley.

Ruipu Luo, Ziwang Zhao, Min Yang, Zheming Yang, Minghui Qiu, Tao Wang, Zhongyu Wei, Yanhao Wang, Cen Chen • 2023
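
The two-phase training approach described in the abstract can be pictured as a schedule of which components receive gradient updates in each phase. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the group names (`vision_encoder`, `projection`, `llm`), the phase names, and the helper `requires_grad_map` are hypothetical. It also assumes the vision encoder stays frozen in both phases, which the abstract implies but does not state outright. The temporal modeling module is left out because the abstract does not say in which phase it is trained.

```python
from dataclasses import dataclass

# Hypothetical component groups; the abstract names a vision encoder,
# a projection module, and the LLM, but not these identifiers.
MODULES = ("vision_encoder", "projection", "llm")


@dataclass(frozen=True)
class PhaseConfig:
    name: str             # hypothetical label for the phase
    trainable: frozenset  # groups that receive gradient updates


# Phase 1: only the projection module is trained.
# Phase 2: the projection module and the LLM are trained jointly.
# Assumption: the vision encoder is frozen in both phases.
PHASES = (
    PhaseConfig("phase1_projection_only", frozenset({"projection"})),
    PhaseConfig("phase2_joint", frozenset({"projection", "llm"})),
)


def requires_grad_map(phase: PhaseConfig) -> dict:
    """Return, for each component group, whether it is updated in this phase."""
    return {module: module in phase.trainable for module in MODULES}


if __name__ == "__main__":
    for phase in PHASES:
        print(phase.name, requires_grad_map(phase))
```

In a real training script, a map like this would typically drive per-parameter freezing (for example, setting a `requires_grad` flag on each group's parameters) before the optimizer for that phase is built.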

Related benchmarks

Task	Dataset	Result
Video Question Answering	MSRVTT-QA	Accuracy 51.1	513
Video Question Answering	ActivityNet-QA	Accuracy 45.1	438
Video Question Answering	MSVD-QA	Accuracy 65.4	401
Video Question Answering	MSVD	Accuracy 69.2	169
Highlight Detection	QVHighlights (test)	HIT@1 15.2	167
Temporal Video Understanding	TempCompass	--	160
Video Captioning	MSVD	CIDEr 44.3	157
Temporal Video Grounding	Charades-STA (test)	Recall@IoU=0.5 1.8	139
Video Question Answering	MSRVTT	Accuracy 50.8	117
Video-based generative performance	Video-ChatGPT benchmark	Correctness Score 2.43	76

