
Valley: Video Assistant with Large Language model Enhanced abilitY

About

Large Language Models (LLMs), with remarkable conversational capability, have emerged as AI assistants that can handle both visual and textual modalities. However, their effectiveness in joint video and language understanding has not been extensively explored. In this paper, we introduce Valley, a multi-modal foundation model that is designed to enable enhanced video comprehension and instruction-following capabilities. To this end, we construct two datasets, namely Valley-702k and Valley-instruct-73k, to cover a diverse range of video-text alignment and video-based instruction tasks, such as multi-shot captions, long video descriptions, action recognition, and causal inference. Then, we adopt ViT-L/14 as the vision encoder and explore three different temporal modeling modules to learn multifaceted features for enhanced video understanding. In addition, we implement a two-phase training approach for Valley: the first phase focuses solely on training the projection module to facilitate the LLM's capacity to understand visual input, and the second phase jointly trains the projection module and the LLM to improve their instruction-following ability. Extensive experiments demonstrate that Valley has the potential to serve as an effective video assistant, simplifying complex video-understanding scenarios. Our code and data are published anonymously at https://github.com/valley-vl/Valley.
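
The two-phase training schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names (`Module`, `ValleyLikeModel`, `configure_phase`) are hypothetical, and keeping the vision encoder frozen in both phases is an assumption, since the abstract only states which parts are trained. The temporal modeling module is left out because the abstract does not say in which phase it is trained.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Stand-in for a model component; only tracks whether it is trainable."""
    name: str
    trainable: bool = False


@dataclass
class ValleyLikeModel:
    """Hypothetical three-part layout: vision encoder, projection, LLM."""
    vision_encoder: Module = field(default_factory=lambda: Module("vision_encoder"))
    projection: Module = field(default_factory=lambda: Module("projection"))
    llm: Module = field(default_factory=lambda: Module("llm"))

    def modules(self):
        return [self.vision_encoder, self.projection, self.llm]


def configure_phase(model: ValleyLikeModel, phase: int) -> list:
    """Set trainable flags for a training phase and return the trainable names.

    Phase 1: only the projection module is trained.
    Phase 2: the projection module and the LLM are trained jointly.
    """
    if phase not in (1, 2):
        raise ValueError("phase must be 1 or 2")
    # Assumption: the vision encoder stays frozen in both phases.
    model.vision_encoder.trainable = False
    model.projection.trainable = True
    model.llm.trainable = phase == 2
    return [m.name for m in model.modules() if m.trainable]


if __name__ == "__main__":
    model = ValleyLikeModel()
    for phase in (1, 2):
        print(phase, configure_phase(model, phase))
```

In a real training setup, the same flags would typically be applied by enabling or disabling gradient updates on each component's parameters before building the optimizer.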

Ruipu Luo, Ziwang Zhao, Min Yang, Zheming Yang, Minghui Qiu, Tao Wang, Zhongyu Wei, Yanhao Wang, Cen Chen • 2023

Related benchmarks

Task | Dataset | Result | Rank
Video Question Answering | MSRVTT-QA | Accuracy 51.1 | 505
Video Question Answering | ActivityNet-QA | Accuracy 45.1 | 418
Video Question Answering | MSVD-QA | Accuracy 65.4 | 393
Highlight Detection | QVHighlights (test) | HIT@1 15.2 | 167
Video Captioning | MSVD | CIDEr 44.3 | 157
Video Question Answering | MSVD | Accuracy 69.2 | 152
Temporal Video Understanding | TempCompass | -- | 141
Temporal Video Grounding | Charades-STA (test) | Recall@IoU=0.5 1.8 | 124
Video Question Answering | MSRVTT | Accuracy 50.8 | 100
Video-based generative performance | Video-ChatGPT benchmark | Correctness Score 2.43 | 76
Showing 10 of 29 rows
