VideoAgent: Long-form Video Understanding with Large Language Model as Agent

About

Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.

Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy • 2024
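
The iterative process described in the abstract can be sketched roughly as below. This is a toy illustration, not the authors' implementation. The function names (`run_agent`, `caption_frame`, `answer_with_confidence`, `retrieve_frames`), the confidence-based stopping rule, and all default values are assumptions made for the example. In a real system the three callables would wrap a vision-language captioner, a large language model, and a frame-retrieval model.

```python
from typing import Callable, Dict, List, Tuple

Captions = Dict[int, str]  # frame index -> text description of that frame


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Return up to k frame indices spread evenly over the video."""
    if num_frames <= 0:
        return []
    if k <= 1 or num_frames == 1:
        return [0]
    return sorted({i * (num_frames - 1) // (k - 1) for i in range(k)})


def run_agent(
    question: str,
    num_frames: int,
    caption_frame: Callable[[int], str],
    answer_with_confidence: Callable[[str, Captions], Tuple[str, float]],
    retrieve_frames: Callable[[str, Captions], List[int]],
    initial_frames: int = 5,   # assumed value
    max_rounds: int = 3,       # assumed value
    threshold: float = 0.8,    # assumed value
) -> Tuple[str, List[int]]:
    """Answer a question by inspecting only as many frames as needed."""
    # Start from a sparse, uniform view of the video.
    captions: Captions = {
        idx: caption_frame(idx) for idx in uniform_sample(num_frames, initial_frames)
    }
    answer = ""
    for round_idx in range(max_rounds):
        # The language model answers from the text gathered so far.
        answer, confidence = answer_with_confidence(question, captions)
        if confidence >= threshold or round_idx == max_rounds - 1:
            break
        # Otherwise ask the retrieval tool for frames not yet seen.
        new = [i for i in retrieve_frames(question, captions) if i not in captions]
        if not new:
            break
        for idx in new:
            captions[idx] = caption_frame(idx)
    return answer, sorted(captions)


# Toy stand-ins for the real models (hypothetical, for demonstration only).
def toy_caption(idx: int) -> str:
    return f"caption of frame {idx}"


def toy_answer(question: str, captions: Captions) -> Tuple[str, float]:
    # Pretend the decisive evidence is in frame 60.
    return ("B", 0.9) if 60 in captions else ("A", 0.3)


def toy_retrieve(question: str, captions: Captions) -> List[int]:
    return [60]


answer, frames_used = run_agent(
    "What happens after the door opens?", 100, toy_caption, toy_answer, toy_retrieve
)
print(answer, frames_used)
```

In this toy run the agent stops after six frames out of 100, which mirrors the abstract's point that only a handful of frames are used on average.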

Related benchmarks

Task | Dataset | Result | Rank
--- | --- | --- | ---
Video Question Answering | NExT-QA (test) | Accuracy 71.3 | 204
Video Question Answering | EgoSchema (Full) | Accuracy 54.1 | 193
Video Question Answering | NExT-QA (val) | Overall Acc 71.3 | 176
Video Question Answering | NEXT-QA | -- | 105
Video Question Answering | NExT-QA Multi-choice | Accuracy 71.3 | 102
Video Question Answering | EgoSchema | Accuracy 60.2 | 88
Video Question Answering | EgoSchema (test) | Accuracy 60.2 | 80
Video Question Answering | EgoSchema subset | Accuracy 63.8 | 73
Long Video Understanding | LVBench | Accuracy 29.3 | 63
Multiple-choice Video Question Answering | EgoSchema | Accuracy 60.2 | 61
Showing 10 of 40 rows
