VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

About

We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism can tackle the challenging problem of video understanding, especially capturing long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent, VideoAgent, 1) constructs a structured memory that stores both generic temporal event descriptions and object-centric tracking states of the video; and 2) given an input task query, employs tools including video segment localization and object memory querying, along with other visual foundation models, to solve the task interactively, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates strong performance on several long-horizon video understanding benchmarks, with average gains of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-source models and private counterparts including Gemini 1.5 Pro.

Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li • 2024
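
The two-part memory described in the abstract (temporal event descriptions plus object-centric tracking states, queried through tools) can be pictured with a minimal sketch. This is not the authors' implementation: the class names, fields, and the word-overlap ranking below are all hypothetical stand-ins chosen for illustration, and the LLM tool-use loop itself is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """Hypothetical temporal-memory entry: a captioned span of the video."""
    start_s: float
    end_s: float
    caption: str


@dataclass
class ObjectTrack:
    """Hypothetical object-memory entry: a tracked object and the segments it appears in."""
    object_id: int
    category: str
    segment_ids: List[int] = field(default_factory=list)


@dataclass
class VideoMemory:
    """Toy structured memory holding both kinds of entries."""
    segments: List[Segment] = field(default_factory=list)
    objects: List[ObjectTrack] = field(default_factory=list)

    def localize_segments(self, query: str, top_k: int = 1) -> List[int]:
        """Tool 1 (assumed): rank segments by word overlap with the query.

        Word overlap is a toy stand-in for whatever similarity a real
        system would use.
        """
        query_words = set(query.lower().split())
        scores = [
            (len(query_words & set(seg.caption.lower().split())), idx)
            for idx, seg in enumerate(self.segments)
        ]
        # Highest overlap first; ties go to the earlier segment.
        scores.sort(key=lambda pair: (-pair[0], pair[1]))
        return [idx for _, idx in scores[:top_k]]

    def query_objects(self, category: str) -> List[ObjectTrack]:
        """Tool 2 (assumed): look up tracked objects by category."""
        return [obj for obj in self.objects if obj.category == category]


memory = VideoMemory(
    segments=[
        Segment(0.0, 2.0, "a person opens the fridge"),
        Segment(2.0, 4.0, "a person pours milk into a cup"),
    ],
    objects=[
        ObjectTrack(0, "cup", [1]),
        ObjectTrack(1, "fridge", [0]),
    ],
)
best = memory.localize_segments("who pours the milk")
cups = memory.query_objects("cup")
```

In an agent of this kind, an LLM would decide which of these tools to call for a given question and combine the returned segments and object records into an answer; that orchestration is deliberately omitted here.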

Related benchmarks

Task	Dataset	Result
Video Understanding	VideoMME	Score (Overall): 56	357
Video Question Answering	NExT-QA (test)	--	204
Video Question Answering	MLVU	Accuracy: 34.3	194
Video Question Answering	EgoSchema	Accuracy: 60.2	161
Video Question Answering	EgoSchema (test)	Accuracy: 62.8	90
Video Question Answering	NextQA	Accuracy: 66.1	78
Video Understanding	LVBench	--	75
Video Question Answering	EgoSchema 500-question subset	Accuracy: 62.8	50
Long Video Understanding	LVBench (test)	LVBench Score: 29.3	43
Video Reasoning	SAGE-Bench 1.0 (test)	Overall Score: 42	29

Showing 10 of 29 rows
