
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

About

We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism can tackle the challenging problem of video understanding, especially capturing long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, employs tools including video segment localization and object memory querying, along with other visual foundation models, to interactively solve the task, leveraging the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, with an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-source models and private counterparts such as Gemini 1.5 Pro.
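To make the two-part memory concrete, below is a minimal Python sketch of a structured memory holding temporal event descriptions and object-centric tracking states, with two query tools an LLM could invoke. All names here (`UnifiedMemory`, `ObjectState`, `localize_segment`, `query_objects`) are hypothetical, and the naive word-overlap matcher stands in for the learned localization and tracking models; this illustrates the interface the description implies, not the authors' implementation.

```python
# Hypothetical sketch of VideoAgent-style unified memory: temporal event
# captions plus object-centric tracking states, exposed as query tools.
from dataclasses import dataclass, field


@dataclass
class ObjectState:
    """Object-centric tracking state: one object's appearances over time."""
    object_id: int
    category: str
    segments: list[tuple[float, float]] = field(default_factory=list)


@dataclass
class UnifiedMemory:
    """Structured memory: (start, end, caption) events and tracked objects."""
    events: list[tuple[float, float, str]] = field(default_factory=list)
    objects: dict[int, ObjectState] = field(default_factory=dict)

    def localize_segment(self, query: str) -> tuple[float, float]:
        """Tool: return the event segment whose caption best matches the query.
        A naive word-overlap score stands in for a learned retriever."""
        def overlap(caption: str) -> int:
            return len(set(query.lower().split()) & set(caption.lower().split()))
        start, end, _ = max(self.events, key=lambda e: overlap(e[2]))
        return start, end

    def query_objects(self, category: str) -> list[ObjectState]:
        """Tool: fetch tracking states for all objects of a given category."""
        return [s for s in self.objects.values() if s.category == category]


if __name__ == "__main__":
    mem = UnifiedMemory(
        events=[(0.0, 10.0, "a person opens the fridge"),
                (10.0, 25.0, "the person pours milk into a cup")],
        objects={0: ObjectState(0, "cup", [(10.0, 25.0)])},
    )
    print(mem.localize_segment("person pours milk into cup"))  # (10.0, 25.0)
    print(mem.query_objects("cup"))
```

In the full agent, an LLM would choose among these tools zero-shot and compose their outputs (e.g., localize a segment, then query the objects visible in it) before answering the task query.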

Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Question Answering | EgoSchema 500-question subset | Accuracy: 62.8 | 50 |
| Video Reasoning | SAGE-Bench 1.0 (test) | Overall Score: 42 | 29 |
| Long-form Egocentric Video Understanding | EgoSchema | Accuracy: 63.2 | 25 |
