VideoAgent: Long-form Video Understanding with Large Language Model as Agent
About
Long-form video understanding is a significant challenge in computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy, respectively, using only 8.4 and 8.2 frames on average. These results demonstrate the superior effectiveness and efficiency of our method over current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.
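The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three callables (`describe`, `retrieve`, `decide`), the uniform initial sampling, and the `initial_frames` and `max_rounds` parameters are all hypothetical names and choices introduced here. `describe` stands in for a vision-language model that translates a frame into text, `retrieve` for a model that finds frames matching a text query, and `decide` for the large language model acting as the agent.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool signatures (assumptions, not the authors' API).
Describe = Callable[[int], str]                   # frame index -> text description
Retrieve = Callable[[str, List[int]], List[int]]  # (query, seen frames) -> candidate frames
Decide = Callable[[str, Dict[int, str]], Tuple[Optional[str], str]]
# Decide returns (answer, follow_up_query); answer is None when the
# agent judges the collected information insufficient.


def uniform_indices(num_frames: int, k: int) -> List[int]:
    """Pick up to k roughly evenly spaced frame indices."""
    if num_frames <= 0:
        return []
    k = min(k, num_frames)
    if k == 1:
        return [0]
    step = (num_frames - 1) / (k - 1)
    return sorted({int(i * step) for i in range(k)})


def answer_question(
    question: str,
    num_frames: int,
    describe: Describe,
    retrieve: Retrieve,
    decide: Decide,
    initial_frames: int = 5,
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[int]]:
    """Iteratively gather frame descriptions until the agent commits to an answer."""
    # Start from a sparse view of the video instead of processing every frame.
    notes = {i: describe(i) for i in uniform_indices(num_frames, initial_frames)}
    answer: Optional[str] = None
    for _ in range(max_rounds):
        answer, query = decide(question, notes)
        if answer is not None:
            break
        # The agent asked for more evidence: fetch and describe new frames only.
        for i in retrieve(query, sorted(notes)):
            if 0 <= i < num_frames and i not in notes:
                notes[i] = describe(i)
    # answer stays None if the agent never committed within max_rounds.
    return answer, sorted(notes)
```

In this sketch the length of the returned frame list is the number of frames the agent actually inspected, which is the quantity the abstract reports as 8.4 and 8.2 on average. Passing the tools in as callables keeps the loop independent of any particular model or library.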
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Question Answering | NExT-QA (test) | Accuracy 71.3 | 204 |
| Video Question Answering | EgoSchema (Full) | Accuracy 54.1 | 193 |
| Video Question Answering | NExT-QA (val) | Overall Acc 71.3 | 176 |
| Video Question Answering | NEXT-QA | -- | 105 |
| Video Question Answering | NExT-QA Multi-choice | Accuracy 71.3 | 102 |
| Video Question Answering | EgoSchema | Accuracy 60.2 | 88 |
| Video Question Answering | EgoSchema (test) | Accuracy 60.2 | 80 |
| Video Question Answering | EgoSchema subset | Accuracy 63.8 | 73 |
| Long Video Understanding | LVBench | Accuracy 29.3 | 63 |
| Multiple-choice Video Question Answering | EgoSchema | Accuracy 60.2 | 61 |