VideoMultiAgents: A Multi-Agent Framework for Video Question Answering

About

Video Question Answering (VQA) inherently relies on multimodal reasoning, integrating visual, temporal, and linguistic cues to achieve a deeper understanding of video content. However, many existing methods feed frame-level captions into a single model, which makes it difficult to capture temporal and interactive contexts adequately. To address this limitation, we introduce VideoMultiAgents, a framework that integrates specialized agents for vision, scene graph analysis, and text processing. It enhances video understanding by leveraging complementary multimodal reasoning from independently operating agents. Our approach is also supplemented with question-guided caption generation, which produces captions that highlight the objects, actions, and temporal transitions directly relevant to a given query, thus improving answer accuracy. Experimental results demonstrate that our method achieves state-of-the-art performance on Intent-QA (79.0%, +6.2% over previous SOTA), EgoSchema subset (75.4%, +3.4%), and NExT-QA (79.6%, +0.4%). The source code is available at https://github.com/PanasonicConnect/VideoMultiAgents.

Noriyuki Kugo, Xiang Li, Zixin Li, Ashish Gupta, Arpandeep Khatua, Nidhish Jain, Chaitanya Patel, Yuta Kyuragi, Yasunori Ishii, Masamoto Tanabiki, Kazuki Kozuka, Ehsan Adeli • 2025
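
The abstract describes independently operating agents whose outputs are combined into one answer, but it does not say how they are combined. The following is a minimal sketch of that general pattern for a multiple-choice question, assuming a simple majority vote as the combination step. The vote, the `Question` type, the `Agent` signature, and the stub agents are all hypothetical illustrations and are not taken from the VideoMultiAgents code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Question:
    text: str
    options: List[str]  # multiple-choice candidates


# Hypothetical agent signature: takes a question, returns the index of its chosen option.
Agent = Callable[[Question], int]


def aggregate(question: Question, agents: Dict[str, Agent]) -> int:
    """Query every agent independently and return the majority answer.

    Majority voting is an assumption made for this sketch. Ties go to the
    tied answer that was voted for first, in agent order.
    """
    votes = [agent(question) for agent in agents.values()]
    counts = Counter(votes)
    best = max(counts.values())
    # First vote, in agent order, whose count equals the maximum.
    return next(v for v in votes if counts[v] == best)


# Stub agents standing in for real vision, scene-graph, and text models.
def vision_agent(q: Question) -> int:
    return 1


def scene_graph_agent(q: Question) -> int:
    return 1


def text_agent(q: Question) -> int:
    return 2


question = Question(
    text="What does the person do after opening the box?",
    options=["closes it", "takes out a tool", "walks away"],
)
agents = {
    "vision": vision_agent,
    "scene_graph": scene_graph_agent,
    "text": text_agent,
}
answer_index = aggregate(question, agents)
print(question.options[answer_index])
```

In this sketch each agent sees only the question and returns an option index. A real system would also pass video frames, scene graphs, or captions to the corresponding agent, and could replace the vote with a learned or LLM-based combiner.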

Related benchmarks

Task	Dataset	Metric	Result
Video Question Answering	EgoSchema (Full)	Accuracy	68	241
Video Question Answering	EgoSchema subset	Accuracy	75.4	124
Video Question Answering	NExT-QA Main Dataset	Accuracy	0.796	48
Long-form Video Question Answering	EgoSchema	Accuracy	68	24
Video Question Answering	IntentQA (test)	Top-1 Accuracy	79	22
Video Question Answering	NextQA (val)	Accuracy	79.6	11

