
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

About

Existing MLLMs face significant challenges in modeling the temporal context of long videos. Current mainstream agent-based methods use external tools to help a single MLLM answer long-video questions. Even with such tool support, a solitary MLLM still achieves only a partial understanding of long videos, resulting in limited performance. To better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents for long video understanding. Our method consists of four key steps: 1) Selection: we pre-select appropriate agents from the model library to form optimal agent teams for different tasks. 2) Perception: we design an effective retrieval scheme for long videos that improves coverage of critical temporal segments while maintaining computational efficiency. 3) Action: agents answer long-video questions and exchange their reasoning. 4) Reflection: we evaluate each agent's performance in every round of discussion and optimize the agent team for dynamic collaboration. Through this multi-round dynamic collaboration, the agents iteratively refine their answers. LVAgent is the first agent-based method to outperform all closed-source models (e.g., GPT-4o) and open-source models (e.g., InternVL-2.5 and Qwen2-VL) on long video understanding tasks. LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks. Notably, it improves accuracy by 13.3% on LongVideoBench. Code is available at https://github.com/64327069/LVAgent.
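Below is a minimal Python sketch of how the four steps could fit together as a loop. All names here (Agent, select_team, retrieve_frames, reflect, lvagent) and the uniform-sampling and majority-vote placeholders are illustrative assumptions, not the authors' API; the real retrieval scheme and team-optimization logic are in the repository linked above.

```python
# Illustrative sketch of LVAgent's four-step pipeline
# (Selection / Perception / Action / Reflection).
# Names and placeholder logic are assumptions for exposition;
# see https://github.com/64327069/LVAgent for the real implementation.

from collections import Counter
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    score: float = 0.0          # running performance estimate across rounds

    def answer(self, frames, question, context=()):
        """Return (answer, reasoning); backed by an MLLM in practice."""
        raise NotImplementedError

def select_team(library, k=3):
    """1) Selection: pick the k currently best-scoring agents."""
    return sorted(library, key=lambda a: a.score, reverse=True)[:k]

def retrieve_frames(video, question, budget=64):
    """2) Perception: keep only question-relevant segments.
    Placeholder: uniform sampling; the paper's retrieval scheme instead
    improves coverage of critical temporal segments."""
    step = max(1, len(video) // budget)
    return video[::step][:budget]

def reflect(team, votes, library):
    """4) Reflection: reward agreement with the majority answer and
    swap persistently weak agents for fresh ones from the library."""
    majority, _ = Counter(ans for _, ans, _ in votes).most_common(1)[0]
    for agent, ans, _ in votes:
        agent.score += 1.0 if ans == majority else -1.0
    keep = [a for a in team if a.score >= 0.0]
    pool = [a for a in library if a not in keep]
    return keep + select_team(pool, k=len(team) - len(keep))

def lvagent(video, question, library, rounds=3):
    team = select_team(library)
    frames = retrieve_frames(video, question)
    context = []                         # reasoning shared across rounds
    for _ in range(rounds):
        # 3) Action: every agent answers and exposes its reasoning.
        votes = [(a, *a.answer(frames, question, context)) for a in team]
        context += [reason for _, _, reason in votes]
        answers = [ans for _, ans, _ in votes]
        if len(set(answers)) == 1:       # consensus reached, stop early
            return answers[0]
        team = reflect(team, votes, library)
    return Counter(answers).most_common(1)[0][0]
```

In this sketch the loop terminates early once all agents agree, which mirrors the iterative refinement the abstract describes: disagreement triggers another round of shared reasoning and team re-optimization rather than an immediate vote.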

Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, Yali Wang • 2025

Related benchmarks

Task                      | Dataset                | Result                                  | Rank
--------------------------|------------------------|-----------------------------------------|-----
Long Video Understanding  | LongVideoBench (val)   | Accuracy: 80                            | 210
Video Question Answering  | LongVideoBench         | Accuracy: 66.9                          | 180
Video Question Answering  | EgoSchema              | Accuracy: 78.4                          | 161
Video Question Answering  | MLVU                   | Accuracy: 50                            | 143
Video Question Answering  | NextQA                 | Accuracy: 83                            | 78
Long Video Understanding  | MLVU (test)            | --                                      | 60
Video Question Answering  | Video-MME              | Accuracy (Average, w/o Subtitle): 79.5  | 48
Long Video Understanding  | Video-MME Long         | Accuracy: 74.3                          | 46
Long Video Understanding  | Video-MME Overall      | Accuracy: 81.7                          | 39
Video Reasoning           | SAGE-Bench 1.0 (test)  | Overall Score: 49.7                     | 29
