
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

About

Existing MLLMs encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream agent-based methods use external tools to assist a single MLLM in answering long-video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understanding of long videos, resulting in limited performance. To better address long-video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents for long video understanding. Our method consists of four key steps: 1) Selection: We pre-select appropriate agents from the model library to form optimal agent teams based on different tasks. 2) Perception: We design an effective retrieval scheme for long videos that improves coverage of critical temporal segments while maintaining computational efficiency. 3) Action: Agents answer long-video questions and exchange their reasoning. 4) Reflection: We evaluate each agent's performance in each round of discussion and optimize the agent team for dynamic collaboration. The agents iteratively refine their answers through this multi-round dynamic collaboration. LVAgent is the first agent-based method that outperforms all closed-source models (such as GPT-4o) and open-source models (such as InternVL-2.5 and Qwen2-VL) on long video understanding tasks. LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks. Notably, LVAgent improves accuracy by 13.3% on LongVideoBench. Code is available at https://github.com/64327069/LVAgent.
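The four steps above can be sketched as a simple control loop. This is a minimal illustrative sketch based only on the abstract: all names (`Agent`, `select_agents`, `retrieve_segments`, `lv_agent`) and the majority-vote reflection heuristic are assumptions, not the authors' actual implementation, and the agent's `answer` method is a placeholder where a real MLLM call would go.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    score: float = 0.0  # running performance estimate, updated in Reflection

    def answer(self, segments, question, peer_reasons):
        # Placeholder for a real MLLM call; returns (answer, reasoning).
        return f"answer-{len(segments)}", f"{self.name}: saw {len(segments)} segments"

def select_agents(library, task, k=3):
    # 1) Selection: pick the top-k agents from the model library for this task.
    return sorted(library, key=lambda a: a.score, reverse=True)[:k]

def retrieve_segments(video, question):
    # 2) Perception: keep temporal segments likely relevant to the question
    # (here a naive keyword match; fall back to the first segments).
    hits = [seg for seg in video if question.lower() in seg.lower()]
    return hits or video[:2]

def lv_agent(library, video, question, rounds=3):
    team = select_agents(library, task=question)
    segments = retrieve_segments(video, question)
    reasons, answer = [], None
    for _ in range(rounds):
        # 3) Action: each agent answers and shares its reasoning with peers.
        results = [a.answer(segments, question, reasons) for a in team]
        answers = [ans for ans, _ in results]
        reasons = [reason for _, reason in results]
        # 4) Reflection: score agents by agreement with the majority answer,
        # then drop the weakest agent to refine the team for the next round.
        answer = max(set(answers), key=answers.count)
        for agent, ans in zip(team, answers):
            agent.score += 1.0 if ans == answer else -1.0
        if len(team) > 1:
            team = sorted(team, key=lambda a: a.score, reverse=True)[:-1]
    return answer

library = [Agent("InternVL-2.5"), Agent("Qwen2-VL"), Agent("LLaVA"), Agent("GPT-4o")]
video = ["intro scene", "a cat jumps", "credits"]
print(lv_agent(library, video, "cat"))
```

The team shrinks by one agent per round, so later rounds are dominated by agents whose answers matched the emerging consensus; the real system presumably uses the MLLMs' exchanged reasoning rather than a keyword match and vote count.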

Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, Yali Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Long Video Understanding | LongVideoBench (val) | Accuracy: 80 | 139 |
| Long Video Understanding | MLVU (test) | -- | 41 |
| Long Video Understanding | Video-MME (Overall) | Accuracy: 81.7 | 39 |
| Long Video Understanding | Video-MME (Long) | Accuracy: 74.3 | 37 |
| Video Reasoning | SAGE-Bench 1.0 (test) | Overall Score: 49.7 | 29 |
