
VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning

About

Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in text-based reasoning with large language models, multi-modal reasoning, especially for videos, remains limited. In this work, we fill this gap by introducing VideoMind, a novel video-language agent for temporal-grounded video reasoning. Our method involves two key innovations: (1) we identify four essential capabilities for grounded video reasoning and propose a role-based agentic workflow, comprising a planner to coordinate roles, a grounder for temporal event localization, a verifier to assess event candidates, and an answerer for question answering; (2) to efficiently integrate these roles during inference, we propose a novel Chain-of-LoRA mechanism, in which a unified base model with multiple LoRA adapters enables seamless role switching, balancing efficiency and flexibility. Extensive experiments on 15 benchmarks across Grounded VideoQA, Video Temporal Grounding, and General VideoQA tasks demonstrate the effectiveness of the proposed scheme in advancing video agents, test-time scaling, and long-form video reasoning. Code, models, datasets, and demos are available at https://videomind.github.io/.

Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou • 2025
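
The role-based workflow and adapter switching described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: `ToyBaseModel`, `set_adapter`, the keyword rule in the planner, the fixed candidate spans, and the "shortest span wins" verifier are all hypothetical stand-ins chosen so the example runs without any model weights. The only idea taken from the abstract is that one shared base model activates one role-specific adapter at a time.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


@dataclass
class ToyBaseModel:
    """Hypothetical stand-in for a shared backbone with one active adapter."""
    adapters: Dict[str, Callable[..., object]]
    active: str = ""
    switches: List[str] = field(default_factory=list)  # log of role switches

    def set_adapter(self, role: str) -> None:
        if role not in self.adapters:
            raise KeyError(f"unknown role: {role}")
        self.active = role
        self.switches.append(role)

    def run(self, *args):
        return self.adapters[self.active](*args)


# Toy role behaviours (assumptions for illustration only).
def plan(question: str) -> List[str]:
    # Toy rule: "when" questions need grounding before answering.
    if "when" in question.lower():
        return ["grounder", "verifier", "answerer"]
    return ["answerer"]


def ground(question: str, duration: float) -> List[Span]:
    # Toy candidates: the whole video plus two sub-spans.
    return [(0.0, duration), (duration / 4, duration / 2), (duration / 2, duration)]


def verify(question: str, span: Span) -> float:
    # Toy score: shorter candidate spans score higher.
    return -(span[1] - span[0])


def answer(question: str, span: Span) -> str:
    return f"answer based on {span[0]:.1f}s-{span[1]:.1f}s"


def build_model() -> ToyBaseModel:
    return ToyBaseModel(adapters={
        "planner": plan, "grounder": ground,
        "verifier": verify, "answerer": answer,
    })


def answer_question(model: ToyBaseModel, question: str, duration: float):
    """Chain the roles by switching adapters on the same base model."""
    model.set_adapter("planner")
    roles = model.run(question)
    span: Span = (0.0, duration)  # default: use the whole video
    candidates: List[Span] = []
    for role in roles:
        model.set_adapter(role)
        if role == "grounder":
            candidates = model.run(question, duration)
        elif role == "verifier":
            span = max(candidates, key=lambda s: model.run(question, s))
        elif role == "answerer":
            return model.run(question, span), span
    raise ValueError("plan did not end with the answerer")
```

Calling `answer_question(build_model(), "When does the person open the door?", 40.0)` walks through planner, grounder, verifier, and answerer in turn, while a question without "when" goes from the planner straight to the answerer over the full video.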

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Grounding | Charades-STA | R@1 IoU=0.5 | 59.1 | 113 |
| Video Grounding | QVHighlights (test) | mAP (IoU=0.5) | 74.11 | 64 |
| Video Question Answering | LVBench | Accuracy | 40.8 | 50 |
| Grounded Video Question Answering | CG-Bench | mIoU | 7.1 | 31 |
| Video Reasoning | SAGE-Bench 1.0 (test) | Overall Score | 50 | 29 |
| Grounded Video Question Answering | NExT-GQA | mIoU | 31.4 | 28 |
| Video Question Answering | MVBench 1.0 (test) | AS Score | 77 | 25 |
| Video Question Answering | MLVU | M-Avg Score | 64.4 | 16 |
| Grounded Video Question Answering | ReXTime | mIoU | 27.61 | 14 |
| Video Question Answering | MLVU long-form | Accuracy | 64.4 | 14 |
Showing 10 of 18 rows
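
Several of the metrics listed above (R@1 at IoU=0.5, mIoU) are built on temporal intersection-over-union between a predicted and a ground-truth time span. The following is a minimal sketch of the commonly used definitions, assuming one prediction and one ground-truth span per query; the function names are illustrative, and individual benchmarks may apply their own protocol details (for example multiple ground-truth spans, or averaging for mAP).

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: Sequence[Span], gts: Sequence[Span],
                threshold: float = 0.5) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds: Sequence[Span], gts: Sequence[Span]) -> float:
    """Average temporal IoU over all queries, as a percentage."""
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, giving an IoU of 1/3, which would not count as a hit at the 0.5 threshold.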
