VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning

About

Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in text-based reasoning with large language models, multi-modal reasoning, especially for videos, remains limited. In this work, we fill this gap by introducing VideoMind, a novel video-language agent for temporal-grounded video reasoning. Our method involves two key innovations: (1) We identify four essential capabilities for grounded video reasoning and propose a role-based agentic workflow, comprising a planner to coordinate roles, a grounder for temporal event localization, a verifier to assess event candidates, and an answerer for question answering. (2) To efficiently integrate these roles during inference, we propose a novel Chain-of-LoRA mechanism, where a unified base model with multiple LoRA adapters is leveraged to enable seamless role switching, balancing efficiency and flexibility. Extensive experiments on 15 benchmarks across Grounded VideoQA, Video Temporal Grounding, and General VideoQA tasks demonstrate the effectiveness of the proposed scheme in advancing video agents, test-time scaling, and long-form video reasoning. Code, models, datasets, and demos are available at https://videomind.github.io/.
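
To make the role-switching idea concrete, here is a minimal sketch of how a single shared model with one adapter per role might be chained. It is not the authors' implementation: `SharedBackbone`, `run_chain`, and the `toy_*` functions are hypothetical names, the adapters are plain Python callables rather than real LoRA weights, and the control flow (the planner returns a list of roles to run) is an assumption drawn only from the role descriptions in the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


@dataclass
class SharedBackbone:
    """Hypothetical stand-in for one base model holding several named adapters."""
    adapters: Dict[str, Callable] = field(default_factory=dict)
    switches: List[str] = field(default_factory=list)  # log of role activations

    def use(self, role: str) -> Callable:
        # Switching roles only selects a different adapter; the base is shared.
        if role not in self.adapters:
            raise KeyError(f"no adapter registered for role {role!r}")
        self.switches.append(role)
        return self.adapters[role]


def run_chain(model: SharedBackbone, question: str, duration: float) -> dict:
    """Assumed flow: planner -> grounder -> verifier -> answerer."""
    plan = model.use("planner")(question)
    span: Span = (0.0, duration)  # fall back to the whole video
    if "grounder" in plan:
        candidates = model.use("grounder")(question, duration)
        if candidates:
            best = 0
            if "verifier" in plan:
                scores = model.use("verifier")(question, candidates)
                best = max(range(len(candidates)), key=scores.__getitem__)
            span = candidates[best]
    answer = model.use("answerer")(question, span) if "answerer" in plan else None
    return {"plan": plan, "span": span, "answer": answer}


# Toy adapters (illustrative only; a real system would run the model here).
def toy_planner(question: str) -> List[str]:
    return ["grounder", "verifier", "answerer"]


def toy_grounder(question: str, duration: float) -> List[Span]:
    return [(0.0, duration * 0.25), (duration * 0.5, duration * 0.75)]


def toy_verifier(question: str, candidates: List[Span]) -> List[float]:
    # Arbitrary scoring rule for the demo: later candidates score higher.
    return [start for start, _ in candidates]


def toy_answerer(question: str, span: Span) -> str:
    return f"answer grounded in {span[0]:.1f}s-{span[1]:.1f}s"


model = SharedBackbone(adapters={
    "planner": toy_planner,
    "grounder": toy_grounder,
    "verifier": toy_verifier,
    "answerer": toy_answerer,
})
result = run_chain(model, "What happens after the door opens?", 40.0)
print(result["span"], result["answer"], model.switches)
```

The point of the sketch is the shape of the design: every role call goes through the same `SharedBackbone` object, so adding a role means registering another adapter rather than loading another model.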

Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Temporal Grounding	Charades-STA	mIoU	50.2	120
Video Grounding	Charades-STA	R@1 IoU=0.5	59.1	113
Video Question Answering	LVBench	Accuracy	40.8	108
Temporal Grounding	ActivityNet Captions	Recall@1 (IoU=0.5)	30.3	85
Video Question Answering	MLVU	M-Avg Score	64.4	80
Grounded Video Question Answering	NExT-GQA	mIoU	31.4	69
Video Grounding	QVHighlights (test)	mAP (IoU=0.5)	74.11	64
Video Temporal Grounding	ActivityNet Captions	Recall @ IoU=0.3	48.4	47
Long Video Question Answering	MLVU	M-Avg	64.4	46
Grounded Video Question Answering	NExT-GQA (test)	Acc@GQA	28.2	45

Showing 10 of 38 rows
