Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents

About

Long videos, characterized by temporal complexity and sparse task-relevant information, pose significant reasoning challenges for AI systems. Although existing Large Language Model (LLM)-based approaches have advanced long video understanding, they remain bottlenecked by task-agnostic, fixed-granularity perception pipelines and suffer from vision-language hallucinations. Inspired by human adaptive perception and active verification, we propose CogniGPT, a framework leveraging an interactive loop between a Multi-Granular Perception Agent (MPA) and an Active Verification Agent (AVA). Specifically, instead of predetermined heuristics, MPA adaptively determines the optimal perception granularity and strategy based on the evolving context, while AVA actively mines multi-perspective visual evidence to cross-verify key observations and eliminate hallucinations. This interaction allows CogniGPT to efficiently identify a minimal set of reliable task-related clues. Extensive experiments on EgoSchema, Video-MME, NExT-QA, and MovieChat demonstrate its superiority in accuracy and efficiency. Notably, on EgoSchema, it surpasses existing training-free methods using only 11.2 frames and achieves performance comparable to Gemini 1.5-Pro.
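The perceive-then-verify interaction described above can be illustrated with a toy loop. This is a minimal sketch under strong assumptions, not the authors' implementation: the names `Clue`, `perceive`, `verify`, and `answer_loop` are hypothetical, a video is modeled as a list of per-frame text captions, and simple keyword matching stands in for the vision-language models that the MPA and AVA would actually call. It shows only the control flow: start with coarse sampling, keep an observation only when a neighboring frame corroborates it, and refine the granularity when no reliable clue has been found.

```python
from dataclasses import dataclass


@dataclass
class Clue:
    frame: int
    text: str


def perceive(video, context, stride):
    """Hypothetical perception step: sample every `stride`-th unseen frame."""
    return [Clue(i, video[i])
            for i in range(0, len(video), stride)
            if i not in context["seen"]]


def verify(video, clue, keyword):
    """Hypothetical verification step: accept a clue only if at least one
    neighboring frame also supports it (a stand-in for cross-checking
    an observation against additional visual evidence)."""
    neighbours = [j for j in (clue.frame - 1, clue.frame + 1)
                  if 0 <= j < len(video)]
    return keyword in clue.text and any(keyword in video[j] for j in neighbours)


def answer_loop(video, keyword, max_rounds=4):
    """Alternate perception and verification, refining granularity as needed.

    Returns the verified clues and the number of frames inspected."""
    context = {"seen": set(), "clues": []}
    stride = max(1, len(video) // 2)  # start with coarse sampling
    for _ in range(max_rounds):
        for clue in perceive(video, context, stride):
            context["seen"].add(clue.frame)
            if verify(video, clue, keyword):
                context["clues"].append(clue)
        if context["clues"] or stride == 1:
            break  # found reliable evidence, or nothing finer to try
        stride = max(1, stride // 2)  # halve the stride for finer sampling
    return context["clues"], len(context["seen"])
```

With a toy caption list such as `["walk", "walk", "cook", "cook pasta", "cook pasta", "eat", "eat", "wash"]` and the keyword `"pasta"`, the loop stops after inspecting two frames, because the coarse pass already yields a corroborated clue. An isolated, uncorroborated mention is rejected, and the loop falls back to denser sampling before giving up.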

Jiahua Li, Zhanhe Zhang, Chenghao Xu, Zhe Xu, Kun Wei, Xu Yang, Cheng Deng • 2025

Related benchmarks

Task	Dataset	Result
Video Question Answering	Video-MME Long	Accuracy 54.7	71
Video Question Answering	EgoSchema (official)	Accuracy 74	16
Needle-in-a-Haystack	Haystack-LVBench	Precision 2.8	4
Video Question Answering	MovieChat (test)	Accuracy 95.3	4
