Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents

About

Long videos, characterized by temporal complexity and sparse task-relevant information, pose significant reasoning challenges for AI systems. Although existing Large Language Model (LLM)-based approaches have advanced long video understanding, they remain bottlenecked by task-agnostic, fixed-granularity perception pipelines and suffer from vision-language hallucinations. Inspired by human adaptive perception and active verification, we propose CogniGPT, a framework leveraging an interactive loop between a Multi-Granular Perception Agent (MPA) and an Active Verification Agent (AVA). Specifically, instead of predetermined heuristics, MPA adaptively determines the optimal perception granularity and strategy based on the evolving context, while AVA actively mines multi-perspective visual evidence to cross-verify key observations and eliminate hallucinations. This interaction allows CogniGPT to efficiently identify a minimal set of reliable task-related clues. Extensive experiments on EgoSchema, Video-MME, NExT-QA, and MovieChat demonstrate its superiority in accuracy and efficiency. Notably, on EgoSchema, it surpasses existing training-free methods using only 11.2 frames and achieves performance comparable to Gemini 1.5-Pro.

Jiahua Li, Zhanhe Zhang, Chenghao Xu, Zhe Xu, Kun Wei, Xu Yang, Cheng Deng • 2025
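
The perceive-then-verify loop described in the abstract can be illustrated with a toy sketch. This is not the CogniGPT implementation: the function names (`perceive`, `verify`, `find_clues`), the stride schedule, and the neighbor-agreement check are all hypothetical, and plain string matching on per-frame captions stands in for the vision-language models a real system would call. It only shows the general shape of the idea: start with coarse sampling, reject observations that no other evidence supports, and refine granularity only when needed.

```python
from dataclasses import dataclass


@dataclass
class Clue:
    frame: int  # index of the frame where the observation was made
    text: str   # the observation itself (a caption, in this toy setup)


def perceive(video, query, stride):
    """Hypothetical perception step: inspect every `stride`-th frame
    and keep the ones whose caption mentions the query."""
    return [Clue(i, video[i]) for i in range(0, len(video), stride)
            if query in video[i]]


def verify(video, clue, window=1):
    """Hypothetical verification step: accept a clue only if at least
    one neighboring frame reports the same observation."""
    lo = max(0, clue.frame - window)
    hi = min(len(video), clue.frame + window + 1)
    return any(video[j] == clue.text for j in range(lo, hi) if j != clue.frame)


def find_clues(video, query, strides=(8, 4, 2, 1)):
    """Coarse-to-fine loop: move to a finer stride only when the
    current one yields no verified clue (assumed schedule)."""
    for stride in strides:
        verified = [c for c in perceive(video, query, stride) if verify(video, c)]
        if verified:
            return stride, verified
    return None, []


# Toy "video": one caption per frame.
video = ["walk"] * 16
video[5] = video[6] = "pour water"  # observation supported by two adjacent frames
video[8] = "pour milk"              # isolated observation, treated as unreliable

stride, clues = find_clues(video, "pour")
print(stride, [(c.frame, c.text) for c in clues])
```

In this toy run, the isolated caption at frame 8 is found at the coarsest stride but rejected because no neighbor agrees with it, so the loop keeps refining until it reaches a frame whose observation is corroborated.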

Related benchmarks

Task                      Dataset               Result          Rank
Video Question Answering  Video-MME Long        Accuracy 54.7   41
Video Question Answering  EgoSchema (official)  Accuracy 74     16
Needle-in-a-Haystack      Haystack-LVBench      Precision 2.8   4
Video Question Answering  MovieChat (test)      Accuracy 95.3   4